Text Processing in Python

January 5, 2025 | 4 min read

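Text processing is essential to natural language processing (NLP), machine learning, and data analysis. Before text data can be analyzed or fed into a model, it usually has to be cleaned and transformed. Python's broad library ecosystem provides powerful tools for a wide range of text processing tasks. This tutorial covers the core techniques and libraries for text processing in Python.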

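1. Basic String Operations

Python's built-in string methods form the foundation of basic text processing.

Splitting strings: use split() to break a string into a list of substrings. With no argument it splits on whitespace, so punctuation stays attached to the neighboring word.

Code

text = "Olivia, Greens"
words = text.split()
print(words)

Output

['Olivia,', 'Greens']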

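Joining strings: use join() to concatenate a sequence of strings into one string, with the separator placed between the items.

Code

words = ['Olivia,', 'Greens']
sentence = ' '.join(words)
print(sentence)

Output

Olivia, Greens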

Replacing substrings: call replace() to replace every occurrence of one substring with another.

Code

input_text = "Olivia, Greens"
new_text = input_text.replace("Greens", "Thakur")
print(new_text)

Output

Olivia, Thakur

Changing case: call upper(), lower(), capitalize(), or title() to change the case of a string.

Code

input_text = "Olivia, Greens"
print(input_text.upper())
print(input_text.lower())

Output

OLIVIA, GREENS
olivia, greens

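capitalize() and title() are named above but not exercised. A minimal sketch of both, using a lowercase variant of the same sample string (the variable names are illustrative):

```python
input_text = "olivia, greens"

# capitalize() upper-cases only the first character and lower-cases the rest
capitalized = input_text.capitalize()

# title() upper-cases the first letter of every word
titled = input_text.title()

print(capitalized)  # Olivia, greens
print(titled)       # Olivia, Greens
```

title() treats any run of letters as a word, so it is best suited to simple headings rather than text with apostrophes or acronyms.
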
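2. Regular Expressions

Regular expressions (regex) support complex pattern matching and text manipulation. Python's re module provides regular expression support.

Finding patterns: use re.findall() to return every non-overlapping match of a pattern. The pattern below matches words of exactly four characters.

Code

import re
input_text = "Ohio is a place in US"
matches = re.findall(r"\b\w{4}\b", input_text)
print(matches)

Output

['Ohio']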

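Replacing patterns: use re.sub() to replace every match of a pattern with a given string.

Code

import re
text = " Ohio is a place in US "
new_text = re.sub(r"Ohio", "Chicago", text)
print(repr(new_text))  # repr() keeps the leading and trailing spaces visible

Output

' Chicago is a place in US '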

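Beyond literal replacements, re.sub() is often used for cleanup. The following is a small sketch, not part of the original example, that collapses repeated whitespace and then performs a case-insensitive replacement; the sample string and variable names are assumptions made for illustration.

```python
import re

text = "  Ohio   is a place in   US "

# Collapse every run of whitespace into one space, then trim both ends
normalized = re.sub(r"\s+", " ", text).strip()

# re.IGNORECASE lets the lowercase pattern match "Ohio"
replaced = re.sub(r"ohio", "Chicago", normalized, flags=re.IGNORECASE)

print(normalized)  # Ohio is a place in US
print(replaced)    # Chicago is a place in US
```
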
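3. Tokenization

Tokenization is the process of breaking text (a sentence, a paragraph, or an entire document) into smaller parts, such as individual words or phrases. These smaller units are called tokens.

Word tokenization

Code

import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize
input_text1 = "Lake Tahoe is a beautiful lake in US"
words = word_tokenize(input_text1)  # split the sentence into word tokens
print(words)

Output

['Lake', 'Tahoe', 'is', 'a', 'beautiful', 'lake', 'in', 'US']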

Sentence tokenization

Code

from nltk.tokenize import sent_tokenize
input_text = "Lake Tahoe is a beautiful lake in US."
sentences = sent_tokenize(input_text)  # split the text into sentences
print(sentences)

Output

['Lake Tahoe is a beautiful lake in US.']
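To make the idea of tokens concrete without any downloads, here is a rough sketch built only on the re module. It is a simplified stand-in, not the algorithm NLTK uses, and the function names simple_word_tokenize and simple_sent_tokenize are hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Return word and punctuation tokens (a rough stand-in, not NLTK's algorithm)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = "Lake Tahoe is a beautiful lake. It is in the US!"
word_tokens = simple_word_tokenize(sample)
sentence_tokens = simple_sent_tokenize(sample)

print(word_tokens)
print(sentence_tokens)  # ['Lake Tahoe is a beautiful lake.', 'It is in the US!']
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained tokenizer for real text.
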

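4. Stemming and Lemmatization

Stemming and lemmatization both reduce words to a base or root form. Stemming strips suffixes by rule and can produce non-words such as "danc", while lemmatization looks words up in a vocabulary and returns dictionary forms. The nltk library provides tools for both.

Stemming

Code

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
input_words = ["Eating", "Scanning", "Dancing"]
stems = [stemmer.stem(input_word) for input_word in input_words]
print(stems)

Output

['eat', 'scan', 'danc']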

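Lemmatization

WordNetLemmatizer needs the WordNet corpus, and its lookup is case-sensitive, so the words are lowercased before they are lemmatized.

Code

import nltk
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
input_words = ["Eating", "Scanning", "Dancing"]
# pos='v' tells the lemmatizer to treat each word as a verb
lemmas = [lemmatizer.lemmatize(input_word.lower(), pos='v') for input_word in input_words]
print(lemmas)

Output

['eat', 'scan', 'dance']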

5. Removing Stop Words

Stop words are common words (for example, "and", "the", "is") that are often removed from text so that more meaningful words stand out. The nltk library includes a stop word list.

Code

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
eng_stop_words = set(stopwords.words('english'))
input_words = ['Lake', 'Tahoe', 'is', 'a', 'beautiful', 'lake', 'in', 'US']
# Compare in lowercase because the stop word list is lowercase
filtering_of_words = [input_word for input_word in input_words if input_word.lower() not in eng_stop_words]
print(filtering_of_words)

Output

['Lake', 'Tahoe', 'beautiful', 'lake', 'US']
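The effect of stop word removal is easiest to see in word counts. The sketch below uses a deliberately tiny, hypothetical stop list and a made-up token list instead of the NLTK corpus, so it runs without any downloads.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only
tiny_stop_words = {"is", "a", "in", "the", "and"}

tokens = ["the", "lake", "is", "a", "lake", "in", "the", "US",
          "and", "the", "lake", "is", "deep"]

# Counts over every token, lowercased
all_counts = Counter(token.lower() for token in tokens)

# Counts over the tokens that survive stop word filtering
content_counts = Counter(
    token.lower() for token in tokens if token.lower() not in tiny_stop_words
)

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```

In the raw counts "the" is as frequent as "lake"; after filtering, only content words remain in the ranking.
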

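6. Bag of Words and TF-IDF

These are methods for converting text into numerical representations.

Bag of Words (BoW): each text is represented by the counts of the words it contains. CountVectorizer lowercases the text and, by default, ignores single-character tokens, so "I" does not appear in the vocabulary.

Code

from sklearn.feature_extraction.text import CountVectorizer
input_texts_ex = ["I love US.", "US is beautiful!"]
vectorizer = CountVectorizer()
D = vectorizer.fit_transform(input_texts_ex)
print(D.toarray())
print(vectorizer.get_feature_names_out())

Output

[[0 0 1 1]
 [1 1 0 1]]
['beautiful' 'is' 'love' 'us']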
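The same count matrix can be built by hand, which shows what the vectorizer is doing. This is a sketch using only the standard library; the tokenize helper is hypothetical and only approximates CountVectorizer's default behavior (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

docs = ["I love US.", "US is beautiful!"]

def tokenize(doc):
    # Lower-case the text and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in docs]

# The vocabulary is the sorted set of all tokens; each one becomes a column
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one count per vocabulary term
matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['beautiful', 'is', 'love', 'us']
print(matrix)      # [[0, 0, 1, 1], [1, 1, 0, 1]]
```
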

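TF-IDF (term frequency-inverse document frequency): TF-IDF scales each word count by how rare the word is across the documents, so words that appear in every document receive less weight.

Code

from sklearn.feature_extraction.text import TfidfVectorizer
input_texts_ex = ["I love US.", "US is beautiful!"]
vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(input_texts_ex)
print(D.toarray())
print(vectorizer.get_feature_names_out())

Output

[[0.         0.         0.81480247 0.57973867]
 [0.6316672  0.6316672  0.         0.44943642]]
['beautiful' 'is' 'love' 'us']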
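The weighting can be worked through by hand. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is how scikit-learn documents the TfidfVectorizer defaults; the vocabulary and counts are taken as given for the two sample sentences.

```python
import math

# Term counts for two documents over a shared vocabulary (taken as given here)
vocabulary = ["beautiful", "is", "love", "us"]
counts = [[0, 0, 1, 1], [1, 1, 0, 1]]
n_docs = len(counts)

# Document frequency: number of documents that contain each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * weight for tf, weight in zip(row, idf)]
    # L2-normalize the row so every document vector has length 1
    norm = math.sqrt(sum(value * value for value in weighted))
    tfidf.append([value / norm if norm else 0.0 for value in weighted])

for row in tfidf:
    print([round(value, 4) for value in row])
```

"us" occurs in both documents, so its idf is exactly 1 and it ends up with the lowest nonzero weight in each row.
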

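7. Sentiment Analysis

Sentiment analysis determines the emotional tone of a text. The textblob library simplifies this: its sentiment property returns a polarity score from -1.0 (negative) to 1.0 (positive) and a subjectivity score from 0.0 (objective) to 1.0 (subjective).

Code

from textblob import TextBlob
input_text_ex_ = "Charminar is beautiful, It is one of the main attraction of US"
blob = TextBlob(input_text_ex_)
sentiment = blob.sentiment
print(sentiment)

Output (the exact scores depend on the installed TextBlob version and may differ)

Sentiment(polarity=0.85, subjectivity=0.95)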
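One common way to arrive at a polarity score is to look words up in a scored lexicon and average the hits. The sketch below is a toy illustration of that idea only: the lexicon, its scores, and the toy_polarity function are all hypothetical, and this is not a reproduction of TextBlob's implementation.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0]
toy_lexicon = {"beautiful": 0.8, "good": 0.6, "bad": -0.6, "ugly": -0.8}

def toy_polarity(text):
    """Average the polarity of known words; return 0.0 when none are found."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    scores = [toy_lexicon[word] for word in words if word in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

positive = toy_polarity("Charminar is beautiful")
negative = toy_polarity("The weather is bad, the view is ugly.")
neutral = toy_polarity("It is a lake")

print(positive, negative, neutral)
```

Real analyzers also handle negation ("not good") and intensifiers ("very good"), which a plain average ignores.
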

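8. Named Entity Recognition (NER)

NER identifies and classifies named entities in text, such as people, places, and organizations. The spaCy library is commonly used for NER.

Code

import spacy
# Install the model first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
input_text_ex_ = "It is very hot in US, the temperature is around 45 Degrees"
doc1 = nlp(input_text_ex_)
# Collect each entity's text together with its predicted label
doc_entities = [(ent.text, ent.label_) for ent in doc1.ents]
print(doc_entities)

Output (the entities and labels depend on the model version and may differ)

[('US', 'GPE'), ('around 45 Degrees', 'QUANTITY')]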

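Data scientists, analysts, and engineers who work with text data need to be comfortable with text processing in Python. Python's libraries and tools cover a wide range of tasks, from basic string operations to advanced natural language processing. With them you can clean, analyze, and transform text before using it for further analysis or in machine learning applications. Learning these methods improves your ability to analyze data and lays the groundwork for more complex applications across different domains.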
