Tokenizer in Python

March 17, 2025 | 11 min read

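As we all know, there is an enormous amount of text data on the internet. Yet most of us may not know how to begin working with it. Handling natural-language text is also tricky in machine learning, because models operate on numbers, not letters.

So how do we process and clean text data in order to build a model? To answer that, let's explore some of the concepts behind Natural Language Processing (NLP).

Solving an NLP problem is a multi-stage process. Before moving on to modeling, we first need to clean the unstructured text data. Data cleaning involves several key steps, including:

Word tokenization
Predicting the part of speech of each word
Lemmatizing the text
Identifying and removing stop words, and more.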
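As a rough illustration of how two of these steps fit together, the sketch below splits a sentence into tokens and then drops stop words. The tiny STOP_WORDS set and the remove_stop_words() helper are hypothetical names made up for this example; a real pipeline would typically rely on a much larger, library-provided stop-word list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

# Tokenize first, then filter.
tokens = "The goal of tokenization is to split text".split()
print(remove_stop_words(tokens))
# ['goal', 'tokenization', 'split', 'text']
```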

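In this tutorial, we will learn more about the most fundamental of these steps: tokenization. We will look at what tokenization is and why it is essential to NLP, and then explore several ways to perform it in Python.

Understanding Tokenization

Tokenization is the process of splitting a large body of text into smaller pieces called tokens. These tokens are very useful for discovering patterns, and tokenization is regarded as the foundational step for stemming and lemmatization. (In data security, the term "tokenization" also refers to replacing sensitive data elements with non-sensitive substitutes; that is a different usage from the one covered here.)

NLP is used to build applications such as text classification, sentiment analysis, intelligent chatbots, and language translation. To achieve these goals, it is important to understand the patterns in the text.

For now, think of stemming and lemmatization as the main steps for cleaning text data with NLP. Tasks such as text classification or spam filtering use NLP together with deep learning libraries such as Keras and TensorFlow.

Understanding the Importance of Tokenization in NLP

To understand why tokenization matters, let's take English as an example. Keep any sentence in mind while reading the following section.

Before processing natural language, we need to identify the words that make up a string of characters. This is why tokenization is the most basic step in NLP.

The step is necessary because the actual meaning of a text can be interpreted by analyzing each word in it.

Now, consider the following string as an example:

My name is Jamie Clark.

After tokenizing this string, we get output like the following: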

['My', 'name', 'is', 'Jamie', 'Clark']

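Tokenizing text this way has several uses. We can use the tokenized form to:

Count the total number of words in the text.
Compute word frequencies, that is, the number of times a particular word occurs, and more.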
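Both uses can be sketched with the standard library alone. The example below assumes the text has already been tokenized (here with a plain split() on a made-up sentence) and uses collections.Counter for the frequencies; the variable names are illustrative.

```python
from collections import Counter

# A made-up sentence, tokenized on whitespace.
tokens = "the cat sat on the mat because the mat was warm".split()

total_words = len(tokens)          # total number of tokens
word_frequencies = Counter(tokens) # maps each token to its count

print(total_words)                      # 11
print(word_frequencies["the"])          # 3
print(word_frequencies.most_common(2))  # [('the', 3), ('mat', 2)]
```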

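Now, let's look at several ways to perform tokenization for NLP in Python.

Some Methods to Perform Tokenization in Python

There are various ways to tokenize text data. Some of them are described below:

Tokenization Using the split() Function in Python

The split() method is one of the basic ways to break up a string. It splits the string at the given separator and returns a list of strings. By default, split() splits at runs of whitespace, but we can specify a different separator when needed.

Let's look at the following example:

Example 1.1: Word tokenization using the split() function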

my_text = """Let's play a game, Would You Rather! It's simple, you have to pick one or the other. Let's get started. Would you rather try Vanilla Ice Cream or Chocolate one? Would you rather be a bird or a bat? Would you rather explore space or the ocean? Would you rather live on Mars or on the Moon? Would you rather have many good friends or one very best friend? Isn't it easy though? When we have less choices, it's easier to decide. But what if the options would be complicated? I guess, you pretty much not understand my point, neither did I, at first place and that led me to a Bad Decision."""

print(my_text.split())

Output

["Let's", 'play', 'a', 'game,', 'Would', 'You', 'Rather!', "It's", 'simple,', 'you', 'have', 'to', 'pick', 'one', 'or', 'the', 'other.', "Let's", 'get', 'started.', 'Would', 'you', 'rather', 'try', 'Vanilla', 'Ice', 'Cream', 'or', 'Chocolate', 'one?', 'Would', 'you', 'rather', 'be', 'a', 'bird', 'or', 'a', 'bat?', 'Would', 'you', 'rather', 'explore', 'space', 'or', 'the', 'ocean?', 'Would', 'you', 'rather', 'live', 'on', 'Mars', 'or', 'on', 'the', 'Moon?', 'Would', 'you', 'rather', 'have', 'many', 'good', 'friends', 'or', 'one', 'very', 'best', 'friend?', "Isn't", 'it', 'easy', 'though?', 'When', 'we', 'have', 'less', 'choices,', "it's", 'easier', 'to', 'decide.', 'But', 'what', 'if', 'the', 'options', 'would', 'be', 'complicated?', 'I', 'guess,', 'you', 'pretty', 'much', 'not', 'understand', 'my', 'point,', 'neither', 'did', 'I,', 'at', 'first', 'place', 'and', 'that', 'led', 'me', 'to', 'a', 'Bad', 'Decision.']

Explanation

In the example above, we used the split() method to break the paragraph into smaller pieces, namely words. Similarly, we can split the paragraph into sentences by passing a separator as an argument to split(). Since sentences usually end with a period ("."), we can use a period followed by a space (". ") as the separator.

Let's see this in the following example:

Example 1.2: Sentence tokenization using the split() function

my_text = """Dreams. Desires. Reality. There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality. Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us. We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try."""

print(my_text.split('. '))

Output

['Dreams', 'Desires', 'Reality', 'There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality', 'Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us', 'We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try.']

Explanation

In the example above, we called split() with a period followed by a space ('. ') as its argument, so the paragraph was split at the end of each sentence. A major drawback of split() is that it accepts only one separator at a time, so we can split a string on only one delimiter. In addition, split() does not treat punctuation marks as separate tokens.
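The short sketch below makes both limitations concrete on a made-up sentence. It also shows one possible workaround for attached punctuation, stripping it with string.punctuation; this is just one option, not the only way to handle it.

```python
import string

text = "Wait! Is this easy? Yes, it is."

# split() leaves punctuation attached to the neighbouring word.
raw_tokens = text.split()
print(raw_tokens)
# ['Wait!', 'Is', 'this', 'easy?', 'Yes,', 'it', 'is.']

# One workaround: strip leading/trailing punctuation from every token.
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]
print(clean_tokens)
# ['Wait', 'Is', 'this', 'easy', 'Yes', 'it', 'is']

# A single separator cannot cover "!", "?" and "." at once,
# so splitting on ". " finds no sentence boundary here.
one_separator = text.split(". ")
print(one_separator)
# ['Wait! Is this easy? Yes, it is.']
```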

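Tokenization Using RegEx (Regular Expressions) in Python

Before moving on to the next method, let's briefly review regular expressions. A regular expression, also known as RegEx, is a special sequence of characters that serves as a pattern for finding or matching other strings or sets of strings.

To work with regular expressions, Python provides a library named re, which is part of the standard library.

Let's look at some examples of word tokenization and sentence tokenization using RegEx in Python.

Example 2.1: Word tokenization using the RegEx method in Python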

import re

my_text = """Joseph Arthur was a young businessman. He was one of the shareholders at Ryan Cloud's Start-Up with James Foster and George Wilson. The Start-Up took its flight in the mid-90s and became one of the biggest firms in the United States of America. The business was expanded in all major sectors of livelihood, starting from Personal Care to Transportation by the end of 2000. Joseph was used to be a good friend of Ryan."""

# Match runs of word characters and apostrophes.
my_tokens = re.findall(r"[\w']+", my_text)
print(my_tokens)

Output

['Joseph', 'Arthur', 'was', 'a', 'young', 'businessman', 'He', 'was', 'one', 'of', 'the', 'shareholders', 'at', 'Ryan', "Cloud's", 'Start', 'Up', 'with', 'James', 'Foster', 'and', 'George', 'Wilson', 'The', 'Start', 'Up', 'took', 'its', 'flight', 'in', 'the', 'mid', '90s', 'and', 'became', 'one', 'of', 'the', 'biggest', 'firms', 'in', 'the', 'United', 'States', 'of', 'America', 'The', 'business', 'was', 'expanded', 'in', 'all', 'major', 'sectors', 'of', 'livelihood', 'starting', 'from', 'Personal', 'Care', 'to', 'Transportation', 'by', 'the', 'end', 'of', '2000', 'Joseph', 'was', 'used', 'to', 'be', 'a', 'good', 'friend', 'of', 'Ryan']

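Explanation

In the example above, we imported the re library and used its findall() function. This function finds all substrings that match the given pattern and returns them in a list.

Here, "\w" stands for any word character: letters, digits, and the underscore (_). "+" means one or more occurrences. The pattern [\w']+ therefore matches runs of word characters and apostrophes, stopping as soon as any other character is encountered.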
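To see how the character class changes the result, the sketch below runs three patterns over the same made-up sentence. The third pattern, which also allows hyphens, is an illustrative extension and does not appear in the example above.

```python
import re

text = "Ryan's start-up wasn't small in the mid-90s."

# \w+ matches runs of letters, digits and underscores only.
word_chars_only = re.findall(r"\w+", text)
print(word_chars_only)
# ['Ryan', 's', 'start', 'up', 'wasn', 't', 'small', 'in', 'the', 'mid', '90s']

# Adding the apostrophe to the character class keeps contractions together.
with_apostrophes = re.findall(r"[\w']+", text)
print(with_apostrophes)
# ["Ryan's", 'start', 'up', "wasn't", 'small', 'in', 'the', 'mid', '90s']

# Adding the hyphen as well keeps hyphenated words together.
with_hyphens = re.findall(r"[\w'-]+", text)
print(with_hyphens)
# ["Ryan's", 'start-up', "wasn't", 'small', 'in', 'the', 'mid-90s']
```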

Now, let's look at sentence tokenization using the RegEx method.

Example 2.2: Sentence tokenization using the RegEx method in Python

import re

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

my_sentences = re.compile('[.!?] ').split(my_text)
print(my_sentences)

Output

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America', 'The product became so successful among the people that the production was increased', 'Two new plant sites were finalized, and the construction was started', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories', 'Many popular magazines were started publishing Critiques about him.']

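Explanation

In the example above, we used the re library's compile() function with the pattern '[.!?] ' and then called split() to divide the string at the specified delimiters. As a result, the program splits the text wherever one of these characters is followed by a space. Because the matched delimiter is discarded, every sentence except the last loses its closing punctuation.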
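If the sentence-ending punctuation should be kept, one possible variation is to split on the whitespace that follows it, using a lookbehind. This is a sketch on a made-up string, not the approach used in the example above, and it would still split incorrectly after abbreviations such as "Dr.".

```python
import re

text = "Is it ready? Yes! Ship it today. Done."

# (?<=[.!?]) asserts that the previous character is ., ! or ? without
# consuming it, so only the whitespace is removed by the split.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['Is it ready?', 'Yes!', 'Ship it today.', 'Done.']
```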

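Tokenization Using the Natural Language Toolkit in Python

The Natural Language Toolkit, also known as NLTK, is a library written in Python. NLTK is commonly used for symbolic and statistical natural language processing and works well with text data.

NLTK is a third-party library that can be installed from the command prompt or terminal with the following command:

pip install nltk

To verify the installation, import the nltk library in a program and run it:

import nltk

If the program does not raise an error, the library was installed successfully. Otherwise, repeat the installation steps above and consult the official documentation for more details.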
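As an alternative to importing the library and watching for an error, the check can be sketched with importlib from the standard library. The is_installed() helper is a hypothetical name introduced for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("nltk"))  # True once NLTK is installed
```

Note that NLTK's word and sentence tokenizers also rely on the Punkt tokenizer data, which is typically fetched once with nltk.download(); the exact resource name ('punkt' or 'punkt_tab') depends on the NLTK version.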

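NLTK has a module named tokenize, which offers two broad kinds of tokenization: word tokenization and sentence tokenization.

Word tokenization: the word_tokenize() method splits a string into tokens, or words.
Sentence tokenization: the sent_tokenize() method splits a string or paragraph into sentences.

Let's look at some examples based on these two methods:

Example 3.1: Word tokenization using the NLTK library in Python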

from nltk.tokenize import word_tokenize

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

print(word_tokenize(my_text))

Output

['The', 'Advertisement', 'was', 'telecasted', 'nationwide', ',', 'and', 'the', 'product', 'was', 'sold', 'in', 'around', '30', 'states', 'of', 'America', '.', 'The', 'product', 'became', 'so', 'successful', 'among', 'the', 'people', 'that', 'the', 'production', 'was', 'increased', '.', 'Two', 'new', 'plant', 'sites', 'were', 'finalized', ',', 'and', 'the', 'construction', 'was', 'started', '.', 'Now', ',', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of', 'America', "'s", 'biggest', 'firms', 'and', 'the', 'mass', 'producer', 'in', 'all', 'major', 'sectors', ',', 'from', 'transportation', 'to', 'personal', 'care', '.', 'Director', 'of', 'The', 'Cloud', 'Enterprise', ',', 'Ryan', 'Cloud', ',', 'was', 'now', 'started', 'getting', 'interviewed', 'over', 'his', 'success', 'stories', '.', 'Many', 'popular', 'magazines', 'were', 'started', 'publishing', 'Critiques', 'about', 'him', '.']

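Explanation

In the program above, we imported the word_tokenize() method from the tokenize module of the NLTK library. The method splits the string into individual tokens and returns them in a list, which we then print. Note that this method also treats periods and other punctuation marks as separate tokens.

Example 3.2: Sentence tokenization using the NLTK library in Python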
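When the punctuation tokens are not wanted, they can be filtered out afterwards. The sketch below assumes a token list shaped like the output above; it is hard-coded so that the example runs without NLTK, and the filtering rule (drop tokens made up entirely of punctuation) is one reasonable choice among several.

```python
import string

# A token list shaped like word_tokenize() output (hard-coded here so the
# example runs without NLTK installed).
tokens = ['Now', ',', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of',
          'America', "'s", 'biggest', 'firms', '.']

# Keep only tokens that are not made up entirely of punctuation characters.
words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
print(words)
# ['Now', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of', 'America', "'s", 'biggest', 'firms']
```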

from nltk.tokenize import sent_tokenize

my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""

print(sent_tokenize(my_text))

Output

['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America.', 'The product became so successful among the people that the production was increased.', 'Two new plant sites were finalized, and the construction was started.', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care.", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories.', 'Many popular magazines were started publishing Critiques about him.']

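Explanation

In the program above, we imported the sent_tokenize() method from the tokenize module of the NLTK library. The method splits the paragraph into individual sentences and returns them in a list, which we then print.

Conclusion

In this tutorial, we explored the concept of tokenization and its role in the overall NLP pipeline. We also discussed several ways to tokenize a given text or string in Python, covering both word tokenization and sentence tokenization.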
