CountVectorizer in NLP

June 17, 2025 | 5 min read

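scikit-learn's CountVectorizer is a basic natural language processing tool that converts text into a numerical representation. It implements the bag-of-words (BoW) model, which represents text by how often each word occurs in a document or corpus. This conversion matters because most machine learning algorithms, including those used for text classification, sentiment analysis, and topic modeling, require numerical input.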

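CountVectorizer works in three steps. First, it tokenizes each document, splitting the text into words or into sequences of words called n-grams. Next, it builds a vocabulary of the unique terms. Finally, it builds a sparse matrix for the whole corpus, in which each row corresponds to a document and each column to a term in the vocabulary. Each value in the matrix is the number of times that term occurs in that document. For example, if a term appears twice in a document, its count in the matrix is 2.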

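One of CountVectorizer's most useful features is how configurable it is: you can customize tokenization, filter out stop words, and set minimum and maximum document-frequency thresholds for terms (the min_df and max_df parameters). This gives finer control over how the text is represented. It can also generate n-grams, which capture some context by representing consecutive word sequences rather than treating every word as an independent unit.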

Next, we use CountVectorizer to tokenize URLs so that a model can learn patterns from their characters or words.

Importing Libraries

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Loading the Dataset

# Loading the data
dataset = pd.read_csv("/kaggle/input/malicious-urls-dataset/malicious_phish.csv", nrows = 150000)

Data Preprocessing

# Preprocessing the data
# Converting types into numeric labels
mapping_label = {"benign": 0, "phishing": 1, "defacement": 2, "malware" : 3}
dataset['label'] = dataset['type'].map(mapping_label)
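Series.map returns NaN for any value that is missing from the mapping, so it can be worth checking for unmapped rows before training. The sketch below uses a small made-up DataFrame (the "spam" type is deliberately not in the mapping); whether the real dataset contains such rows is not stated in this tutorial.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset
toy = pd.DataFrame({
    "url": ["example.com", "bad.example/login", "other.example"],
    "type": ["benign", "phishing", "spam"],  # "spam" has no entry in the mapping
})
mapping_label = {"benign": 0, "phishing": 1, "defacement": 2, "malware": 3}
toy["label"] = toy["type"].map(mapping_label)

# Rows whose type was not in the mapping end up with a NaN label
unmapped = toy[toy["label"].isna()]
print(unmapped["type"].unique())

# One option: drop those rows and restore an integer dtype
clean = toy.dropna(subset=["label"]).copy()
clean["label"] = clean["label"].astype(int)
print(clean)
```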

Feature Engineering

# Feature Engineering
# Convert each URL into case-sensitive character counts (one column per character).
vectorizer = CountVectorizer(analyzer='char', lowercase=False)
X = vectorizer.fit_transform(dataset['url'])

# Labels (y) that go with the feature matrix X
y = dataset['label']

Splitting the Dataset

#Divide the data into sets for testing and training.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=32)
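train_test_split accepts SciPy sparse matrices directly. When classes are imbalanced, passing stratify keeps the class proportions the same in both splits. The tutorial's split does not use stratify, so the sketch below is only an optional variation, shown on made-up data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, imbalanced labels (15 vs 5)
X_toy = csr_matrix(np.arange(40).reshape(20, 2))
y_toy = np.array([0] * 15 + [1] * 5)

# stratify=y_toy preserves the 75/25 class ratio in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=32, stratify=y_toy
)

print(Xtr.shape, Xte.shape)
print(np.bincount(ytr), np.bincount(yte))
```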

Code

Output

<1x173 sparse matrix of type '<class 'numpy.int64'>' 	with 11 stored elements in Compressed Sparse Row format>


CountVectorizer for Feature Extraction

Code

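# Illustrates how the fitted vectorizer encodes a given URL.
# transform() reuses the vocabulary learned from the dataset;
# calling fit_transform() here would overwrite that vocabulary.
url1 = ['https://www.google.com/']
pb = vectorizer.transform(url1)
print(pb)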

Output

Code

Output

scipy.sparse._csr.csr_matrix
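When a new URL is encoded for an already-trained model, its columns need to match the vocabulary learned from the training data. The sketch below contrasts transform on a fitted vectorizer with fit_transform on a fresh one. The training URLs are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training URLs and one new URL
train_urls = ["http://example.com/login", "https://example.org/index.html"]
new_url = ["https://www.google.com/"]

fitted = CountVectorizer(analyzer="char", lowercase=False)
X_train_toy = fitted.fit_transform(train_urls)

# transform() keeps the training vocabulary, so the column count matches
new_vec = fitted.transform(new_url)

# fit_transform() on a fresh vectorizer learns a vocabulary from this one URL only
refit_vec = CountVectorizer(analyzer="char", lowercase=False).fit_transform(new_url)

print(X_train_toy.shape[1], new_vec.shape[1], refit_vec.shape[1])
```

Characters in the new URL that never appeared in the training URLs (here, "w") are simply ignored by transform.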

Code

print(f"train_X Shape : {train_X.shape}")
print(f"train_y Shape : {train_y.shape}")
print(f"test_X  Shape : {test_X.shape}")
print(f"test_y  Shape : {test_y.shape}")

Output

Now we will feed this data to several models and compare their accuracy.

Logistic Regression

Code

from sklearn.linear_model import LogisticRegression

one_classifier = LogisticRegression(max_iter=1000, random_state=24)
one_classifier.fit(train_X, train_y)
one_predy = one_classifier.predict(test_X)
from sklearn.metrics import confusion_matrix
onecm = confusion_matrix(test_y, one_predy)
print(onecm)
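In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. A minimal sketch with made-up labels for the four classes (0 to 3):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [0, 0, 0, 1, 1, 2, 3, 3]
y_hat = [0, 0, 1, 1, 1, 2, 3, 0]

cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2, 3])
print(cm)

# Per-class recall: correct predictions (diagonal) divided by the row totals
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print(per_class_recall)
```

Here the entry in row 3, column 0 is 1: one sample of class 3 was predicted as class 0.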

Output

KNeighborsClassifier

Code

from sklearn.neighbors import KNeighborsClassifier

two_classifier = KNeighborsClassifier(metric='euclidean')
two_classifier.fit(train_X, train_y)
two_predy = two_classifier.predict(test_X)
from sklearn.metrics import confusion_matrix

twocm = confusion_matrix(test_y, two_predy)
print(twocm)

Output

MultinomialNB

Code

from sklearn.naive_bayes import MultinomialNB

three_classifier = MultinomialNB()
three_classifier.fit(train_X, train_y)
three_predy = three_classifier.predict(test_X)
threecm = confusion_matrix(test_y, three_predy)
print(threecm)
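Because MultinomialNB works directly on count features, the vectorizer and the classifier can also be chained in a scikit-learn Pipeline, which then accepts raw URL strings. This is an optional sketch rather than the tutorial's approach; the URLs and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy URLs and labels (0 = benign, 1 = phishing)
toy_urls = [
    "http://example.com/home",
    "http://example.org/about",
    "http://198.51.100.7/login.php?id=1",
    "http://203.0.113.9/verify.php?acc=2",
]
toy_labels = [0, 0, 1, 1]

# The pipeline vectorizes and classifies in a single fit/predict call
url_model = make_pipeline(
    CountVectorizer(analyzer="char", lowercase=False),
    MultinomialNB(),
)
url_model.fit(toy_urls, toy_labels)

pred = url_model.predict(["http://example.com/contact"])
print(pred)
```

With a pipeline, new URLs are always encoded with the vocabulary learned during fit, so there is no separate transform step to forget.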

Output

DecisionTreeClassifier

Code

from sklearn.tree import DecisionTreeClassifier 

four_classifier = DecisionTreeClassifier(criterion = 'entropy', random_state=22)
four_classifier.fit(train_X, train_y)
four_predy = four_classifier.predict(test_X)
fourcm = confusion_matrix(test_y, four_predy)
print(fourcm)

Output

RandomForestClassifier

Code

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
five_classifier = RandomForestClassifier(n_estimators=100,max_features='sqrt')
five_classifier.fit(train_X,train_y)
five_predy = five_classifier.predict(test_X)
fivecm = confusion_matrix(test_y, five_predy)
print(fivecm)

Output

XGBClassifier

Code

from xgboost import XGBClassifier

# train_X is a sparse count matrix and train_y holds the integer labels 0-3
six_classifier = XGBClassifier(random_state=32)
six_classifier.fit(train_X, train_y)
six_predy = six_classifier.predict(test_X)
sixcm = confusion_matrix(test_y, six_predy)
print(sixcm)

Output

Performance Metrics

Code

from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, accuracy_score

# Predictions from the six classifiers trained above
classifiers = [one_predy, two_predy, three_predy, four_predy, five_predy, six_predy]
names_of_model = ['LogisticRegression', 'KNeighborsClassifier', 'MultinomialNB', 'DecisionTreeClassifier', 'RandomForestClassifier', 'XGBClassifier']

# Build a dictionary that maps each model name to its predictions.
map_classifier = dict(zip(names_of_model, classifiers))

# Iterate over each model name and its predictions.
for model_name, y_pred in map_classifier.items():
    accuracy = accuracy_score(test_y, y_pred) 
    precision = precision_score(test_y, y_pred, average='weighted') 
    recall = recall_score(test_y, y_pred, average='weighted') 
    f1score = f1_score(test_y, y_pred, average='weighted') 
    class_report = classification_report(test_y, y_pred)
    
    print(f"\nMetrics for Model '{model_name}':")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1score}")
    print(f"Classification Report:\n{class_report}")
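With average='weighted', each class's score is weighted by its support, meaning the number of true samples of that class. The sketch below reproduces weighted precision by hand on made-up labels and compares it with the macro average, which weights every class equally.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 0])

# Precision for each class separately
per_class = precision_score(y_true, y_hat, average=None)

# Support: number of true samples in each class
support = np.bincount(y_true)

# Weighted average computed by hand, then by scikit-learn
manual_weighted = np.sum(per_class * support) / support.sum()
sk_weighted = precision_score(y_true, y_hat, average="weighted")
sk_macro = precision_score(y_true, y_hat, average="macro")

print(per_class, support)
print(manual_weighted, sk_weighted, sk_macro)
```

On an imbalanced dataset the weighted average is dominated by the largest class, so it can look good even when a rare class is predicted poorly.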

Output

Let's display the results as a bar chart.

Code

import numpy as np
import matplotlib.pyplot as plt

# Set up the bar chart's data first.
models = list(map_classifier.keys())
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']

# Determine each model's metric scores.
scores_metric = {}
for metric in metrics:
    scores_metric[metric] = [accuracy_score(test_y, map_classifier[model_name]) if metric == 'Accuracy'
                             else precision_score(test_y, map_classifier[model_name], average='weighted') if metric == 'Precision'
                             else recall_score(test_y, map_classifier[model_name], average='weighted') if metric == 'Recall'
                             else f1_score(test_y, map_classifier[model_name], average='weighted') for model_name in models]

# Set the width of the bars
bar_width = 0.2

# Set the position of bars on the X-axis
l1 = np.arange(len(models))
l2 = [x + bar_width for x in l1]
l3 = [x + bar_width for x in l2]
l4 = [x + bar_width for x in l3]

plt.figure(figsize=(14, 6))

# Defining light colors
colors = ['#add8e6', '#ffcc99', '#90ee90', '#ffb6c1']

# Plotting the bars
plt.bar(l1, scores_metric['Accuracy'], color=colors[0], width=bar_width, edgecolor='grey', label='Accuracy')
plt.bar(l2, scores_metric['Precision'], color=colors[1], width=bar_width, edgecolor='grey', label='Precision')
plt.bar(l3, scores_metric['Recall'], color=colors[2], width=bar_width, edgecolor='grey', label='Recall')
plt.bar(l4, scores_metric['F1-score'], color=colors[3], width=bar_width, edgecolor='grey', label='F1-score')

# Put xticks in the group bars' center.
plt.xlabel('Models', fontweight='bold')
plt.xticks([r + bar_width*1.5 for r in range(len(models))], models)

# Set y-axis limits
plt.ylim(0.6, 1.0)

# Add a legend and show the plot
plt.legend()
plt.show()
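The chained conditional expression used above to fill scores_metric can be hard to read. One alternative is to map metric names to functions and collect the scores in a pandas DataFrame. This is a sketch on made-up predictions: the model names, labels, and the zero_division=0 setting are assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical test labels and predictions from two models
y_test_toy = [0, 1, 1, 0, 2, 2]
toy_predictions = {
    "model_a": [0, 1, 1, 0, 2, 2],
    "model_b": [0, 1, 0, 0, 2, 1],
}

# Metric name -> function taking (true labels, predicted labels)
metric_fns = {
    "Accuracy": accuracy_score,
    "Precision": lambda t, p: precision_score(t, p, average="weighted", zero_division=0),
    "Recall": lambda t, p: recall_score(t, p, average="weighted", zero_division=0),
    "F1-score": lambda t, p: f1_score(t, p, average="weighted", zero_division=0),
}

# One row per model, one column per metric
rows = {
    name: {metric: fn(y_test_toy, preds) for metric, fn in metric_fns.items()}
    for name, preds in toy_predictions.items()
}
scores = pd.DataFrame.from_dict(rows, orient="index")
print(scores)
```

A DataFrame in this shape can be plotted as grouped bars with scores.plot.bar(), which avoids computing the bar positions by hand.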

Output

Code

import matplotlib.pyplot as plt

# Set up the line plot's data first.
models = list(map_classifier.keys())
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']

# Determine each model's metric scores.
scores_metric = {}
for metric in metrics:
    scores_metric[metric] = [accuracy_score(test_y, map_classifier[model_name]) if metric == 'Accuracy'
                             else precision_score(test_y, map_classifier[model_name], average='weighted') if metric == 'Precision'
                             else recall_score(test_y, map_classifier[model_name], average='weighted') if metric == 'Recall'
                             else f1_score(test_y, map_classifier[model_name], average='weighted') for model_name in models]

# Making a line plot
plt.figure(figsize=(10, 6))

for metric in metrics:
    plt.plot(models, scores_metric[metric], marker='o', label=metric)

plt.title('Performance Metrics of Different Models Using Count Vectorizer')
plt.xlabel('Models')
plt.ylabel('Score')
plt.ylim(0.7, 1.0)
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Output
