Document Classification in Machine Learning

August 27, 2025 | 6 min read

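In today's digital era, businesses and institutions face the daunting task of managing vast amounts of information held in many different document formats. Organizing and classifying this information efficiently is essential for fast retrieval and informed decision-making. Applying machine learning to document classification has become an effective remedy, automating and streamlining these key processes.

Document classification plays a vital role in information management, simplifying storage, retrieval, and analysis. By sorting documents into relevant categories, organizations can build well-organized repositories, promote knowledge sharing, and improve overall productivity. Traditional manual classification is laborious, error-prone, and time-consuming, which highlights the value of automated machine learning techniques here.

Machine learning algorithms can examine a document's content, structure, and metadata to classify it accurately. Supervised learning techniques, including Naive Bayes, support vector machines (SVM), and random forests, are widely used for classification. These algorithms learn from annotated training data in which each document has been assigned its category. In addition, unsupervised methods such as K-means clustering and hierarchical clustering can uncover hidden patterns and group similar documents without predefined category information.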
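To make the supervised setting concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. The tiny training set, the category names, and the function names `train_nb` and `predict_nb` are made up for illustration; a real project would more likely rely on a library implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, text). Returns the counts needed for prediction."""
    class_counts = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)    # word frequencies per class
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical annotated training data: (category, document text).
training_docs = [
    ("invoice", "payment due amount invoice total"),
    ("invoice", "invoice number amount tax"),
    ("contract", "agreement between parties terms"),
    ("contract", "terms and conditions of the agreement"),
]
model = train_nb(training_docs)
print(predict_nb(model, "total amount due"))
print(predict_nb(model, "parties agree to the terms"))
```

The classifier learns only from the labeled examples: each class is scored by its prior frequency and by how often the document's words appeared in that class during training.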

Now let us implement a document classifier in code.

Code

Importing Libraries

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

Loading the Dataset

DATASET = "/kaggle/input/document-classification/file.txt"
print(f"Dataset found = {os.path.exists(DATASET)}")

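data = []
with open(DATASET, "r") as fp:
    rows = fp.read().split("\n")
    for i, row in enumerate(rows):
        # Skip the first row and any blank rows (e.g. the one left by a
        # trailing newline, which would otherwise break int() below).
        if i == 0 or not row.strip():
            continue

        # Each row is "<label> <text...>": an integer label, then the text.
        texts = row.split(" ")
        label = int(texts[0])
        text = " ".join(texts[1:])
        data.append({'label': label, 'text': text})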

df = pd.DataFrame(data)
df.head(5)
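The loader above assumes each row holds an integer label followed by a space and the document text. A minimal sketch of that parsing step on an in-memory string, with made-up sample rows and a hypothetical helper name `parse_rows`, shows how blank rows can be skipped safely:

```python
def parse_rows(raw, skip_first=True):
    """Parse '<label> <text>' rows into a list of dicts.

    Assumption: the first row is skipped, as the loader above does,
    and blank rows (such as the one after a trailing newline) are ignored.
    """
    records = []
    for i, row in enumerate(raw.split("\n")):
        if (skip_first and i == 0) or not row.strip():
            continue
        # Split only at the first space: label on the left, text on the right.
        label, _, text = row.partition(" ")
        records.append({"label": int(label), "text": text})
    return records

# Made-up sample content in the assumed format.
sample = "label text\n1 quarterly earnings report\n2 football match result\n"
records = parse_rows(sample)
print(records)
```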

Output

print(f"Number of rows in the dataset = {df.shape[0]}")

num_labels = df["label"].nunique()
print(f"Number of unique labels = {num_labels}")

# Label distribution

df["label"].value_counts()

Output

Note: The dataset is moderately class-imbalanced, so a stratified split is used.

test_split = 0.2

train_df, test_df = train_test_split(df, test_size=test_split, stratify=df['label'].values)

# Split the held-out 20% in half: one half for validation, the other for testing.
val_df = test_df.sample(frac=0.5)
test_df = test_df.drop(val_df.index)

print(f"Number of rows in training set: {len(train_df)}")
print(f"Number of rows in validation set: {len(val_df)}")
print(f"Number of rows in test set: {len(test_df)}")
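Stratification keeps the label proportions roughly equal across the splits, which matters when some classes are rare. The following is a simplified plain-Python sketch of the idea, not the algorithm scikit-learn uses internally; the helper name `stratified_split` and the toy labels are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy imbalanced label set: 80 documents of class 0, 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both splits keep the 4:1 ratio of the full label set, so the rare class is represented in the test split as well as in the training split.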

Output

GPU API

First, we need to check whether a GPU is available.

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Limit TensorFlow's use to just the first GPU
  try:
    tf.config.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # It is necessary to establish visible devices before initializing the GPUs.
    print(e)
    
  try:
    # Currently, memory growth across GPUs must be uniform.
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Before GPU initialization, memory growth must be configured.
    print(e)
    
tf.config.run_functions_eagerly(False)

Output

Preparing the Data

The data must be put into a form suitable for computation.

# Convert the integer labels to one-hot vectors.

labels = tf.ragged.constant(train_df["label"].values)

# One-hot encoding is the usual target format for multi-class classification.
label_lookup = tf.keras.layers.IntegerLookup(output_mode="one_hot")
label_lookup.adapt(labels)
vocab = label_lookup.get_vocabulary()

def invert_multi_hot(encoded_labels):
    """Map a single one-hot (or multi-hot) encoded label back to its vocab terms."""
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)


print("Vocabulary:\n")
print(vocab)

Output

# Each unique class in the label pool gets its own position, so a label is represented as a vector of 0s and 1s. Here is an illustration.

sample_label = train_df["label"].iloc[1]
print(f"Original label: {sample_label}")

label_binarized = label_lookup([sample_label])
print(f"Label-binarized representation: {label_binarized}")
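As a rough picture of what the lookup layer produces, the sketch below builds one-hot vectors in plain Python. It assumes one slot at index 0 is reserved for out-of-vocabulary labels (represented by `-1`), which is how `IntegerLookup` behaves with its default settings; the helper names, the tie-breaking rule, and the example labels are hypothetical.

```python
def build_vocab(labels, oov_token=-1):
    """Index 0 is reserved for unseen labels; known labels follow by frequency."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Most frequent first; ties broken by label value to keep this deterministic.
    known = sorted(counts, key=lambda l: (-counts[l], l))
    return [oov_token] + known

def one_hot(label, vocab):
    """Return a list of 0.0/1.0 with a single 1.0 at the label's position."""
    index = vocab.index(label) if label in vocab else 0
    return [1.0 if i == index else 0.0 for i in range(len(vocab))]

def invert_one_hot(encoded, vocab):
    """Map a one-hot vector back to the label it encodes."""
    return vocab[encoded.index(1.0)]

train_labels = [3, 1, 2, 1, 3, 1]      # hypothetical integer class labels
vocab = build_vocab(train_labels)
encoded = one_hot(2, vocab)

print(vocab)
print(encoded)
print(invert_one_hot(encoded, vocab))
```

The vector length equals the vocabulary size, including the reserved slot, which is why the models' output layers later use `label_lookup.vocabulary_size()` units.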

Output

Data Preprocessing

# Start by looking at the distribution of sequence lengths (words per document), which is a reasonable basis for choosing max_seqlen below.

train_df["text"].apply(lambda x: len(x.split(" "))).describe()

Output

max_seqlen = 107
batch_size = 128
padding_token = "<pad>"
auto = tf.data.AUTOTUNE

def make_dataset(dataframe, is_train=True):
    labels = tf.ragged.constant(dataframe["label"].values)
    label_binarized = label_lookup(labels).numpy()
    dataset = tf.data.Dataset.from_tensor_slices(
        (dataframe["text"].values, label_binarized)
    )
    dataset = dataset.shuffle(batch_size * 10) if is_train else dataset
    return dataset.batch(batch_size)


train_dataset = make_dataset(train_df, is_train=True)
validation_dataset = make_dataset(val_df, is_train=False)
test_dataset = make_dataset(test_df, is_train=False)


""" Dataset preview """

text_batch, label_batch = next(iter(train_dataset))

for i, text in enumerate(text_batch[0:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Text: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    print(" ")

Output

"""
Vectorize the text with a TextVectorization layer in TF-IDF mode.
"""

vocabulary = set()
train_df["text"].str.lower().str.split().apply(vocabulary.update)
vocabulary_size = len(vocabulary)
print(f"Vocabulary size = {vocabulary_size}")

text_vectorizer = layers.TextVectorization(
    max_tokens=vocabulary_size, ngrams=2, output_mode="tf_idf"
)

# The `TextVectorization` layer must be adapted to the vocabulary of our training set.
with tf.device("/CPU:0"):
    text_vectorizer.adapt(train_dataset.map(lambda text, label: text))

train_dataset = train_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

validation_dataset = validation_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

test_dataset = test_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
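To see what a TF-IDF representation captures, here is a small plain-Python sketch over single words. It uses a common textbook weighting, tf * log(N / df), which is not necessarily the exact formula the Keras layer applies, and it leaves out the bigrams requested by `ngrams=2`; the corpus and the function name `tfidf_vectors` are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors), one vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for words in tokenized for word in words})
    n_docs = len(docs)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for words in tokenized for word in set(words))

    vectors = []
    for words in tokenized:
        term_freq = Counter(words)
        vectors.append(
            [term_freq[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary]
        )
    return vocabulary, vectors

corpus = [
    "the contract terms",
    "the invoice amount",
    "the invoice terms",
]
vocabulary, vectors = tfidf_vectors(corpus)
print(vocabulary)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

A word that occurs in every document, such as "the" here, gets weight 0, while words that appear in fewer documents are weighted up.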

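Output

Helper Methods

Helper methods are functions or procedures that assist with specific tasks in a program. They handle repetitive or common operations, making the code more modular, readable, and maintainable.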

def plot_result(history, item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title("Train and Validation {} Over Epochs".format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()

Modeling

Here we train the models and look at their accuracy.

1. Simple Feed-Forward Network

def make_model():
    model = keras.Sequential(
        [
            layers.Dense(512, activation="relu"),
            layers.Dense(256, activation="relu"),
            layers.Dense(label_lookup.vocabulary_size(), activation="sigmoid")
        ]
    )
    return model

model1 = make_model()
model1.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])

epochs = 10
with tf.device('/device:GPU:0'):
    history = model1.fit(
        train_dataset, validation_data=validation_dataset, epochs=epochs
    )

    plot_result(history, "loss")
    plot_result(history, "binary_accuracy")

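Output

# Evaluate
with tf.device('/device:GPU:0'):
    _, binary_acc = model1.evaluate(test_dataset)
    print(f"Binary accuracy on the test set: {round(binary_acc * 100, 2)}%.")

Output

Training and test accuracy are both quite high, at 99.95% and 99.09% respectively.
One of the intermediate epochs reached a higher validation accuracy, so a checkpointing mechanism would help.
Given the class imbalance, it is important to check the accuracy for each label.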
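One caveat when reading these figures: `binary_accuracy` scores every position of the one-hot vector separately, so the many correct zeros inflate it. The sketch below, using made-up predictions for a hypothetical 10-class problem, compares it with the fraction of documents whose top-scoring class is correct.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual 0/1 positions predicted correctly."""
    correct = total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            correct += int((p > threshold) == (t == 1))
            total += 1
    return correct / total

def top1_accuracy(y_true, y_prob):
    """Fraction of rows whose highest-scoring class is the true class."""
    hits = 0
    for true_row, prob_row in zip(y_true, y_prob):
        hits += int(prob_row.index(max(prob_row)) == true_row.index(1))
    return hits / len(y_true)

num_classes = 10  # hypothetical

def one_hot(i):
    return [1 if j == i else 0 for j in range(num_classes)]

def confident(i):
    """Made-up model output that is confident about class i."""
    return [0.9 if j == i else 0.01 for j in range(num_classes)]

# Four documents; the model gets the first two right and the last two wrong.
y_true = [one_hot(0), one_hot(1), one_hot(2), one_hot(3)]
y_prob = [confident(0), confident(1), confident(5), confident(7)]

print(binary_accuracy(y_true, y_prob))  # 0.9
print(top1_accuracy(y_true, y_prob))    # 0.5
```

Half of the documents are misclassified here, yet binary accuracy still reports 90% because each wrong prediction only flips two of the ten positions. For a single-label problem like this one, a metric such as `categorical_accuracy` gives the more direct reading.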

# Compute per-class metrics.

def per_class_accuracy(model):
    # Chain the vectorizer and the trained model so raw strings can be fed in.
    model_for_inference = keras.Sequential([text_vectorizer, model])
    label_vocab = label_lookup.get_vocabulary()

    per_label_results = {}
    for label in label_vocab:
        if label == -1:  # skip the out-of-vocabulary slot
            continue
        per_label_results[label] = {'correct': 0, 'incorrect': 0}

    # Rows are ground-truth labels, columns are predicted labels.
    # Indexing by label value assumes the labels are small non-negative integers.
    num_labels = len(per_label_results) + 1
    accuracy_map = np.zeros((num_labels, num_labels), dtype=np.int64)

    inference_dataset = make_dataset(test_df, is_train=False)

    for text_batch, label_batch in inference_dataset:
        # Make predictions for the whole batch.
        with tf.device('/device:GPU:0'):
            predicted_probabilities = model_for_inference.predict(text_batch)

        for j in range(text_batch.shape[0]):
            label_gt = invert_multi_hot(label_batch[j].numpy())[0]
            # The predicted label is the vocabulary entry with the highest score.
            label_predicted = label_vocab[int(np.argmax(predicted_probabilities[j]))]

            accuracy_map[label_gt][label_predicted] += 1

            if label_predicted == label_gt:
                per_label_results[label_gt]['correct'] += 1
            else:
                per_label_results[label_gt]['incorrect'] += 1

    return per_label_results, accuracy_map

def print_per_class_accuracy(per_label_results):
    for k, v in per_label_results.items():
        correct = v['correct']
        incorrect = v['incorrect']
        total = correct + incorrect
        if total == 0:
            # Avoid dividing by zero for labels absent from the test set.
            print(f"Label = {k}, no test samples")
            continue
        accuracy = correct / total * 100
        print(f"Label = {k}, Test Accuracy = {accuracy:.2f}%")

def plot_accuracy_map(accuracy_map):
    sns.set(font_scale=1.2)
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(accuracy_map, ax=ax)
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    ax.set_title('Confusion Matrix (Test Set)')


per_label_results, accuracy_map = per_class_accuracy(model1)
print_per_class_accuracy(per_label_results)
plot_accuracy_map(accuracy_map)

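Output

The evaluation shows relatively high test accuracy for most labels. Label 6 is lower at 88.89%, suggesting the model may struggle to classify instances of this label correctly. Label 8 is also lower, at 77.78%. Labels 4 and 5 reach 85.71% and 80.00% respectively, which leaves room for improvement in predicting these labels.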
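Per-label figures like these can also be read off a confusion matrix: for each row (true label), divide the diagonal entry by the row total. This per-class accuracy is what is usually called recall. The counts below are invented for illustration and are not the tutorial's results; the function name `per_class_recall` is likewise hypothetical.

```python
def per_class_recall(confusion):
    """confusion[i][j] = number of samples with true class i predicted as j."""
    recalls = {}
    for i, row in enumerate(confusion):
        total = sum(row)
        # Guard against classes with no samples in the evaluated split.
        recalls[i] = row[i] / total if total else None
    return recalls

# Hypothetical 3-class confusion matrix.
confusion = [
    [50, 0, 0],   # class 0: all 50 correct
    [1, 8, 0],    # class 1: 8 of 9 correct
    [0, 1, 4],    # class 2: 4 of 5 correct
]

for label, recall in per_class_recall(confusion).items():
    print(f"Label = {label}, accuracy = {recall * 100:.2f}%")
```

With only a handful of test samples in a class, a single mistake moves the figure a lot: 8 correct out of 9 already gives 88.89%.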

2. Deeper Feed-Forward Network

"""
1. Deeper network
2. Add regularization with dropout
"""
dropout_rate = 0.5

model2 = keras.Sequential(
    [
        layers.Dense(512, activation="relu"),
        layers.Dropout(rate=dropout_rate),
        layers.Dense(256, activation="relu"),
        layers.Dropout(rate=dropout_rate),
        layers.Dense(128, activation="relu"),
        layers.Dense(label_lookup.vocabulary_size(), activation="sigmoid")
    ]
)

model2.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])

epochs = 10
with tf.device('/device:GPU:0'):
    history = model2.fit(
        train_dataset, validation_data=validation_dataset, epochs=epochs
    )

    plot_result(history, "loss")
    plot_result(history, "binary_accuracy")

Output

# Evaluate
with tf.device('/device:GPU:0'):
    _, binary_acc = model2.evaluate(test_dataset)
    print(f"Binary accuracy on the test set: {round(binary_acc * 100, 2)}%.")
    
per_label_results, accuracy_map = per_class_accuracy(model2)
print_per_class_accuracy(per_label_results)
plot_accuracy_map(accuracy_map)

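Output

The evaluation shows strong performance on the test set, with a binary accuracy of 99.05%.

The model predicts most labels with high accuracy. However, labels 8 and 5, which score lower, may still leave room for improvement. Further analysis and refinement of the model may improve its performance across all labels.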
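For reference, the dropout used in the second model can be pictured with a few lines of plain Python. This is a sketch of the common "inverted dropout" formulation, in which surviving activations are scaled by 1 / (1 - rate) during training and inputs pass through unchanged at inference; the function name and the sample activations are illustrative.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(42)
activations = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)   # each entry is either zeroed or doubled
print(infer_out)   # unchanged at inference time
```

Because a different random subset of units is silenced at each training step, the network cannot lean on any single unit, which is the regularizing effect the deeper model is after.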

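Conclusion

Using machine learning for document classification offers an efficient way to organize and retrieve information. Automating the classification process lets organizations streamline operations, support decision-making, and surface value hidden in their document repositories. As the technology matures and open challenges are addressed, there is considerable potential for more sophisticated and accurate document classification systems.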
