Vector Space Model in Machine Learning

March 2, 2025 | 9 min read

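The Vector Space Model (VSM) is a fundamental concept in machine learning, information retrieval, and natural language processing. With VSM, objects such as text, images, or even structured data are represented as vectors in a high-dimensional space. This lets algorithms compare, manipulate, and analyze complex entities in a form they can process easily. Because it converts qualitative data (words or documents) into quantitative vectors, VSM is a cornerstone of search engines, recommender systems, text mining, and related fields.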

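Put simply, a vector space model represents each document or data point as a vector in an n-dimensional space. Each dimension corresponds to a distinct feature, such as a word or term that appears in the documents. A vector's coordinate along a dimension represents the relative importance, or weight, of that feature. For example, if a document contains the word "machine" 10 times, its vector has a relatively high weight along the "machine" dimension.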
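As a minimal sketch of this idea, the snippet below builds raw term-count vectors for two toy documents. The documents, the whitespace tokenizer, and the helper name `term_count_vector` are assumptions made for illustration; real systems typically use a proper tokenizer and a weighting scheme such as TF-IDF.

```python
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "machine learning makes the machine learn",
    "the search engine ranks documents",
]

# Vocabulary: one dimension per unique term, in sorted order
vocab = sorted({term for doc in docs for term in doc.split()})

def term_count_vector(doc, vocab):
    """Return the raw term counts of `doc` along each vocabulary dimension."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [term_count_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

The first document mentions "machine" twice, so its vector has the value 2 along that dimension, while the second document has 0 there.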

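This geometric view of the data makes it easy to measure similarity between objects. Because documents are points in a space, the angle or distance between two vectors can serve as a measure of how similar they are. The most common measure is cosine similarity, which computes the cosine of the angle between two vectors, but Euclidean distance and other distance measures can also be used.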
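For reference, the standard definitions of these two measures for n-dimensional vectors A and B can be written as follows (the symbols A, B, and n are chosen here to match the code later in this article):

```latex
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
             = \frac{\sum_{i=1}^{n} A_i B_i}
                    {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}},
\qquad
d(A, B) = \lVert A - B \rVert = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^{2}}
```

Cosine similarity lies between -1 and 1 and depends only on the directions of the vectors, whereas Euclidean distance also depends on their lengths.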

Applications of the Vector Space Model

The following are some of the most important applications of the vector space model:

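Document retrieval: Perhaps the most important application of VSM is document retrieval, as in search engines. Documents are represented as vectors, and the query is mapped into the same space. Documents are retrieved by comparing the query vector with each document vector using a similarity measure (cosine similarity or another distance measure), and the resulting scores are used to rank the documents by relevance.
Text classification: In text classification, VSM represents each document as a feature vector in which each dimension stands for a term or an n-gram (a contiguous sequence of n words). These feature vectors are then fed to machine learning algorithms such as support vector machines, logistic regression, or neural networks, for example to classify text as spam or not spam, or to distinguish positive from negative sentiment.
Clustering: VSM plays an important role in clustering, an unsupervised learning task. Documents that are similar lie close together in the vector space. Algorithms such as K-means or hierarchical clustering can therefore group documents that are geometrically close, which means that documents with similar content end up in the same cluster.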
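To illustrate the document-retrieval case, the following sketch ranks a few documents against a query by cosine similarity. The vocabulary, the count vectors, and the function name `rank_documents` are hypothetical, and the sketch assumes that no vector is all zeros.

```python
import numpy as np

# Hypothetical term-count vectors over the vocabulary
# ["learning", "machine", "recipe", "vector"]
doc_vectors = np.array([
    [2.0, 3.0, 0.0, 1.0],   # doc 0: mostly about machine learning
    [0.0, 0.0, 4.0, 0.0],   # doc 1: about cooking
    [1.0, 1.0, 0.0, 3.0],   # doc 2: mostly about vectors
])
query = np.array([1.0, 1.0, 0.0, 0.0])  # query: "machine learning"

def rank_documents(query, doc_vectors):
    """Return document indices ordered from most to least similar, plus the scores."""
    # Cosine similarity between the query and every document row
    scores = doc_vectors @ query / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(-scores)  # sort by descending score
    return order, scores

order, scores = rank_documents(query, doc_vectors)
print(order)
print(np.round(scores, 3))
```

The document that shares the most weight with the query terms is ranked first, and the document with no terms in common receives a score of zero.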

Limitations of the Vector Space Model

The vector space model has proven effective in many applications, but it is not without limitations:

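High dimensionality: In text-based applications the vector space is usually high-dimensional, because every unique word or term gets its own dimension. This creates computational challenges, especially with large corpora. Common remedies are dimensionality reduction (such as PCA or LSA) and term selection (such as stop-word removal).
Semantic relationships: A traditional VSM cannot capture semantic relationships between terms. For example, "car" and "automobile" may be used as synonyms in a given text, but a traditional VSM treats them as completely separate dimensions. To overcome this, VSM is often combined with more advanced models such as Word2Vec or GloVe, which can represent semantic similarity between terms.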
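The semantic limitation can be seen in a small bag-of-words sketch. The vocabulary, the example texts, and the helper names `bag_of_words` and `cosine` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical vocabulary; "car" and "automobile" are separate dimensions
vocab = ["automobile", "car", "fast", "red"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary term occurs in `text`."""
    tokens = text.split()
    return np.array([tokens.count(term) for term in vocab], dtype=float)

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = bag_of_words("red car", vocab)
doc_b = bag_of_words("red automobile", vocab)
doc_c = bag_of_words("car", vocab)
doc_d = bag_of_words("automobile", vocab)

# The synonyms share no dimension, so their similarity is zero
print(cosine(doc_c, doc_d))

# Only the shared word "red" contributes to the similarity here
print(cosine(doc_a, doc_b))
```

Although "car" and "automobile" mean the same thing, the count-based representation gives them no overlap at all, which is the gap that learned embeddings aim to close.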

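Code

We will implement a vector space model and use it to find the country that corresponds to a given capital city.

First, we import the required libraries and load the dataset.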

import pickle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# get_vectors is imported from a local utils module (not shown here)
from utils import get_vectors

# Each row holds two (capital, country) pairs separated by spaces
data = pd.read_csv('./data/capitals.txt', delimiter=' ')
data.columns = ['city_one', 'country_one', 'city_two', 'country_two']

# Show the first five rows of the DataFrame
data.head(5)

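The word-embedding subset used below can be built from the full pretrained Google News Word2Vec vectors as follows: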

import nltk
from gensim.models import KeyedVectors

# Load the full pretrained Google News Word2Vec vectors
embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

# Collect every word that appears in the capitals file
with open('./data/capitals.txt', 'r') as f:
    words_set = set(nltk.word_tokenize(f.read()))

# Add a few extra words that are used in the examples below
selected_words = ['king', 'queen', 'oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
words_set.update(selected_words)

def word_embedding_get(embeddings):
    # `word in embeddings` works across gensim versions (`.vocab` was removed in gensim 4)
    embedding_words = {}
    for word in words_set:
        if word in embeddings:
            embedding_words[word] = embeddings[word]
    return embedding_words


# Build the subset and save it where the next step expects to find it
embedding_words = word_embedding_get(embeddings)
print(len(embedding_words))
with open('./data/embedding_words_subset.p', 'wb') as f:
    pickle.dump(embedding_words, f)

embedding_words = pickle.load(open("./data/embedding_words_subset.p", "rb"))
len(embedding_words)  # there should be 243 words that will be used in this assignment

print("dimension: {}".format(embedding_words['Spain'].shape[0]))


Next we need a function that returns the cosine similarity of two word vectors.

def similarity_cosine(A, B):
    '''
    Input:
        A: Word vector in array format
        B: Word vector in array format
    Output:
        cos: A numeric value, the cosine similarity between A and B.
    '''

    # Cosine similarity is the dot product divided by the product of the norms
    dot = np.dot(A, B)
    normA = np.linalg.norm(A)
    normB = np.linalg.norm(B)
    cos = dot / (normA * normB)

    return cos

# feel free to try different words
king = embedding_words['king']
queen = embedding_words['queen']

similarity_cosine(king, queen)


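Next we implement a function that computes the Euclidean distance between two vectors. It is another way to judge similarity: the smaller the distance, the more similar the vectors.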

def get_euclidean_dis(A, B):
    """
    Input:
        A: Word vector in array format
        B: Word vector in array format
    Output:
        d: A numeric value, the Euclidean distance between A and B.
    """

    # Euclidean distance is the norm of the difference vector
    d = np.linalg.norm(A - B)

    return d

# Test the function
get_euclidean_dis(king, queen)


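We now use the functions above to measure similarity between vectors and apply them to find the country for a given capital. The function below takes three words and the word-embedding dictionary as arguments.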

def country_get(city_one, country_one, city_two, embeddings, similarity_cosine=similarity_cosine):
    """
    Input:
        city_one: A string, the capital city of country_one.
        country_one: A string, the country whose capital is city_one.
        city_two: A string, the capital city of country_two.
        embeddings: A dictionary where each key is a word and each value is that word's embedding.
        similarity_cosine: Function used to compare two vectors (defaults to similarity_cosine above).
    Output:
        country: A tuple containing the most probable country and its similarity score.
    """


    # store the city_one, country 1, and city 2 in a set called group
    group = set((city_one,country_one,city_two))

    # get embeddings of city 1
    city_one_emb = embeddings[city_one]

    # get embedding of country 1
    country_one_emb = embeddings[country_one]

    # get embedding of city 2
    city_two_emb = embeddings[city_two]

    # Obtain the embedding for country 2, which is derived from the embeddings of country 1, city 1, and city 2.
    # Remember: King - Man + Woman = Queen
    vec = country_one_emb - city_one_emb + city_two_emb

    # Start at -1, the lowest possible cosine similarity, so that any candidate can improve on it.
    similarity = -1

    # Default result, returned only if no candidate word is found.
    country = ('', similarity)

    # Loop through each word in the embedding dictionary.
    for word in embeddings.keys():

        # First, ensure that the word is not already part of the 'group.'
        if word not in group:

            # Extract the word embedding
            word_emb = embeddings[word]

            # Calculate the similarity between the vector representation of country 2 and the associated word in the embeddings dictionary according to cosine.

            cur_similarity = similarity_cosine(vec,word_emb)

            # If the cosine similarity is greater than the previously highest similarity...
            if cur_similarity > similarity:

                # Update the similarity to reflect the improved value.
                similarity = cur_similarity
                
                # Save the country as a tuple that includes the word and its similarity.
                country = (word,similarity)


    return country

#When testing your function, consider enhancing its robustness by returning the five most similar words.
country_get('Athens', 'Greece', 'Cairo', embedding_words)
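The comment above suggests returning the five most similar words. A minimal sketch of that variation follows. The function name `top_k_countries`, the parameter `k`, and the tiny two-dimensional embedding dictionary are hypothetical and only serve to make the example self-contained; they are not real word embeddings.

```python
import numpy as np

def top_k_countries(city_one, country_one, city_two, embeddings, k=5):
    """Return the k words most similar to country_one - city_one + city_two."""
    group = {city_one, country_one, city_two}
    vec = embeddings[country_one] - embeddings[city_one] + embeddings[city_two]

    scored = []
    for word, word_emb in embeddings.items():
        if word in group:
            continue  # skip the three input words
        cos = np.dot(vec, word_emb) / (np.linalg.norm(vec) * np.linalg.norm(word_emb))
        scored.append((word, float(cos)))

    # Highest cosine similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 2-D vectors, not real embeddings
toy_embeddings = {
    "Athens": np.array([1.0, 1.0]),
    "Greece": np.array([1.0, 3.0]),
    "Cairo": np.array([4.0, 1.0]),
    "Egypt": np.array([4.0, 3.0]),
    "Paris": np.array([-2.0, 1.0]),
}

results = top_k_countries("Athens", "Greece", "Cairo", toy_embeddings, k=2)
print(results)
```

Returning a ranked list rather than a single word makes it easier to see how close the runner-up candidates are when a prediction turns out to be wrong.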


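Next we implement a function that computes the model's accuracy on the dataset. It iterates over every row, extracts the words, and passes them to the country_get function above.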

def accu_get(embedding_words, data, country_get=country_get):
    '''
    Input:
        embedding_words: A dictionary where each key is a word and each value is its embedding.
        data: A pandas DataFrame in which each row holds two (capital, country) pairs.
        country_get: Function that predicts country_two (defaults to country_get above).
    Output:
        accur: Fraction of rows for which the predicted country matches country_two.
    '''

    # Start the count of correct predictions at zero
    correct_num = 0

    # Iterate over the rows of the DataFrame
    for j, ro in data.iterrows():

        # get city_one
        city_one = ro['city_one']

        # get country_one
        country_one = ro['country_one']

        # get city_two
        city_two = ro['city_two']

        # get country_two
        country_two = ro['country_two']

        # utilize `country_get` to determine the predicted `country_two`.
        predicted_country_two, _ = country_get(city_one,country_one,city_two,embeddings=embedding_words)

        # if the predicted `country_two` matches the actual `country_two`...
        if predicted_country_two == country_two:
            # increment the number of correct by 1
            correct_num += 1

    # Total number of rows in the dataset
    m = len(data)

    # Accuracy is the number of correct predictions divided by m
    accur = correct_num / m


    return accur

accu = accu_get(embedding_words, data)
print(f"Accuracy is {accu:.2f}")


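Plotting the Vectors

We apply principal component analysis (PCA) to reduce the word vectors from 300 dimensions to 2, so that the relationships between words in the embedding space can be plotted.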

def compute_pca(X, components_n=2):
    """
    Input:
        X: 2D NumPy array of size (m, n), where each row is a word vector.
        components_n: The number of principal components to keep.
    Output:
        reducedX: Array of size (m, components_n), the data projected onto the top components.
    """

    # Center the data by subtracting the mean of each column (feature).
    X_demeaned = X - np.mean(X, axis=0)

    # Compute the covariance matrix of the features.
    matrix_cov = np.cov(X_demeaned, rowvar=False)

    # Compute the eigenvalues and eigenvectors of the symmetric covariance matrix.
    vals_eigen, vecs_eigen = np.linalg.eigh(matrix_cov)

    # Get the indices that sort the eigenvalues in ascending order.
    sorted_idx = np.argsort(vals_eigen)

    # Reverse them so that the eigenvalues run from largest to smallest.
    sorted_decreasing_idx = sorted_idx[::-1]

    # Sort the eigenvalues by sorted_decreasing_idx.
    vals_eigen_sorted = vals_eigen[sorted_decreasing_idx]

    # Sort the eigenvectors (stored as columns) in the same order.
    vecs_eigen_sorted = vecs_eigen[:, sorted_decreasing_idx]

    # Keep the first components_n eigenvectors.
    vecs_eigen_subset = vecs_eigen_sorted[:, 0:components_n]

    # Project the centered data onto the selected eigenvectors:
    # (components_n, n) times (n, m) gives (components_n, m), which is transposed to (m, components_n).
    reducedX = np.dot(vecs_eigen_subset.transpose(), X_demeaned.transpose()).transpose()



    return reducedX

# Testing the function
np.random.seed(1)
X = np.random.rand(3, 10)
reducedX = compute_pca(X, components_n=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(reducedX)
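One way to judge how much information a low-dimensional projection keeps is the fraction of total variance carried by the retained components. The sketch below computes it with the same centering and covariance steps as the PCA code above; the function name `explained_variance_ratio` is hypothetical and is not part of the original code.

```python
import numpy as np

def explained_variance_ratio(X, components_n=2):
    """Fraction of total variance captured by the top `components_n` principal components."""
    X_demeaned = X - np.mean(X, axis=0)
    matrix_cov = np.cov(X_demeaned, rowvar=False)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    vals_eigen = np.linalg.eigvalsh(matrix_cov)
    vals_sorted = vals_eigen[::-1]  # largest first
    return float(np.sum(vals_sorted[:components_n]) / np.sum(vals_sorted))

# Same random test matrix as in the PCA test above
np.random.seed(1)
X = np.random.rand(3, 10)
ratio = explained_variance_ratio(X, components_n=2)
print(ratio)
```

With only three samples, the centered data spans at most two dimensions, so two components capture essentially all of the variance in this test matrix. For real 300-dimensional word vectors the ratio for two components would typically be much lower.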


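Now we use our PCA function to plot a selection of words.

As the plot shows, words with similar meanings tend to cluster together, that is, their vectors lie close to one another. In some cases words with opposite meanings also end up close together: they usually occupy the same positions in sentences and share the same part of speech, so they receive similar weights when the word vectors are learned. How word vectors are learned is a separate topic; for now, we simply use them.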

# Words to plot; all of them are in the embedding subset built earlier
value_words = ['oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']

# Given the embedding dictionary and a list of words, get_vectors returns a matrix with one embedding per row.
X = get_vectors(embedding_words, value_words)

print("The embedding matrix has shape " + str(X.shape))


# Reduce the embeddings to two dimensions and plot each word
outcomes_here = compute_pca(X, 2)
plt.scatter(outcomes_here[:, 0], outcomes_here[:, 1])
for j, word in enumerate(value_words):
    plt.annotate(word, xy=(outcomes_here[j, 0] - 0.05, outcomes_here[j, 1] + 0.1))

plt.show()


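The words "gas", "oil", and "petroleum" appear to be related, since their vectors lie very close together. Likewise, "sad", "joyful", and "happy" are all adjectives that express emotion, and they are also close to one another.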

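Conclusion

The vector space model is a foundation for representing complex data in a high-dimensional space. Because VSM lets real-world text be represented as vectors, it has become an important tool in machine learning, information retrieval, and natural language processing. It is widely applied in document retrieval, text classification, and clustering, as well as in a variety of other tasks.

Although VSM is highly effective, it has limitations, including the problems caused by high dimensionality and its inability to capture semantic relationships between terms. To address these, VSM is often combined with more advanced techniques such as dimensionality reduction and semantic vector representations like Word2Vec and GloVe.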
