How to Rank Correlation in Machine Learning?

June 18, 2025 | 10 min read

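Correlation analysis gives a direct, quantitative measure of the nature and strength of a relationship between variables. Rank correlation, in particular, measures how well that relationship can be described by a monotonic function of the variables. It is especially useful for ordinal data and for any dataset that does not meet the normality assumption commonly associated with Pearson's coefficient.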

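Rank Correlation

Rank correlation measures the degree of association between the orderings of two variables. The two most common ways to compute it are Spearman's rank correlation coefficient and Kendall's tau coefficient. Spearman's rank correlation describes how well the relationship between two variables can be captured by a monotonic function, whether or not that relationship is linear. It is computed from the ranks of the values: the differences between the ranks of paired observations yield the coefficient. Kendall's tau, on the other hand, measures the strength of association by examining the concordance and discordance of all pairs of observations. Both coefficients lie in the range -1 to 1. A value of -1 indicates a perfect negative association, a value of 1 indicates a perfect positive association, and a value of 0 indicates no monotonic association.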
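As a minimal sketch of the point above, the following compares the three coefficients on a relationship that is monotonic but far from linear. The data (`x`, `y`) are made up for illustration and are not part of the original tutorial; the functions come from `scipy.stats`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical data: y is a strictly increasing but strongly nonlinear function of x.
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

# Each function returns (statistic, p-value); only the statistic is used here.
pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)
kendall_tau, _ = kendalltau(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because the ordering of `y` matches the ordering of `x` exactly, both rank coefficients reach 1, while Pearson's coefficient stays clearly below 1 since the relationship is not a straight line.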

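Uses of Rank Correlation

Rank correlation is advantageous in many scenarios, particularly with nonparametric data or when the assumptions of traditional correlation measures are not met. For example, when the data are not normally distributed or contain outliers, rank correlation gives a more reliable indication of the relationship. It is also well suited to ordinal data, where the order of the values matters more than the values themselves. This flexibility lets researchers and practitioners analyze relationships in a much wider range of datasets.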
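The effect of outliers can be sketched as follows. The data are synthetic (two independent random variables plus one deliberately extreme point) and the seed is an arbitrary choice for reproducibility, so this is an illustration under stated assumptions rather than a general proof.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: 50 points with no real relationship between x and y ...
x = rng.normal(size=50)
y = rng.normal(size=50)

# ... plus a single extreme outlier appended to both variables.
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)

pearson_with_outlier, _ = pearsonr(x_out, y_out)
spearman_with_outlier, _ = spearmanr(x_out, y_out)

print(f"Pearson with outlier : {pearson_with_outlier:.3f}")
print(f"Spearman with outlier: {spearman_with_outlier:.3f}")
```

The single outlier dominates the sums in Pearson's formula and pushes it close to 1, whereas in the rank-based coefficient the outlier is just one more rank, so Spearman's value stays small.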

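Several correlation methods are used in machine learning. Some of them are listed here:

01. Pearson Correlation

The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It captures not only how strongly the variables are related but also whether they move in the same direction (positive) or in opposite directions (negative).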

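There are two types of correlation:

Positive correlation: a relationship in which an increase in one variable comes with an increase in the other, and a decrease with a decrease. In Pearson's sense, the two variables trend together along a straight line.
Negative correlation: when one variable rises, the other falls; they move in opposite directions.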
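The two cases can be illustrated with a small sketch. The arrays below are invented for the example; `np.corrcoef` returns the correlation matrix, and the off-diagonal entry is Pearson's r.

```python
import numpy as np

# Hypothetical data: two exact linear relationships with opposite slopes.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0     # rises as x rises
y_neg = -3.0 * x + 10.0   # falls as x rises

# Entry [0, 1] of the 2x2 correlation matrix is the Pearson coefficient.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print(f"positive case: r = {r_pos:.2f}")
print(f"negative case: r = {r_neg:.2f}")
```

Since both relationships are exactly linear, the coefficients sit at the extremes of the range; real data would give values strictly between -1 and 1.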

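We will select our features with the help of a correlation matrix.

If two or more independent features are highly correlated, they can be treated as redundant and all but one can be removed. When independent variables are strongly correlated, a change in one goes along with a change in another, which can cause large fluctuations in the model's output: small changes in the data or the model make the results unstable. Because strong positive and strong negative correlations both have this effect, we compare the absolute value of the correlation against the threshold.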
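The idea of dropping redundant features can be sketched on a toy table before moving to real data. Everything here is hypothetical: the column names, the synthetic values, and the helper `drop_correlated`, which is a compact variant of the same idea and not the tutorial's own function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical features: "a_copy" is almost a rescaled duplicate of "a".
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),  # unrelated to the other two
})

def drop_correlated(df, threshold):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

reduced = drop_correlated(toy, threshold=0.9)
print(reduced.columns.tolist())
```

Only the near-duplicate column is removed; the first column of the correlated pair and the unrelated column are kept.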

Now we will use breast cancer data as an example to illustrate Pearson correlation.

 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")


from sklearn.model_selection import train_test_split
def split(df,label):
    tr_X, te_X, tr_Y, te_Y = train_test_split(df, label, test_size=0.25, random_state=42)
    return tr_X, te_X, tr_Y, te_Y


def correlation(dataset, cor):
    df = dataset.copy()
    col_corr = set()  # For storing unique value
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > cor: # absolute values to handle positive and negative correlations
                colname = corr_matrix.columns[i]  
                col_corr.add(colname)
    df.drop(col_corr,axis = 1,inplace = True)
    return df

from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

classifiers = ['LinearSVM', 'RadialSVM', 
               'Logistic',  'RandomForest', 
               'AdaBoost',  'DecisionTree', 
               'KNeighbors','GradientBoosting']

models = [svm.SVC(kernel='linear'),
          svm.SVC(kernel='rbf'),
          LogisticRegression(max_iter = 1000),
          RandomForestClassifier(n_estimators=200, random_state=0),
          AdaBoostClassifier(random_state = 0),
          DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(),
          GradientBoostingClassifier(random_state=0)]


def score_acc(df,label):
    Score = pd.DataFrame({"Classifier":classifiers})
    j = 0
    acc = []
    train_X,test_X,train_Y,test_Y = split(df,label)
    for i in models:
        model = i
        model.fit(train_X,train_Y)
        predictions = model.predict(test_X)
        acc.append(accuracy_score(test_Y,predictions))
        j = j+1     
    Score["Accuracy"] = acc
    Score.sort_values(by="Accuracy", ascending=False,inplace = True)
    Score.reset_index(drop=True, inplace=True)
    return Score


def score_cor_acc(df,label,cor_list):
    Score = pd.DataFrame({"Classifier":classifiers})
    for k in range(len(cor_list)):
        df2 = correlation(df, cor_list[k])
        train_X,test_X,train_Y,test_Y = split(df2,label)
        j = 0
        acc = []
        for i in models:
            model = i
            model.fit(train_X,train_Y)
            predictions = model.predict(test_X)
            acc.append(accuracy_score(test_Y,predictions))
            j = j+1  
        feat = str(cor_list[k])
        Score[feat] = acc
    return Score

        
def plot2(df,l1,l2,p1,p2,c = "b"):
    feat = df.columns.tolist()
    feat = feat[1:]
    plt.figure(figsize = (16, 18))
    for j in range(df.shape[0]):
        # Values for row j: every column after the first (name) column
        value = df.iloc[j, 1:].astype(float).tolist()
        plt.subplot(4, 4, j + 1)
        ax = sns.pointplot(x=feat, y=value, color=c)
        plt.text(p1, p2, df.iloc[j, 0])  # label the subplot with the row's name
        plt.xticks(rotation=90)
        ax.set(ylim=(l1, l2))
        

def highlight_max(data, color='aquamarine'):
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else: 
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)

# Description of the functions
# split(): Splits the dataset into a training set and a test set, which is essential for evaluating model performance. It samples rows at random so that both sets are representative; here 25% of the rows are held out for testing (test_size=0.25).

# correlation(): Computes the pairwise correlation between the features of a dataset and, for every pair whose absolute correlation exceeds the given threshold, drops one of the two features. This reduces multicollinearity, which helps model stability and interpretability. It returns a DataFrame with the highly correlated features removed.

# score_acc(): Measures how well a group of classifiers performs on the dataset by fitting each algorithm on the training set and computing its accuracy on the test set.

# score_cor_acc(): Similar to score_acc(), but for each threshold in cor_list it first calls correlation() to drop highly correlated features, and then measures each classifier's accuracy on the reduced feature set.

# plot2(): Plots the results, one subplot per row of the score table, showing the accuracy at each correlation threshold. It uses Matplotlib and Seaborn.

Now let us look at the breast cancer data.

We first inspect the dataset.

 
bc_data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
bc_label = bc_data["diagnosis"]
bc_label = np.where(bc_label == 'M',1,0)
bc_data.drop(["id","diagnosis","Unnamed: 32"],axis = 1,inplace = True)

print("Breast Cancer dataset:\n",bc_data.shape[0],"Records\n",bc_data.shape[1],"Features")   

Output


 
display(bc_data.head())
print("We can see that all the features in the data are continuous.")   

Output

We can see that all the features in the data are continuous.

Now let us check its heatmap.

 
plt.figure(figsize=(18,18))
corone = bc_data.corr()
sns.heatmap(corone, annot=True, cmap="viridis",annot_kws={"size":8})
plt.show()   

Output

Next, we check the accuracy.

 
score_one = score_acc(bc_data,bc_label)
score_one   

Output

 
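bc_corrate = [0.6,0.7,0.8,0.9,0.95,0.99]
# Keep `classifiers` in its original order so the names stay aligned with `models`;
# reassigning it from the sorted score_one table would mislabel the rows.
bc_score = score_cor_acc(bc_data,bc_label,bc_corrate)
bc_score.style.apply(highlight_max, subset = bc_score.columns[1:], axis=None)   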

Output

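The results show that, when all features are used, the most accurate classifier is the random forest, with an accuracy of 0.972. After filtering with the correlation() function, the most accurate classifiers are the linear SVM, at correlation thresholds between 0.9 and 0.99, and the decision tree, at a threshold of 0.9. Correlation-based feature selection therefore brings no major improvement here, which suggests that the original feature set already serves these models well.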

Let us visualize the results.

Output

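02. Spearman Correlation

Spearman's correlation coefficient is a general-purpose, nonparametric measure that can be applied to any dataset whose values can be ranked. Like other correlation coefficients, it has three key reference values. A value of -1 indicates a perfect negative monotonic relationship: as one variable increases, the other always decreases. A value of 1 indicates a perfect positive monotonic relationship: an increase in one variable is always accompanied by an increase in the other. A value near 0 indicates no monotonic relationship between the variables. This makes Spearman's coefficient a versatile and informative way to measure association.

Now we will examine Spearman correlation using Montreal bike-share data as an example.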

 
from scipy.stats import rankdata
import pandas as pd
import numpy as np


def pearson_relation(x, y):
    bar_x, bar_y = np.mean(x), np.mean(y)
    est_cov = np.sum((x - bar_x) * (y - bar_y))
    x_est_std = np.sqrt(np.sum((x - bar_x)**2))
    y_est_std = np.sqrt(np.sum((y - bar_y)**2))
    return est_cov / (x_est_std * y_est_std)

def spearmans_correlation_coefficient(X, Y):
    Xr, Yr = rankdata(X), rankdata(Y)
    return pearson_relation(Xr, Yr)   
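As a quick sanity check on the rank-then-Pearson approach, the sketch below compares it with `scipy.stats.spearmanr`. The helper name `spearman_from_ranks` and the small arrays (which include a tie) are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_from_ranks(x, y):
    # Rank each variable (ties receive the average rank), then apply Pearson's formula.
    xr, yr = rankdata(x), rankdata(y)
    return np.corrcoef(xr, yr)[0, 1]

# Hypothetical data; note the tied value 20 in x.
x = np.array([10, 20, 20, 35, 50, 80])
y = np.array([1.0, 4.0, 3.5, 9.0, 8.0, 30.0])

manual = spearman_from_ranks(x, y)
reference, _ = spearmanr(x, y)

print(f"manual    = {manual:.6f}")
print(f"reference = {reference:.6f}")
```

The two values agree because Spearman's coefficient is, by definition, Pearson's coefficient applied to the ranks.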

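Spearman correlation is particularly well suited to the question we care about in this dataset: the relationship between the total time spent on trips and the number of trips. Ideally, when bike stations are evenly distributed so that commuters travel roughly the same distance on average, we expect a monotonic relationship between the number of trips and the total trip time. In other words, as the number of trips increases, the total time should increase in a consistent way, giving a clear and predictable pattern in the data. Spearman correlation is an effective way to assess this underlying monotonic relationship.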

 
import numpy as np
import pandas as pd

# Load the trip records; the variable is named `rides` to match the code below.
rides = pd.read_csv("../input/OD_2017.csv", index_col=0)
rides.head()

Output

 
to_df = (
    rides
        .loc[:, ['end_station_code', 'duration_sec']]
        .groupby('end_station_code')
        .sum()
        .assign(rides_n=rides['end_station_code'].value_counts())
)

df_from = (
    rides
        .loc[:, ['start_station_code', 'duration_sec']]
        .groupby('start_station_code')
        .sum()
        .assign(rides_n=rides['start_station_code'].value_counts())
)

Now, let us plot these values.

 
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

sns.jointplot(x='rides_n', y='duration_sec', data=to_df)   

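Output

However, because the dataset is so large, the scatter plot on its own is somewhat misleading: the histograms along the two axes show that most of the points are crowded into the lowest few bins. If we take the ranks and plot those instead, the picture becomes much tighter.

Output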

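The Spearman coefficient is computed directly as the correlation between the ranks assigned to the data points, which captures the monotonic relationship between the variables. Because it relies on ranks rather than raw values, it describes how consistently one variable increases or decreases as the other increases. The result shown here is the degree to which the ranks are related, and therefore whether the variables in the dataset are positively associated.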
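The rank-based computation can be sketched with pandas alone. The table below is synthetic: it reuses the column names `rides_n` and `duration_sec` from the code above, but the values are randomly generated heavy-tailed data, not the Montreal dataset, and the name `stations` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical station-level table with heavy-tailed ride counts.
rides_n = rng.pareto(a=2.0, size=500) * 100.0
duration_sec = rides_n * 600.0 * rng.lognormal(mean=0.0, sigma=0.3, size=500)
stations = pd.DataFrame({"rides_n": rides_n, "duration_sec": duration_sec})

# Step 1: replace every value by its rank within its column.
ranks = stations.rank()

# Step 2: Pearson correlation of the ranks is the Spearman coefficient.
rho_via_ranks = ranks["rides_n"].corr(ranks["duration_sec"])

# pandas also offers the same computation directly.
rho_builtin = stations["rides_n"].corr(stations["duration_sec"], method="spearman")

print(f"via ranks: {rho_via_ranks:.4f}")
print(f"built-in : {rho_builtin:.4f}")
```

Ranking spreads the crowded low values evenly over the range, which is why a plot of the ranks is easier to read than a plot of the raw values.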

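Output

The data show that the number of rides ending at a station is an almost ideal predictor of the total time commuters spent riding to that station. Ride frequency therefore plays an important role in explaining commuter behavior: as the number of rides increases, the total time spent riding increases as well.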

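03. Kendall's Tau Correlation

Kendall's Tau correlation coefficient, usually just called tau, measures the strength and direction of the association between two variables based on the ordering of the data rather than the actual values. Its advantage over Pearson correlation is that it does not assume a linear relationship between continuously distributed variables. Instead, it looks at the concordance and discordance of pairs of observations. A pair of observations is concordant if the two variables rank them in the same order, that is, the observation with the larger value of one variable also has the larger value of the other. If the larger value of one variable goes with the smaller value of the other, the pair is discordant.

The coefficient is computed as tau = (CP - DP) / [num(num - 1) / 2], where CP is the number of concordant pairs, DP is the number of discordant pairs, and num is the total number of observations. The denominator is the total number of pairs.

The coefficient lies between -1 and 1. A value of 1 indicates a perfect positive association, a value of -1 indicates a perfect negative association, and a value of 0 indicates no association. Kendall's Tau is particularly useful for small samples and for datasets with many tied ranks, where tie-adjusted variants such as tau-b are normally used in place of the basic formula above. It is a good alternative in a number of statistics and data analysis applications.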
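The pair-counting formula can be sketched directly. The function name `kendall_tau_a` and the five-point sample are hypothetical; the comparison with `scipy.stats.kendalltau` is valid here because the sample has no ties, in which case SciPy's default tau-b coincides with the basic formula.

```python
from itertools import combinations
from scipy.stats import kendalltau

def kendall_tau_a(x, y):
    """tau = (CP - DP) / [num(num - 1) / 2], counting every pair of observations once."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables order the pair the same way
        elif direction < 0:
            discordant += 1      # the variables order the pair in opposite ways
        # direction == 0 means a tie; it counts toward neither total
    num = len(x)
    return (concordant - discordant) / (num * (num - 1) / 2)

# Hypothetical sample without ties.
x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 5]

manual_tau = kendall_tau_a(x, y)
scipy_tau, _ = kendalltau(x, y)
print(manual_tau, scipy_tau)
```

For this sample, 7 of the 10 pairs are concordant and 3 are discordant, so tau = (7 - 3) / 10 = 0.4. The double loop is quadratic in the number of observations, which is fine for a sketch but slow for large datasets.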

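04. NDCG

NDCG (Normalized Discounted Cumulative Gain) is a score for evaluating the performance of information retrieval systems and recommendation algorithms. It evaluates a ranked list by measuring how closely the system's output matches user preferences or the ground truth. The metric is especially valuable because it takes into account where the relevant items appear in the list, reflecting the intuition that items near the top matter more to user satisfaction than items further down. NDCG is computed in two main steps. The first is to compute the Discounted Cumulative Gain (DCG), which sums the relevance scores of the retrieved items, each discounted logarithmically by its rank position so that lower-ranked items contribute less. The second is to normalize the DCG by dividing it by the ideal DCG (IDCG), defined as the maximum possible DCG for the same set of items, which yields a value between 0 and 1. A value of 1 indicates a perfect ranking, while a score close to 0 indicates a poor one. NDCG is particularly useful when items have graded relevance and the ranking order strongly affects the user experience, as in search engines, recommender systems, and online advertising.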
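The two steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function names `dcg` and `ndcg` and the relevance list are hypothetical, the gain is the raw relevance score (some formulations use 2^relevance - 1 instead), and the discount is log2(position + 1).

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: relevance at position i is divided by log2(i + 1)."""
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, relevances.size + 1)
    return float(np.sum(relevances / np.log2(positions + 1)))

def ndcg(relevances):
    """DCG of the given order divided by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of six results, in the order the system ranked them.
ranked_relevance = [3, 2, 3, 0, 1, 2]

score = ndcg(ranked_relevance)
perfect = ndcg(sorted(ranked_relevance, reverse=True))

print(f"NDCG of the system ranking: {score:.4f}")
print(f"NDCG of the ideal ranking : {perfect:.4f}")
```

The system ranking scores a little below 1 because a highly relevant item sits in third place and an irrelevant one in fourth; reordering the same items by descending relevance gives exactly 1.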
