Credit Card Fraud Detection Using Machine Learning

March 17, 2025 | 7 min read

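Credit card fraud occurs when someone uses another person's credit card to make financial transactions without the cardholder's knowledge. Credit cards were created to increase consumers' purchasing power: they are an agreement with a bank that lets the cardholder spend money provided by the bank and repay it later, with interest charged if the balance is not repaid on time.

With the rise of e-commerce and the boom in OTT platforms during the COVID-19 pandemic, the use of credit cards and other payment methods has increased sharply. Credit card fraud has grown along with that usage. These thefts cost the global economy more than $24 billion every year. Tackling the problem has therefore become essential, and many companies have emerged in this $30 billion market. A problem growing this fast calls for automated models, and machine learning is key.

We will now try to classify credit card transactions as fraudulent or genuine, and deal with an imbalanced dataset along the way.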

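Dataset Attributes

V1 - V28: numerical features resulting from a PCA transformation.
Time: seconds elapsed between each transaction and the first transaction in the dataset.
Amount: transaction amount.
Class: fraud or not fraud (1 or 0).

Code

Importing Libraries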

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.options.display.float_format = '{:.2f}'.format

Reading the Dataset

data = pd.read_csv('../input/creditcardfraud/creditcard.csv')
data.head()


fraud = data[data['Class'] == 1].describe().T
nofraud = data[data['Class'] == 0].describe().T

colors = ['#FFD700','#3B3B3C']

fig,ax = plt.subplots(nrows = 2,ncols = 2,figsize = (5,15))
plt.subplot(2,2,1)
sns.heatmap(fraud[['mean']][:15],annot = True,cmap = colors,linewidths = 0.5,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('Fraud Samples : Part 1');

plt.subplot(2,2,2)
sns.heatmap(fraud[['mean']][15:30],annot = True,cmap = colors,linewidths = 0.5,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('Fraud Samples : Part 2');

plt.subplot(2,2,3)
sns.heatmap(nofraud[['mean']][:15],annot = True,cmap = colors,linewidths = 0.5,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('No Fraud Samples : Part 1');

plt.subplot(2,2,4)
sns.heatmap(nofraud[['mean']][15:30],annot = True,cmap = colors,linewidths = 0.5,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('No Fraud Samples : Part 2');

fig.tight_layout(w_pad = 2)

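Mean feature values for fraud and non-fraud cases:
In non-fraud cases, the mean values of V1 - V28 are close to zero, and the mean Amount, 88.29, is lower than the mean Amount of the fraud cases, 122.21.
The mean Time value is higher for non-fraud transactions than for fraud transactions.
These differences may serve as clues for identifying fraudulent transactions.

Data Visualization

We will now visualize the data.

Target Variable Visualization (Class)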

# Percentage of fraud and non-fraud transactions
fraud = len(data[data['Class'] == 1]) / len(data) * 100
nofraud = len(data[data['Class'] == 0]) / len(data) * 100
fraud_percentage = [nofraud,fraud]

fig,ax = plt.subplots(nrows = 1,ncols = 2,figsize = (20,5))
plt.subplot(1,2,1)
# Labels must follow the order of fraud_percentage: non-fraud first, then fraud
plt.pie(fraud_percentage,labels = ['No Fraud','Fraud'],autopct='%1.1f%%',startangle = 90,colors = colors,
       wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})

plt.subplot(1,2,2)
# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
ax = sns.countplot(x = 'Class',data = data,edgecolor = 'black',palette = colors)
# Write the row count above each bar
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['No Fraud','Fraud'])
plt.title('Number of Fraud Cases');

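The data is clearly imbalanced: the vast majority of transactions are not fraudulent.
With such a skewed distribution, a classification model will be biased toward predicting the majority class, No Fraud.
Balancing the data is therefore an important step in building a robust model.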
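
To make the size of the imbalance concrete, here is a small sketch that uses the class counts of this dataset (284315 non-fraud rows and 492 fraud rows, the same figures used in the balancing calculation). The "always predict No Fraud" baseline is a hypothetical model introduced only for illustration.

```python
# Class counts of the dataset as used in the balancing calculation.
n_no_fraud = 284315
n_fraud = 492
n_total = n_no_fraud + n_fraud

# Share of fraud rows in the full dataset.
fraud_share = n_fraud / n_total

# A hypothetical "model" that always predicts the majority class (0)
# is right on every non-fraud row and wrong on every fraud row.
baseline_accuracy = n_no_fraud / n_total
baseline_fraud_recall = 0 / n_fraud  # it never flags a single fraud

print(f"Fraud share       : {fraud_share:.4%}")
print(f"Baseline accuracy : {baseline_accuracy:.4%}")
print(f"Baseline recall   : {baseline_fraud_recall:.0%}")
```

A baseline that never detects fraud still scores very high accuracy on this data, which is why the class balance has to be addressed before the models are compared.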

Feature Selection

We need to select a subset of features from the dataset.

ANOVA Test

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

features = data.loc[:,:'Amount']
target = data.loc[:,'Class']

best_features = SelectKBest(score_func = f_classif,k = 'all')
fit = best_features.fit(features,target)

featureScores = pd.DataFrame(data = fit.scores_,index = list(features.columns),columns = ['ANOVA Score']) 
featureScores = featureScores.sort_values(ascending = False,by = 'ANOVA Score')

fig,ax = plt.subplots(nrows = 1,ncols = 2,figsize = (5,10))

plt.subplot(1,2,1)
sns.heatmap(featureScores.iloc[:15,:],annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',cbar = False, fmt = '.2f')
plt.title('ANOVA Score : Part 1')

plt.subplot(1,2,2)
sns.heatmap(featureScores.iloc[15:30],annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',cbar = False, fmt = '.2f')
plt.title('ANOVA Score : Part 2')

fig.tight_layout(w_pad = 2)

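The higher the ANOVA score, the more important the feature is with respect to the target variable.
Based on the plot above, we will discard features with an ANOVA score below 50.
In this case, we will build two models: one on features chosen from the correlation plot and one on features chosen by ANOVA score.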
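
As a rough illustration of what the ANOVA score measures, the sketch below computes the one-way ANOVA F statistic for a single feature by hand, which is the quantity `f_classif` reports per feature. The helper name `anova_f_score` and the toy values are assumptions made for this example and are not taken from the dataset.

```python
from statistics import mean

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    grand_mean = mean(values)
    n_samples = len(values)
    n_groups = len(groups)

    # Between-group sum of squares: how far each class mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-group sum of squares: spread of samples around their own class mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_samples - n_groups)
    return ms_between / ms_within

# Hypothetical toy data: four non-fraud rows followed by four fraud rows.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
separating = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]   # class means far apart
overlapping = [1.0, 3.0, 1.2, 3.1, 0.9, 3.2, 1.1, 2.9]  # class means nearly equal

print(anova_f_score(separating, labels))
print(anova_f_score(overlapping, labels))
```

A feature whose class means are far apart relative to its within-class spread gets a large score, while a feature whose values overlap across classes scores near zero. That is the intuition behind discarding low-scoring features.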

#  Dataset for Model based on Correlation Plot
df1 = data[['V3','V4','V7','V10','V11','V12','V14','V16','V17','Class']].copy(deep = True)
df1.head()
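
The nine features above are attributed to a correlation plot that this section does not reproduce, and the exact selection rule is not stated. As a hedged sketch of one plausible approach, the code below ranks features by the absolute Pearson correlation with the class label. The helper `pearson_corr`, the column names `feat_a` and `feat_b`, and all values are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy columns (not taken from the real dataset).
toy = {
    "feat_a": [0.1, -0.2, 0.0, 0.3, -4.0, -5.5, -3.8],  # drops sharply on fraud rows
    "feat_b": [0.5, -0.4, 0.2, -0.1, 0.3, -0.3, 0.1],   # unrelated to the label
}
target = [0, 0, 0, 0, 1, 1, 1]

# Correlation of every feature with the class label, then rank by magnitude.
corr_with_class = {name: pearson_corr(col, target) for name, col in toy.items()}
ranked = sorted(corr_with_class, key=lambda name: abs(corr_with_class[name]), reverse=True)

print(corr_with_class)
print(ranked)
```

Ranking by absolute value matters here because a strongly negative correlation is as informative as a strongly positive one.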


# Dataset for Model based on ANOVA Score
df2 = data.copy(deep = True)
df2.drop(columns = list(featureScores.index[20:]),inplace = True)
df2.head()

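Data Balancing

There are two options for dealing with imbalanced data:

Undersampling: reduce the number of majority-class samples of the target variable.
Oversampling: increase the number of minority-class samples of the target variable toward the size of the majority class.

For best results, we will combine undersampling and oversampling.

We will first undersample the majority class and then oversample the minority class.

We will use imblearn to balance the data.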

import imblearn
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Data Balancing for Model based on Correlation Plot 
over = SMOTE(sampling_strategy = 0.5)
under = RandomUnderSampler(sampling_strategy = 0.1)
f1 = df1.iloc[:,:9].values
t1 = df1.iloc[:,9].values

steps = [('under', under),('over', over)]
pipeline = Pipeline(steps=steps)
f1, t1 = pipeline.fit_resample(f1, t1)
Counter(t1)


# Data Balancing for Model based on ANOVA Score
over = SMOTE(sampling_strategy = 0.5)
under = RandomUnderSampler(sampling_strategy = 0.1)
f2 = df2.iloc[:,:20].values
t2 = df2.iloc[:,20].values

steps = [('under', under),('over', over)]
pipeline = Pipeline(steps=steps)
f2, t2 = pipeline.fit_resample(f2, t2)
Counter(t2)

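Calculation for Data Balancing

Sampling strategy: a ratio parameter shared by oversampling and undersampling.
Sampling strategy = (minority-class samples) / (majority-class samples)

In this case,

Majority class: No Fraud cases: 284315 samples
Minority class: Fraud cases: 492 samples

Undersampling: reduce the majority-class samples

Sampling strategy = 0.1
0.1 = 492 / (majority-class samples)
After undersampling,
- Majority class: No Fraud cases: 4920 samples
- Minority class: Fraud cases: 492 samples

Oversampling: increase the minority-class samples

Sampling strategy = 0.5
0.5 = (minority-class samples) / 4920

After oversampling,

Majority class: No Fraud cases: 4920 samples
Minority class: Fraud cases: 2460 samples

Final class samples

Majority class: No Fraud cases: 4920 samples
Minority class: Fraud cases: 2460 samples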
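
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the ratio calculation, with variable names chosen for this example; it does not call imblearn.

```python
n_majority = 284315  # No Fraud rows
n_minority = 492     # Fraud rows

under_ratio = 0.1  # sampling_strategy passed to RandomUnderSampler
over_ratio = 0.5   # sampling_strategy passed to SMOTE

# Undersampling keeps the minority fixed and shrinks the majority so that
# minority / majority == under_ratio.
majority_after_under = round(n_minority / under_ratio)

# Oversampling keeps the (already reduced) majority fixed and grows the
# minority so that minority / majority == over_ratio.
minority_after_over = round(majority_after_under * over_ratio)

print("After undersampling:", majority_after_under, "vs", n_minority)
print("After oversampling :", majority_after_under, "vs", minority_after_over)
```

`round` is used instead of `int` so that floating-point error in the division by 0.1 cannot truncate the result by one.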

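With imbalanced datasets, the data is resampled to counter the potential bias in predictions. Because of this resampling, part of the data used for modeling is synthetic, which keeps the predictions from being skewed toward the majority class of the target.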
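
To make "synthetic" concrete, the sketch below shows the interpolation idea behind SMOTE: a new minority point is placed somewhere on the line segment between an existing minority point and one of its neighbors. This is a simplified illustration, not imblearn's implementation. The helper `smote_like_sample` and the sample values are hypothetical, and the neighbors are passed in directly rather than found by a nearest-neighbor search.

```python
import random

def smote_like_sample(point, neighbors, rng):
    """Create one synthetic point between `point` and a randomly chosen neighbor."""
    neighbor = rng.choice(neighbors)
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0]]

base = minority[0]
synthetic = smote_like_sample(base, minority[1:], rng)
print(synthetic)
```

Because the new point is interpolated rather than copied, the oversampled minority class contains rows that never occurred in the original data.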

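Scoring the models on accuracy alone would therefore be misleading. Instead, we will evaluate them with the confusion matrix, the ROC-AUC plot, and the ROC-AUC score.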
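
For intuition about the ROC-AUC score, the sketch below computes it as the fraction of (fraud, non-fraud) pairs in which the fraud row receives the higher score, with ties counted as half. The function `roc_auc` and the toy scores are assumptions for this example. The second call shows what happens when thresholded class labels are scored instead of continuous scores, which is how the `model` helper in this article calls `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

y_true = [0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.40, 0.30, 0.45, 0.90]  # hypothetical model scores
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded class predictions

auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, hard)

print(auc_from_scores)
print(auc_from_labels)
```

In this toy case the scores rank every fraud row above every non-fraud row, but thresholding discards that ranking, so the label-based value is lower. The two numbers answer slightly different questions and should not be compared directly.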

Modeling

We will now look at several machine learning models.

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import precision_recall_curve

# Hold out 20% of each balanced dataset for testing
x_train1, x_test1, y_train1, y_test1 = train_test_split(f1, t1, test_size = 0.20, random_state = 2)
x_train2, x_test2, y_train2, y_test2 = train_test_split(f2, t2, test_size = 0.20, random_state = 2)

def model(classifier,x_train,y_train,x_test,y_test):
    
    # Fit on the training split and predict hard class labels for the test split
    classifier.fit(x_train,y_train)
    prediction = classifier.predict(x_test)
    # Repeated stratified 10-fold cross-validation on the training split, scored by ROC-AUC
    cv = RepeatedStratifiedKFold(n_splits = 10,n_repeats = 3,random_state = 1)
    print("Cross Validation Score : ",'{0:.2%}'.format(cross_val_score(classifier,x_train,y_train,cv = cv,scoring = 'roc_auc').mean()))
    # Note: this ROC-AUC is computed from hard class predictions, not from probabilities or decision scores
    print("ROC_AUC Score : ",'{0:.2%}'.format(roc_auc_score(y_test,prediction)))
    # plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
    RocCurveDisplay.from_estimator(classifier, x_test,y_test)
    plt.title('ROC_AUC_Plot')
    plt.show()
    
def model_evaluation(classifier,x_test,y_test):
    
    # Confusion Matrix
    cm = confusion_matrix(y_test,classifier.predict(x_test))
    names = ['True Neg','False Pos','False Neg','True Pos']
    counts = [value for value in cm.flatten()]
    percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(names,counts,percentages)]
    labels = np.asarray(labels).reshape(2,2)
    sns.heatmap(cm,annot = labels,cmap = 'Blues',fmt ='')
    
    # Classification Report
    print(classification_report(y_test,classifier.predict(x_test)))
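
The classification report summarizes the confusion matrix into per-class metrics. As a minimal sketch, assuming hypothetical counts in the same order as `cm.flatten()` (TN, FP, FN, TP), the fraud-class precision, recall, and F1 score can be derived as follows. The helper name `fraud_metrics` is introduced only for this example.

```python
def fraud_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive (fraud) class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the rows flagged as fraud, how many really are
    recall = tp / (tp + fn)     # of the real fraud rows, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not results from this article's models.
tn, fp, fn, tp = 950, 30, 50, 450

precision, recall, f1 = fraud_metrics(tn, fp, fn, tp)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Precision : {precision:.4f}")
print(f"Recall    : {recall:.4f}")
print(f"F1 score  : {f1:.4f}")
print(f"Accuracy  : {accuracy:.4f}")
```

Unlike accuracy, these metrics ignore the true negatives, so they are not inflated by a large number of correctly classified non-fraud rows.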

1. Logistic Regression

from sklearn.linear_model import LogisticRegression
classifier_lr = LogisticRegression(random_state = 0,C=10,penalty= 'l2') 

# Model based on Correlation Plot 
model(classifier_lr,x_train1,y_train1,x_test1,y_test1)
model_evaluation(classifier_lr,x_test1,y_test1)


# Model based on ANOVA Score
model(classifier_lr,x_train2,y_train2,x_test2,y_test2)
model_evaluation(classifier_lr,x_test2,y_test2)


2. SVM

from sklearn.svm import SVC
classifier_svc = SVC(kernel = 'linear',C = 0.1)

# Model based on Correlation Plot
model(classifier_svc,x_train1,y_train1,x_test1,y_test1)
model_evaluation(classifier_svc,x_test1,y_test1)


# Model based on ANOVA Score
model(classifier_svc,x_train2,y_train2,x_test2,y_test2)
model_evaluation(classifier_svc,x_test2,y_test2)


3. DTC

from sklearn.tree import DecisionTreeClassifier
classifier_dt = DecisionTreeClassifier(random_state = 1000,max_depth = 4,min_samples_leaf = 1)

# Model based on Correlation Plot
model(classifier_dt,x_train1,y_train1,x_test1,y_test1)
model_evaluation(classifier_dt,x_test1,y_test1)


# Model based on ANOVA Score
model(classifier_dt,x_train2,y_train2,x_test2,y_test2)
model_evaluation(classifier_dt,x_test2,y_test2)


4. RFC

from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(max_depth = 4,random_state = 0)

# Model based on Correlation Plot
model(classifier_rf,x_train1,y_train1,x_test1,y_test1)
model_evaluation(classifier_rf,x_test1,y_test1)


# Model based on ANOVA Score
model(classifier_rf,x_train2,y_train2,x_test2,y_test2)
model_evaluation(classifier_rf,x_test2,y_test2)


5. KNN

from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier(leaf_size = 1, n_neighbors = 3,p = 1)

# Model based on Correlation Plot
model(classifier_knn,x_train1,y_train1,x_test1,y_test1)
model_evaluation(classifier_knn,x_test1,y_test1)


# Model based on ANOVA Score
model(classifier_knn,x_train2,y_train2,x_test2,y_test2)
model_evaluation(classifier_knn,x_test2,y_test2)

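Results Table

Models Based on the Correlation Plot

Sr. No.	ML Algorithm	Cross Validation Score	ROC AUC Score	F1 Score (Fraud)
1	Logistic Regression	98.01%	92.35%	91%
2	Support Vector Classifier	97.94%	92.10%	91%
3	Decision Tree Classifier	96.67%	91.36%	90%
4	Random Forest Classifier	97.84%	91.71%	91%
5	K-Nearest Neighbors	99.34%	97.63%	97%

Models Based on the ANOVA Score

Sr. No.	ML Algorithm	Cross Validation Score	ROC AUC Score	F1 Score (Fraud)
1	Logistic Regression	98.45%	94.69%	94%
2	Support Vector Classifier	98.32%	94.40%	94%
3	Decision Tree Classifier	97.13%	93.69%	93%
4	Random Forest Classifier	98.20%	94.06%	94%
5	K-Nearest Neighbors	99.54%	98.47%	97%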

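Because the features are anonymized, domain knowledge of the problem cannot guide feature selection. Statistical tests are therefore crucial for selecting the features used in modeling.

Because the data was balanced with SMOTE, models trained on this synthetic data cannot be meaningfully assessed by accuracy. We therefore use the cross-validation score and the ROC-AUC score to evaluate our models.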
