Liver Disease Prediction in Machine Learning

March 17, 2025 | 11 min read


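Liver disease is a major global health problem that affects millions of people worldwide. Detecting it early and accurately is essential for effective treatment and for preventing further complications. In recent years, machine learning has become a powerful tool in healthcare, making it possible to build predictive models that help diagnose and predict many conditions, liver disease among them.

Applications of Machine Learning in Liver Disease Prediction

Machine learning algorithms have a wide range of applications in liver disease prediction. By analyzing patient data and medical records, machine learning models can identify the patterns and risk factors associated with liver disease. Some of the main applications are:

Machine learning models can detect early signs of liver disease before symptoms appear. This lets healthcare providers intervene early and possibly halt the progression of the disease.
Machine learning algorithms can classify patients by their level of risk of developing liver disease. This supports personalized treatment plans and better allocation of medical resources.
By analyzing patient data continuously, machine learning models can monitor disease progression and give medical staff real-time updates.
Machine learning can predict how a patient will respond to different treatment options, which helps optimize treatment strategies and improve patient outcomes.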

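Benefits of Using Machine Learning for Liver Disease Prediction

Integrating machine learning into liver disease prediction brings several benefits:

Improved accuracy: Machine learning models can process large volumes of data and identify complex patterns, which can make predictions more accurate than traditional methods.
Early detection: Machine learning algorithms can detect liver disease at an early stage, allowing timely medical intervention and possibly preventing serious complications.
Personalized medicine: By analyzing individual patient data, machine learning can support treatment plans tailored to each patient's needs.
Better patient outcomes: Accurate prediction and early detection help improve patients' prognosis and quality of life.
Cost-effectiveness: Machine learning can optimize the use of medical resources by identifying high-risk patients and reducing unnecessary tests and hospitalizations.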

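Challenges of Using Machine Learning for Liver Disease Prediction

Despite these advantages, applying machine learning to liver disease prediction still faces several challenges:

High-quality, diverse medical data is essential for training robust machine learning models. However, labeled datasets for liver disease prediction can be hard to obtain.
Liver disease datasets are often imbalanced, with far fewer positive cases than negative ones. An imbalanced dataset can bias the model and degrade its predictive performance.
Some machine learning models, such as deep learning algorithms, are often regarded as "black boxes" because of their complexity. Medical staff may find their predictions hard to interpret.
Handling sensitive medical data raises ethical and privacy concerns. Protecting patient data while keeping it available for research is a delicate balance.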
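To make the imbalance point concrete, here is a small worked example. It uses the class counts of the dataset introduced later in this article (416 liver patients, 167 non-liver patients), where the skew runs toward the positive class. It is only an arithmetic sketch: a "model" that always predicts the majority class looks accurate while being useless on the minority class.

```python
# Class counts reported for the dataset used later in this article.
n_liver = 416      # patients with liver disease
n_non_liver = 167  # patients without liver disease
total = n_liver + n_non_liver

# Accuracy of a trivial "model" that always predicts the majority class.
majority_accuracy = max(n_liver, n_non_liver) / total

# Recall of that same model on the minority class: it never predicts it.
minority_recall = 0 / n_non_liver

print(f"total records: {total}")
print(f"always-majority accuracy: {majority_accuracy:.3f}")
print(f"minority-class recall: {minority_recall:.3f}")
```

This is why the sections below look at the confusion matrix, recall and AUC rather than accuracy alone.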

Next, we will implement this in code.

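Data Summary

The number of patients with liver disease has been rising steadily because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors.

Content

The dataset contains records of 416 liver patients and 167 non-liver patients collected from the north east of Andhra Pradesh, India. The "Dataset" column is the class label that divides the records into liver patients (liver disease) and non-liver patients (no liver disease). The dataset contains records of 441 male patients and 142 female patients.

Any patient older than 89 is listed as being of age "90".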

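Columns

Age of the patient
Gender of the patient
Total Bilirubin
Direct Bilirubin
Alkaline Phosphatase
Alanine Aminotransferase
Aspartate Aminotransferase
Total Proteins
Albumin
Albumin and Globulin Ratio
Dataset: the field used to split the data into two groups (patient with liver disease, or no liver disease)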

Importing Libraries

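import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter/IPython magic: render plots inline (remove when running as a plain script)
%matplotlib inline

# Load the dataset and preview the first rows
data = pd.read_csv("../input/indian_liver_patient.csv")
data.head()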

Output

EDA

Output
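A minimal sketch of the kind of inspection this step relies on. The toy frame below is a made-up stand-in for the real `data` DataFrame (its values are illustrative only); on the real data the same two calls would be `data.info()` and `data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [65, 62, 58, 72],
    "Total_Bilirubin": [0.7, 10.9, 1.0, 3.9],
    "Albumin_and_Globulin_Ratio": [0.90, np.nan, 1.00, 0.40],
    "Dataset": [1, 1, 2, 1],
})

toy.info()                               # dtypes and non-null counts per column
missing_per_column = toy.isnull().sum()  # number of NaN values per column
print(missing_per_column)
```

A column whose non-null count is lower than the number of rows is the one that needs attention during preprocessing.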

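The feature named "Albumin_and_Globulin_Ratio" is incomplete: it has fewer non-null values than the 583 records in the dataset (4 are missing in the commonly distributed copy of this file). We therefore need to handle it during data preprocessing. Next, we check how balanced the data is by plotting the class counts as a bar chart.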

# Check the class counts.
# The dataset description reports 416 liver disease patients and 167 non-liver disease patients.
# The classes still need remapping to liver disease = 1 and no liver disease = 0 (the usual convention).
count_classes = data['Dataset'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title("Liver disease classes histogram")
plt.xlabel("Dataset")
plt.ylabel("Frequency")

Output

To simplify the class labels, we reassign them: patients without liver disease get label 0, and patients with liver disease get label 1.
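A minimal sketch of that remapping, assuming the raw file encodes liver disease as 1 and no liver disease as 2 in the "Dataset" column. `remap_labels` is a hypothetical helper written for this illustration, and the toy frame stands in for the real data.

```python
import pandas as pd

def remap_labels(df, column="Dataset"):
    """Return a copy of df with labels remapped: 1 -> 1 (disease), 2 -> 0 (no disease)."""
    out = df.copy()
    # Assumed raw encoding: 1 = liver disease, 2 = no liver disease.
    out[column] = out[column].map({1: 1, 2: 0})
    return out

toy = pd.DataFrame({"Dataset": [1, 2, 1, 1, 2]})
remapped = remap_labels(toy)
print(remapped["Dataset"].value_counts().sort_index())
```

On the real frame this would be applied as `data = remap_labels(data)` before the labels are used for training.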

Output

At this point, I replace the missing values with zero.
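A short sketch of that step on a toy frame (values invented for illustration); on the real data the equivalent call would be `data = data.fillna(0)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Albumin": [3.3, 3.2, 3.4],
    "Albumin_and_Globulin_Ratio": [0.9, np.nan, 1.0],
})

filled = toy.fillna(0)  # replace every NaN with 0, as described above
print(filled)
print("remaining NaNs:", int(filled.isnull().sum().sum()))
```

Zero lies outside the plausible range for this ratio, so filling with the column median (`toy.fillna(toy.median())`) is a common alternative; that is a design choice, not what this article does.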

data_features=data.drop(['Dataset'],axis=1)
data_num_features=data.drop(['Gender','Dataset'],axis=1)
data_num_features.head()

Output

data_num_features.describe()  # check whether feature scaling is needed

Output

The table shows that the features span very different ranges, so feature scaling is necessary.

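from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols = list(data_num_features.columns)
# Work on a copy so the unscaled values in data_features stay intact.
data_features_scaled = data_features.copy()
# Standardize every numeric column to zero mean and unit variance.
data_features_scaled[cols] = scaler.fit_transform(data_features[cols])
data_features_scaled.head()

Output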
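The scaler above is fitted on all rows before the train/test split, so the test rows influence the scaling statistics. One common way to avoid that, sketched here on synthetic data rather than the liver dataset, is to put the scaler and the classifier in a scikit-learn pipeline so the scaler is fitted on the training rows only. This is an alternative arrangement, not the approach this article follows; `X_demo` and `y_demo` are made-up names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50.0, 1.0, 300.0], scale=[15.0, 0.5, 100.0], size=(200, 3))
y_demo = (X_demo[:, 2] > 300.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
print("means learned from training rows only:", scaler.mean_)
print("test accuracy:", model.score(X_te, y_te))
```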

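Next, to encode the categorical data as numbers, we use the pandas function "get_dummies". Only one column (Gender) needs encoding, so this function is sufficient for the task.

# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
data_exp = pd.get_dummies(data_features_scaled, dtype=float)
data_exp.head()

Output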

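A useful way to examine the relationships between features is to call "corr()" and draw a heatmap of the result, which gives a visual picture of the pairwise correlations.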

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 10))
plt.title('Pearson Correlation of liver disease Features')
# Draw the heatmap using seaborn
sns.heatmap(data_num_features.astype(float).corr(),linewidths=0.25,vmax=1.0, square=True, cmap="YlGnBu", linecolor='black',annot=True)

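Output

The heatmap shows strong correlations between certain pairs of features. Specifically, "Direct Bilirubin" and "Total Bilirubin", "Alanine Aminotransferase" and "Aspartate Aminotransferase", and "Total Proteins" and "Albumin" are highly correlated.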
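Reading pairs off a heatmap by eye can be complemented with a small helper that lists them. The sketch below runs on synthetic columns, not the liver data; `top_correlated_pairs` and the 0.7 threshold are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: the second column is built from the first, the third is independent.
rng = np.random.default_rng(0)
total = rng.uniform(0.5, 10.0, size=100)
toy = pd.DataFrame({
    "Total_Bilirubin": total,
    "Direct_Bilirubin": 0.5 * total + rng.normal(0, 0.1, size=100),
    "Age": rng.integers(20, 80, size=100).astype(float),
})

pairs = top_correlated_pairs(toy, threshold=0.7)
print(pairs)
```

On the real data the call would be `top_correlated_pairs(data_num_features)`.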

Now we apply a Support Vector Classifier (SVC) on its own to the dataset, without any sampling technique, to evaluate its performance.

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report

import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` with one tick per entry in `classes`.
    Set `normalize=True` to show per-row proportions instead of counts.
    """
    if normalize:
        # Normalize before drawing so the colors match the printed values.
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each cell's value: white text on dark cells, black on light ones.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

X=data_exp
y=data['Dataset'] 
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.3,random_state=0)

Output

clf=SVC(random_state=0,kernel='rbf')
clf.fit(X_train,Y_train)
predictions=clf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

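Output

The confusion matrix contains no true negatives, which is a poor result. It shows that the model is biased toward the majority class and keeps predicting that the patient has liver disease. We need to tune the model.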
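The situation described above can be reproduced with a few made-up labels: a classifier that predicts "liver disease" for every patient reaches perfect recall and still has zero true negatives. This is a sketch on invented data, not the article's test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: 7 patients with liver disease (1), 3 without (0).
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
# A degenerate classifier that predicts class 1 for everyone.
y_pred = np.ones_like(y_true)

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of actual positives that were found
specificity = tn / (tn + fp)   # share of actual negatives that were found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(f"recall={recall:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
```

Recall alone therefore cannot show this failure; the true-negative cell (or specificity) has to be checked as well.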

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output
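The ROC code above passes hard 0/1 predictions to `roc_curve`, which produces a curve with a single interior point. ROC analysis is usually run on continuous scores, which for an SVC are available from `decision_function`. The sketch below compares both on synthetic data generated with `make_classification`; it is an illustration under those assumptions and does not reproduce the AUC values quoted in this article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, moderately imbalanced data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.3, 0.7], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

svc = SVC(kernel="rbf", random_state=0).fit(X_tr, y_tr)

# AUC from hard class labels: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_te, svc.predict(X_te))
# AUC from signed distances to the decision boundary: all thresholds are evaluated.
auc_from_scores = roc_auc_score(y_te, svc.decision_function(X_te))

print(f"AUC from hard labels:     {auc_from_labels:.3f}")
print(f"AUC from decision scores: {auc_from_scores:.3f}")
```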

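The ROC curve and the confusion matrix make it clear that we need to reduce the number of false positives, since they are incorrect predictions. To tune the model, we use GridSearchCV.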

# Import GridSearchCV, make_scorer, and the metrics used below
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score, accuracy_score
from sklearn.svm import SVC

# Initialize the classifier
clf = SVC(random_state=0, kernel='rbf')

# Hyperparameters to search, as {'parameter': [candidate values]}
parameters = {'C': [10, 50, 100, 200], 'kernel': ['poly', 'rbf', 'linear', 'sigmoid']}

# Scoring object built from fbeta_score; beta=0.5 weights precision above recall.
# Note: the scores printed below use beta=2, which weights recall above precision.
scorer = make_scorer(fbeta_score, beta=0.5)

# Grid search over the parameters, using 'scorer' to rank the candidates
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1)

# Fit the grid search on the training data to find the best parameters
grid_fit = grid_obj.fit(X_train, Y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions with the unoptimized and the optimized model
predictions = clf.fit(X_train, Y_train).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(Y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(Y_test, predictions, beta=2)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(Y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(Y_test, best_predictions, beta=2)))
print(best_clf)

Output
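The grid search and the report above both rely on the F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below computes it by hand on invented labels and checks it against scikit-learn's `fbeta_score`; `manual_fbeta` is a hypothetical helper for this illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Invented labels: 5 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

def manual_fbeta(p, r, beta):
    """F-beta from precision p and recall r; beta > 1 favors recall, beta < 1 favors precision."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

for beta in (0.5, 2):
    by_hand = manual_fbeta(precision, recall, beta)
    by_sklearn = fbeta_score(y_true, y_pred, beta=beta)
    print(f"beta={beta}: manual={by_hand:.4f} sklearn={by_sklearn:.4f}")
```

Because recall is higher than precision in this example, the beta=2 score comes out higher than the beta=0.5 score, which is why the choice of beta matters when comparing models.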

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,best_predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output

Now that the predictions include true negatives, the ROC curve should show better performance.

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, best_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

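Output

Compared with the unoptimized model, the AUC of the ROC curve improves to 0.58. That still does not make this a highly effective model. The imbalance in the dataset may be limiting the AUC, and the relatively small size of the dataset may also constrain performance.

I will apply an oversampling technique to balance the dataset and increase the amount of data.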

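from imblearn.over_sampling import SMOTE

oversampler = SMOTE(random_state=0)
# fit_resample replaces the older fit_sample, which was removed from imbalanced-learn.
# Only the training split is oversampled; the test split stays untouched.
os_features, os_labels = oversampler.fit_resample(X_train, Y_train)

Output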
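SMOTE creates new minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbors. The sketch below illustrates that idea with NumPy and scikit-learn's `NearestNeighbors`. It is a simplified illustration, not imbalanced-learn's implementation; `smote_like_samples` is a hypothetical name, and the code assumes the minority rows contain no exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating toward random nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Ask for k + 1 neighbors because each row is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)
    # Pick a random base row and one of its k true neighbors (columns 1..k).
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Place the new row at a random position on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))          # 20 made-up minority rows with 2 features
synthetic = smote_like_samples(X_min, n_new=30)
print(synthetic.shape)
```

Every synthetic row lies on a segment between two real minority rows, so oversampling adds no information from outside the region the minority class already occupies.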

clf=SVC(random_state=0,kernel='rbf') # unoptimized Model
clf.fit(os_features,os_labels)

Output

# perform predictions on test set
predictions=clf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output

The recall metric is low, which indicates that the model needs tuning to improve.

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output

# Import GridSearchCV, make_scorer, and the metrics used below
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score, accuracy_score

# Initialize the classifier
clf = SVC(random_state=0, kernel='rbf')

# Hyperparameters to search, as {'parameter': [candidate values]}
parameters = {'C': [10, 50, 100, 200], 'kernel': ['poly', 'rbf', 'linear', 'sigmoid']}

# Scoring object built from fbeta_score; beta=2 weights recall above precision
scorer = make_scorer(fbeta_score, beta=2)

# Grid search over the parameters, using 'scorer' to rank the candidates
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1)

# Fit the grid search on the oversampled training data to find the best parameters
grid_fit = grid_obj.fit(os_features, os_labels)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions with the unoptimized and the optimized model
predictions = clf.fit(os_features, os_labels).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(Y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(Y_test, predictions, beta=2)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(Y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(Y_test, best_predictions, beta=2)))
print(best_clf)

Output

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,best_predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, best_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

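Output

Even with SMOTE, the SVC's performance remains unsatisfactory. The recall metric and the AUC score are both about 0.67, short of the level we want. We therefore explore RandomForestClassifier as an alternative.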

from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state=0) # unoptimized Model
clf.fit(os_features,os_labels)

Output

# perform predictions on test set
predictions=clf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output

With RandomForestClassifier, the recall metric improves over the SVC. The model still needs further tuning to optimize its performance.

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output

# Import GridSearchCV, make_scorer, and the metrics used below
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Initialize the classifier
clf = RandomForestClassifier(random_state=0)

# Hyperparameters to search, as {'parameter': [candidate values]}
parameters = {'n_estimators': [100, 250, 500], 'max_depth': [3, 6, 9]}

# Scoring object built from fbeta_score; beta=2 weights recall above precision
scorer = make_scorer(fbeta_score, beta=2)

# Grid search over the parameters, using 'scorer' to rank the candidates
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1)

# Fit the grid search on the oversampled training data to find the best parameters
grid_fit = grid_obj.fit(os_features, os_labels)

# Get the best estimator
best_clf = grid_fit.best_estimator_

# Make predictions with the unoptimized and the optimized model
predictions = clf.fit(os_features, os_labels).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(Y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(Y_test, predictions, beta=2)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(Y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(Y_test, best_predictions, beta=2)))
print(best_clf)

Output

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,best_predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, best_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

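Output

After tuning the RandomForestClassifier with GridSearchCV, the model reaches a recall of 0.76 and an ROC AUC of 0.69.
Judged by these results, RandomForestClassifier is the best choice among the models tried here for predicting liver disease in patients, because it takes multiple features into account.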
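One way to see how a random forest weighs its features is the `feature_importances_` attribute of a fitted model. The sketch below uses synthetic data with invented column names, so it says nothing about which liver features matter; on the fitted model from this article the equivalent would be `pd.Series(best_clf.feature_importances_, index=X.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on one column only (illustrative names).
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_demo = (X_demo["informative"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```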

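Future Aspects of Liver Disease Prediction Using Machine Learning

As machine learning continues to evolve, several directions look promising for liver disease prediction:

Integrating machine learning algorithms with EHR systems can strengthen real-time prediction and enable continuous patient monitoring.
Combining multiple machine learning models through ensemble methods can improve the accuracy and robustness of predictions.
Research on explainable AI techniques can shed light on the decision process of complex machine learning models, making them more transparent and easier to interpret.
Integrating various omics data (such as genomics, proteomics and metabolomics) can strengthen machine learning models for liver disease prediction.
Developing adaptive machine learning models that keep learning from new data can improve prediction accuracy over time.

Conclusion

Machine learning has become a valuable tool for liver disease prediction, offering clear advantages in accuracy, early detection and personalized medicine. Challenges such as data availability, model interpretability and ethical considerations still need to be addressed. Further advances in machine learning techniques are expected to make liver disease prediction more accurate and more effective. By harnessing machine learning, we can improve patient outcomes and make significant progress in the global fight against liver disease.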
