Diabetes Prediction Using Machine Learning

March 17, 2025 | 11 min read

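Diabetes is a medical condition that affects how the body turns food into energy. Most of the food we eat is broken down into sugar (glucose) and released into the bloodstream. When the blood sugar level rises, the pancreas releases insulin.

If it is not carefully and continuously managed, diabetes leads to elevated blood sugar levels, which raises the risk of serious complications such as heart disease and stroke. For that reason, we use machine learning in Python to predict diabetes.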

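Steps:

Installing Libraries
Importing the Dataset
Filling Missing Values
Exploratory Data Analysis
Feature Engineering
Implementing Machine Learning Models
Predicting Unseen Data
Summary Report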

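Installing Libraries

The first step in building the project is to import the popular Python libraries we will use to implement the machine learning algorithms, including Pandas, Seaborn, and Matplotlib.

We use Python because it is flexible, has a mature ecosystem for data analysis, and is also widely used in general software development.

Code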

# Import libraries
import numpy as np # for linear algebra
import pandas as pd # for data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for data visualization
import matplotlib.pyplot as plt # to plot data visualization charts
from collections import Counter
import os

# Modeling Libraries
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, train_test_split
from sklearn.svm import SVC

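The scikit-learn (sklearn) toolkit is practical and widely used in real applications. It provides a large collection of ML models and algorithms.

Importing the Dataset

For this study we use the Pima Indians Diabetes dataset from Kaggle. The original source of the database is the National Institute of Diabetes and Digestive and Kidney Diseases.

Code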

# Importing the dataset from Kaggle
data = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

# First step is getting familiar with the structure of the dataset
data.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

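As we can see, every column is an integer except BMI and DiabetesPedigreeFunction. The target variable, Outcome, is a label that takes the values 1 and 0, where 1 indicates that the person has diabetes and 0 that they do not.

Code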

# Showing the top 5 rows of the dataset
data.head()

Output

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

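Filling Missing Values

The next step is cleaning the dataset, a key step in data analysis. Missing data can lead to incorrect results when modeling and making predictions.

Code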

# Exploring the missing values in the diabetes dataset
data.isnull().sum()

Output

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

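We found no missing values in the dataset, but some independent features, such as SkinThickness, Insulin, BloodPressure, and Glucose, contain 0 values, which are not possible in practice. These unwanted zeros must be replaced with the mean or median of the respective column.

Code

# Replacing 0 values with the median or mean of that column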

# Replacing 0 values of Glucose
data['Glucose'] = data['Glucose'].replace(0, data['Glucose'].median())

# Filling 0 values of Blood Pressure
data['BloodPressure'] = data['BloodPressure'].replace(0, data['BloodPressure'].median()) 

# Replacing 0 values in BMI
data['BMI'] = data['BMI'].replace(0, data['BMI'].mean())

# Replacing the missing values of Insulin and SkinThickness
data['SkinThickness'] = data['SkinThickness'].replace(0, data['SkinThickness'].mean())
data['Insulin'] = data['Insulin'].replace(0, data['Insulin'].mean())
data.head()

Output

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35.000000	79.799479	33.6	0.627	50	1
1	1	85	66	29.000000	79.799479	26.6	0.351	31	0
2	8	183	64	20.536458	79.799479	23.3	0.672	32	1
3	1	89	66	23.000000	94.000000	28.1	0.167	21	0
4	0	137	40	35.000000	168.000000	43.1	2.288	33	1

Now let us review the dataset statistics.

Code

# Reviewing the dataset statistics
data.describe()

Output

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	121.656250	72.386719	26.606479	118.660163	32.450805	0.471876	33.240885	0.348958
std	3.369578	30.438286	12.096642	9.631241	93.080358	6.875374	0.331329	11.760232	0.476951
min	0.000000	44.000000	24.000000	7.000000	14.000000	18.200000	0.078000	21.000000	0.000000
25%	1.000000	99.750000	64.000000	20.536458	79.799479	27.500000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	79.799479	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

The dataset is now free of missing values and of unwanted zero values.
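One caveat: the replacement above computes each mean or median while the zeros are still in the column, which pulls the fill value down. A minimal sketch of an alternative, assuming a small hypothetical frame named toy in place of the Kaggle data, first marks the zeros as missing and then fills them with the median of the valid readings only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two columns of the diabetes dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
})

cols = ["Glucose", "Insulin"]

# Mark physiologically impossible zeros as missing
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

# Median of the valid readings only (NaN values are skipped by default)
medians = cleaned[cols].median()

# Fill the gaps column by column
cleaned[cols] = cleaned[cols].fillna(medians)

print(medians)
print(cleaned)
```

Whether this gives better models than the simpler replacement is an empirical question; the sketch only shows how to keep the zeros out of the fill value.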

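Exploratory Data Analysis

In this tutorial we use the Seaborn library to visualize the analysis.

Correlation

Correlation measures the relationship between two or more variables. Identifying the important features and cleaning the dataset before modeling also helps improve the model's performance.

Code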

# Correlation plot of the independent variables

plt.figure(figsize = (10, 8))
sns.heatmap(data.corr(), annot = True, fmt = ".3f", cmap = "YlGnBu")
plt.title("Correlation heatmap")

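Output

[Figure: correlation heatmap of the dataset features]

The heatmap shows that Pregnancies, Glucose, BMI, and Age are more strongly related to Outcome than the other features. The following sections look at Pregnancies and Glucose in more detail.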
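The same ranking can be read off numerically instead of from the heatmap colors. The sketch below is an illustration on synthetic data, with a hypothetical frame named toy in which Glucose drives the label and Noise does not; it sorts the features by the absolute value of their correlation with Outcome:

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: "Glucose" drives the label, "Noise" does not
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, size=500)
noise = rng.normal(0, 1, size=500)
outcome = (glucose + rng.normal(0, 10, size=500) > 130).astype(int)

toy = pd.DataFrame({"Glucose": glucose, "Noise": noise, "Outcome": outcome})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to data would list the columns in the order suggested by the heatmap.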

Pregnancies

Code

# Exploring Pregnancy and target variables together

plt.figure(figsize = (10, 8))

# Plotting density function graph of the pregnancies and the target variable
kde = sns.kdeplot(data["Pregnancies"][data["Outcome"] == 1], color = "Red", fill = True)
kde = sns.kdeplot(data["Pregnancies"][data["Outcome"] == 0], ax = kde, color = "Blue", fill = True)
kde.set_xlabel("Pregnancies")
kde.set_ylabel("Density")
kde.legend(["Positive Result", "Negative Result"])

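Output

The data include women with diabetes who have had several pregnancies. Even so, keeping diabetes under control reduces the risk of later complications. Uncontrolled diabetes during pregnancy raises the risk of complications such as high blood pressure, depression, preterm birth, birth defects, and miscarriage.

Glucose

Code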

# Exploring the Glucose and the Target variables together
plt.figure(figsize = (10, 8))
sns.violinplot(data = data, x = "Outcome", y = "Glucose",
               split = True, inner = "quart", linewidth = 2)

Output

The likelihood of diabetes rises gradually as the glucose level rises.
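A quick numeric check of this trend is to compare glucose statistics per outcome class. The sketch below uses the five rows shown by data.head() earlier plus one made-up row (116, 0), held in a hypothetical frame named toy:

```python
import pandas as pd

# First five rows come from the data.head() output; the sixth row is made up
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Mean and median glucose for each outcome class
summary = toy.groupby("Outcome")["Glucose"].agg(["mean", "median"])
print(summary)
```

Replacing toy with data gives the same comparison on the full dataset.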

Code

# Exploring the density function plot of the Glucose levels

plt.figure(figsize = (10, 8))
kde = sns.kdeplot(data["Glucose"][data["Outcome"] == 1], color = "Red", fill = True)
kde = sns.kdeplot(data["Glucose"][data["Outcome"] == 0], ax = kde, color = "Blue", fill = True)
kde.set_xlabel("Glucose")
kde.set_ylabel("Density")
kde.legend(["Positive Result","Negative Result"])

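Output

Implementing Machine Learning Models

In this part we test several machine learning models and compare their accuracy. We then tune the hyperparameters of the models that score well.

We use QuantileTransformer from sklearn.preprocessing to map the data onto quantiles before splitting the dataset.

Code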

# Transforming the data with a quantile transformer
quartile = QuantileTransformer()
X = quartile.fit_transform(data)
# Wrapping the transformed array in a DataFrame with the original column names
dataset = pd.DataFrame(X, columns = data.columns)
# Showing the top 5 rows of the transformed dataset
dataset.head()

Output

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	0.747718	0.810300	0.494133	0.801825	0.380052	0.591265	0.750978	0.889831	1.0
1	0.232725	0.091265	0.290091	0.644720	0.380052	0.213168	0.475880	0.558670	0.0
2	0.863755	0.956975	0.233377	0.308996	0.380052	0.077575	0.782269	0.585398	1.0
3	0.232725	0.124511	0.290091	0.505867	0.662973	0.284224	0.106258	0.000000	0.0
4	0.000000	0.721643	0.005215	0.801825	0.834420	0.926988	0.997392	0.606258	1.0
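The snippet above fits the transformer on every column, including Outcome, and on all rows. A common alternative, sketched here on synthetic data with hypothetical variable names, transforms only the features and fits the transformer on the training rows, so that no information from the test rows leaks into it:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

# Synthetic, skewed features (loosely resembling a column such as Insulin)
rng = np.random.default_rng(0)
features = rng.exponential(scale=50.0, size=(200, 3))
labels = rng.integers(0, 2, size=200)

F_train, F_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.4, random_state=10
)

# n_quantiles should not exceed the number of training rows (120 here)
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
F_train_q = qt.fit_transform(F_train)   # learn the quantiles on training rows
F_test_q = qt.transform(F_test)         # reuse them for the test rows

print(F_train_q.shape, F_test_q.shape)
print(F_train_q.min(), F_train_q.max())
```

The labels are left untouched, so they stay as integer classes for the classifiers.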

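Splitting the Data

We now split the data into a training set and a test set, which we use to train and evaluate the different models. Before predicting on the test data, we also cross-validate several models. Note that the code below splits the original data frame, data, not the quantile-transformed dataset.

Code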

# Splitting the dependent and independent features
X = data.drop(["Outcome"], axis = 1)
Y = data["Outcome"]

# Splitting the dataset into the training and testing dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.40, random_state = 10)

# Printing the size of the training and testing dataset
print("The size of the training dataset: ", X_train.size)
print("The size of the testing dataset: ", X_test.size)

Output

The size of the training dataset:  3680
The size of the testing dataset:  2464

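The code above splits the dataset into a training set (60%) and a test set (40%), as set by test_size = 0.40. DataFrame.size reports the number of cells (rows times columns), so 3680 corresponds to 460 rows of 8 features and 2464 to 308 rows.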
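To see row counts directly, and to keep the share of positive cases equal in both parts, one option is to inspect shape and pass stratify. The sketch below is an assumption-laden illustration: it uses random features with the same dimensions as the dataset (768 rows, 8 features) and 268 positive labels, the count implied by the Outcome mean of 0.348958 in the statistics table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with the same row and feature counts as the dataset
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(768, 8)))
labels = pd.Series(np.r_[np.zeros(500, dtype=int), np.ones(268, dtype=int)])

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.40, random_state=10, stratify=labels
)

print(f_train.shape, f_train.size)    # (rows, columns) versus rows * columns
print(f_test.shape, f_test.size)
print(l_train.mean(), l_test.mean())  # share of positive labels in each part
```

The tutorial's own split does not stratify, so its class proportions can differ slightly between the two parts.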

Cross-Validating the Models

We cross-validate the models on the training set.

Code

# Python program to create a function to validate models

def cv_model(models):
    """
    We will create a list of machine learning models and print graphs of cross-validation scores with the help of mean accuracy.
    """
    
    # Cross validating the model using the Kfold stratified cross-validation method
    k_fold = StratifiedKFold(n_splits = 15)
    
    r = []
    for m in models :
        r.append(cross_val_score(estimator = m, X = X_train, y = Y_train, scoring = "accuracy", cv = k_fold, n_jobs = 4))

    cross_val_means = []
    cross_val_std = []
    for result in r:
        cross_val_means.append(result.mean())
        cross_val_std.append(result.std())

    df_result = pd.DataFrame({
        "CrossValMean": cross_val_means,
        "CrossValStd": cross_val_std,
        "Model List":[
            "DecisionTreeClassifier",
            "LogisticRegression",
            "SVC",
            "AdaBoostClassifier",
            "GradientBoostingClassifier",
            "RandomForestClassifier",
            "KNeighborsClassifier"
        ]
    })

    # Generating the graph of cross-validation scores
    bar_plot = sns.barplot(x = cross_val_means, y = df_result["Model List"].values, data = df_result)
    bar_plot.set_xlabel("Mean of Cross Validation Accuracy Scores")
    bar_plot.set_title("Cross Validation Scores of Models")
    return df_result

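Passing a list of machine learning models to the cv_model function returns a table with the mean and standard deviation of each model's cross-validated accuracy and draws a bar chart of the mean scores. The model names inside the function are hard-coded, so the list must be passed in the same order.

Code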

# Modeling the dataset using different machine learning algorithms
state = 20
models_list = [
    DecisionTreeClassifier(random_state = state),
    LogisticRegression(random_state = state, solver ='liblinear'),
    SVC(random_state = state),
    AdaBoostClassifier(DecisionTreeClassifier(random_state = state), random_state = state, learning_rate = 0.3),
    GradientBoostingClassifier(random_state = state),
    RandomForestClassifier(random_state = state),
    KNeighborsClassifier()
]
cv_model(models_list)

Output

	CrossValMean	CrossValStd	Model List
0	0.697921	0.067773	DecisionTreeClassifier
1	0.780358	0.085376	LogisticRegression
2	0.782437	0.069578	SVC
3	0.686882	0.050551	AdaBoostClassifier
4	0.762796	0.072912	GradientBoostingClassifier
5	0.760717	0.079104	RandomForestClassifier
6	0.739283	0.043985	KNeighborsClassifier

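According to this analysis, SVC and LogisticRegression have the highest cross-validated accuracy, with GradientBoostingClassifier and RandomForestClassifier close behind. We tune the hyperparameters of three of these models: RandomForestClassifier, LogisticRegression, and SVC.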
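The cv_model function above pairs the scores with a hard-coded list of names, which would silently mislabel rows if the model list were reordered. A minimal sketch of one alternative, using a hypothetical helper named summarize_cv and synthetic data in place of X_train and Y_train, reads each name from the model object itself:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

def summarize_cv(models, features, labels, n_splits=5):
    """Return mean and std of cross-validated accuracy, one row per model."""
    k_fold = StratifiedKFold(n_splits=n_splits)
    rows = []
    for model in models:
        scores = cross_val_score(model, features, labels, scoring="accuracy", cv=k_fold)
        rows.append({
            "Model": type(model).__name__,   # label taken from the object itself
            "CrossValMean": scores.mean(),
            "CrossValStd": scores.std(),
        })
    return pd.DataFrame(rows)

summary = summarize_cv(
    [DecisionTreeClassifier(random_state=20), LogisticRegression(max_iter=1000)],
    features,
    labels,
)
print(summary)
```

The plotting step is left out here; the returned frame can be passed to sns.barplot in the same way as df_result.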

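Hyperparameter Tuning

Choosing the best set of hyperparameters for a machine learning algorithm is called hyperparameter tuning. Hyperparameters are settings of the model whose values are fixed before training begins. Tuning them well can have a large effect on model performance.

We tune the RandomForestClassifier, LogisticRegression, and SVC models separately.

Code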

# Importing the required libraries
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Defining a function to analyse the grid results
def analyze_grid(grid):
    '''
    Analyzing the results of GridCV method and making predictions for the test data
    Presenting the classification report at the end
    '''    
    # Printing the best parameter and accuracy score
    print("Tuned hyperparameters: ", grid.best_params_)
    print("Accuracy Score:", grid.best_score_)
    
    mean_values = grid.cv_results_["mean_test_score"]
    std_values = grid.cv_results_["std_test_score"]
    for m, s, p in zip(mean_values, std_values, grid.cv_results_["params"]):
        print(f"Mean: {m}, Std: {s} * 2, Params: {p}")

    # Predicting on the test data and printing the classification report once
    print("The classification Report:")
    Y_true, Y_pred = Y_test, grid.predict(X_test)
    print(classification_report(Y_true, Y_pred))
    print()

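We first import GridSearchCV and classification_report from sklearn. We then define the analyze_grid function, which prints the best parameters, the cross-validation scores, and the classification report on the test data. We call it for every model tuned with GridSearchCV. In the next stage we tune each model.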
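The scores that analyze_grid prints line by line can also be viewed as a table, since cv_results_ is a dictionary of equal-length arrays. The sketch below is an illustration on synthetic data, not the tutorial's dataset; the parameter grid is an arbitrary choice:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [100, 1.0, 0.01]},
    cv=5,
    scoring="accuracy",
)
grid.fit(features, labels)

# One row per parameter combination, best-ranked first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(grid.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```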

Tuning the Hyperparameters of Logistic Regression

Code

# Defining the Logistic Regression model and its parameters
model = LogisticRegression(solver ='liblinear')
solver_list = ['liblinear']
penalty_type = ['l2']
c_values = [200, 100, 10, 1.0, 0.01]

# Defining the grid search
grid_lr = dict(solver = solver_list, penalty = penalty_type, C = c_values)
cross_val = StratifiedKFold(n_splits = 100, random_state = 10, shuffle = True)
grid_search_cv = GridSearchCV(estimator = model, param_grid = grid_lr, cv = cross_val, scoring = 'accuracy', error_score = 0)
lr_result = grid_search_cv.fit(X_train, Y_train)

# Result of Hyper Parameters of Logistic Regression
analyze_grid(lr_result)

Output

Tuned hyperparameters:  {'C': 200, 'penalty': 'l2', 'solver': 'liblinear'}
Accuracy Score: 0.7715000000000001
Mean: 0.7715000000000001, Std: 0.16556796187668676 * 2, Params: {'C': 200, 'penalty': 'l2', 'solver': 'liblinear'}
Mean: 0.7715000000000001, Std: 0.16556796187668676 * 2, Params: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
Mean: 0.7675, Std: 0.16961353129983467 * 2, Params: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Mean: 0.7675, Std: 0.17224619008848932 * 2, Params: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
Mean: 0.711, Std: 0.1888888562091475 * 2, Params: {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
The classification Report:
              precision    recall  f1-score   support

           0       0.78      0.88      0.83       201
           1       0.70      0.53      0.61       107

    accuracy                           0.76       308
   macro avg       0.74      0.71      0.72       308
weighted avg       0.75      0.76      0.75       308

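As the output shows, the best cross-validation score of the LogisticRegression model is 0.77, obtained with the parameters {'C': 200, 'penalty': 'l2', 'solver': 'liblinear'}; C = 100 gives the same score. We tune the other models in the same way.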
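In scikit-learn, C is the inverse of the regularization strength, so a larger C penalizes the weights less. The sketch below illustrates this on synthetic data (not the diabetes dataset) by fitting the default L2-penalized model for three values of C and comparing the size of the learned weight vector:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for c in [100, 1.0, 0.01]:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(features, labels)
    # Euclidean norm of the learned weight vector for this value of C
    coef_norms[c] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The norm shrinks as C decreases, which is the stronger regularization at work. This is consistent with the grid search above, where the smallest C scored lowest.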

Tuning the Hyperparameters of SVC

Code

# Defining the SVC model and its parameters

# Defining the grid search
svc = SVC()
parameters = [
    {"kernel": ["rbf"], "gamma": [1e-4], "C": [200, 100, 10, 1.0, 0.01]}
]

# Performing the cross-validation with tuned parameters
cross_val = StratifiedKFold(n_splits = 3, random_state = 10, shuffle = True)

# Performing the grid search
grid = GridSearchCV(estimator = svc, param_grid = parameters, cv = cross_val, scoring = 'accuracy')

# SVC Hyperparameter tuning result
result = grid.fit(X_train, Y_train)

analyze_grid(result)

Output

Tuned hyperparameters:  {'C': 1.0, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy Score: 0.7695158871629459
Mean: 0.745607333842628, Std: 0.019766615171568313 * 2, Params: {'C': 200, 'gamma': 0.0001, 'kernel': 'rbf'}
Mean: 0.7521291344820756, Std: 0.02368565638376449 * 2, Params: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
Mean: 0.7542370483546955, Std: 0.046474062764375476 * 2, Params: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
Mean: 0.7695158871629459, Std: 0.016045599935252022 * 2, Params: {'C': 1.0, 'gamma': 0.0001, 'kernel': 'rbf'}
Mean: 0.650001414707297, Std: 0.002707677330225552 * 2, Params: {'C': 0.01, 'gamma': 0.0001, 'kernel': 'rbf'}
The classification Report:
              precision    recall  f1-score   support

           0       0.74      0.88      0.80       201
           1       0.64      0.42      0.51       107

    accuracy                           0.72       308
   macro avg       0.69      0.65      0.66       308
weighted avg       0.71      0.72      0.70       308

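The best cross-validation accuracy of the SVC model is 0.769, slightly lower than that of logistic regression, and its test accuracy is also lower (0.72 versus 0.76). We therefore set this model aside.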
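An RBF kernel depends on distances between samples, so features measured on large scales can dominate the result. One common remedy is to standardize the features inside a pipeline, so that the scaler is refitted on each cross-validation fold. The sketch below is an illustration on synthetic data; the step names and the parameter grid are assumptions, not the tutorial's settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / Y_train
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # refitted inside every CV fold
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as "<step>__<parameter>"
param_grid = {"svc__C": [10, 1.0, 0.1], "svc__gamma": ["scale", 1e-2]}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=10)
search = GridSearchCV(pipe, param_grid=param_grid, cv=cv, scoring="accuracy")
search.fit(features, labels)

print(search.best_params_, search.best_score_)
```

Whether scaling would change the ranking of the models on the diabetes data is not something the tutorial's results show.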

Tuning the Hyperparameters of RandomForestClassifier

Code

# Defining the RandomForestClassifier model and its parameters

# Defining the grid search
rfc = RandomForestClassifier(random_state = 42)
parameters = { 
    'n_estimators': [500],
    'max_features': ['log2'],
    'max_depth' : [4,5,6],
    'criterion' :['entropy']
}

# Performing the cross-validation with tuned parameters
cross_val = StratifiedKFold(n_splits = 3, random_state = 10, shuffle = True)

# Performing the grid search
grid = GridSearchCV(estimator = rfc, param_grid = parameters, cv = cross_val, scoring = 'accuracy')

# RandomForestClassifier Hyperparameter Tuning Result
result = grid.fit(X_train, Y_train)

analyze_grid(result)

Output

Tuned hyperparameters:  {'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 500}
Accuracy Score: 0.7717369776193306
Mean: 0.7673938262173556, Std: 0.0027915297477680364 * 2, Params: {'criterion': 'entropy', 'max_depth': 4, 'max_features': 'log2', 'n_estimators': 500}
Mean: 0.7717369776193306, Std: 0.005382324516419591 * 2, Params: {'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 500}
Mean: 0.7652151769798828, Std: 0.02135846347536185 * 2, Params: {'criterion': 'entropy', 'max_depth': 6, 'max_features': 'log2', 'n_estimators': 500}
The classification Report:
              precision    recall  f1-score   support

           0       0.76      0.87      0.81       201
           1       0.66      0.50      0.57       107

    accuracy                           0.74       308
   macro avg       0.71      0.68      0.69       308
weighted avg       0.73      0.74      0.73       308
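A fitted random forest also reports how much each feature contributed to its splits through the feature_importances_ attribute. The sketch below is an illustration on synthetic data in which only the first two columns carry signal; the column names are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features in the first columns
features, labels = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(6)]   # hypothetical column names

forest = RandomForestClassifier(
    n_estimators=200, max_depth=5, criterion="entropy", random_state=42
)
forest.fit(features, labels)

# Importances sum to 1; larger values mean the feature was used more in splits
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the tuned model above, the same attribute is available on result.best_estimator_.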

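Predicting Unseen Data

We have spent time on exploratory data analysis, cross-validation of machine learning algorithms, and hyperparameter tuning to find the model that best fits our dataset. The tuned random forest and logistic regression models have nearly identical cross-validation scores (0.772), but logistic regression has the highest accuracy on the test set (0.76), so we use it to make predictions.

Code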

# Making the predictions
Y_pred = lr_result.predict(X_test)
print(classification_report(Y_test, Y_pred))

Output

              precision    recall  f1-score   support

           0       0.78      0.88      0.83       201
           1       0.70      0.53      0.61       107

    accuracy                           0.76       308
   macro avg       0.74      0.71      0.72       308
weighted avg       0.75      0.76      0.75       308
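The figures in this report all derive from the confusion matrix. The sketch below reproduces them from assumed counts (177 true negatives, 24 false positives, 50 false negatives, 57 true positives). These counts are not given in the source; they are chosen only because they are consistent with the rounded report above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Assumed counts, chosen only to be consistent with the rounded report above
tn, fp, fn, tp = 177, 24, 50, 57

# Rebuild label arrays that produce exactly these counts
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

matrix = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
print(matrix)

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

A recall of 0.53 for class 1 means that roughly half of the diabetic patients in the test set are missed, which matters more in a screening setting than the overall accuracy suggests.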

Finally, we append a new column named predictions to the test dataset and print it.

Code

X_test['predictions'] = Y_pred
print(X_test)

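Output

Summary Report

Diabetes is one of the risks during pregnancy, and it must be diagnosed early to avoid complications.
Higher glucose levels are closely associated with a higher likelihood of diabetes.
The tuned logistic regression model gave the highest accuracy on the test set.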
