Telecom Customer Churn Analysis

January 5, 2025 | 11 min read

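In this tutorial, we show how to build simple, practical models to analyze churn using the Kaggle telecom customer dataset.

Background and problem statement
Data summary and exploratory analysis
Data analysis
Strategy recommendations

Limitations and directions for future research are also covered along the way.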

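Background

With the number of consumers using phone services growing sharply, telecom marketing departments aim to retain existing customers and keep them from terminating their contracts, while also attracting new ones. To grow its customer base, a telecom company must acquire customers faster than it loses them to churn. Better pricing, faster internet connections, and a more secure online experience offered by competitors are some of the reasons existing customers leave a telecom company.

A high churn rate hurts a company's profits and hinders growth. Churn prediction lets a telecom company gauge how effectively it retains existing customers and identify the underlying reasons customers terminate their contracts.

Through this research, a telecom company can determine whether its products hold an advantage over those of its competitors. Because acquiring a new customer generally costs more than retaining an existing one, the company can use churn research to offer discounts, exclusive deals, and better products to keep existing customers.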

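Dataset

The telecom company's dataset comes from the IBM sample dataset collection and is available on Kaggle. The company serves 7,043 internet and home phone service customers in California. Our challenge is to help the company predict customer behavior so it can retain customers, and to analyze all relevant customer data to design targeted retention campaigns.

The dataset includes the following details:

Customer demographics, such as gender, senior-citizen status, and whether the customer has a partner or dependents
Account details, such as the number of months the customer has been with the company, paperless billing, payment method, monthly charges, and total charges
Service usage, such as whether the customer streams TV or movies
Services the customer has signed up for, including phone, internet, multiple lines, online security, online backup, device protection, and tech support
Churn, that is, whether the customer left within the last month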

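Research objectives

Which factors matter most for customer retention?
Which analytical model predicts customer churn most accurately?
What are the advantages and disadvantages of the different analytical models?
What targeted retention plans can the telecom company build from the information we provide?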

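Rationale for the study

Our churn study matters because it helps the telecom company understand why customers stop using its products or services. Without knowing how much revenue is lost to cancellations, which customers are cancelling, and why, the company will struggle to improve its products and services.

Churn analysis is a common classification problem in supervised learning, so we will analyze churn behavior using simple linear regression, binomial logistic regression, binomial logit regression, and random forest.

By focusing on customer demographics, account details, usage patterns, and subscribed services, our study gives the company guidance on how to reduce churn.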

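Exploratory analysis and data summary

The secondary data we examined is available on Kaggle, a free data-sharing platform.

The output below shows part of the data.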

<bound method NDFrame.describe of       customerID  gender  SeniorCitizen Partner Dependents  tenure  \
   7590-VHVEG  Female              0     Yes         No       1   
   5575-GNVDE    Male              0      No         No      34   
   3668-QPYBK    Male              0      No         No       2   
   7795-CFOCW    Male              0      No         No      45   
   9237-HQITU  Female              0      No         No       2   
...          ...     ...            ...     ...        ...     ...   
6840-RESVB    Male              0     Yes        Yes      24   
2234-XADUH  Female              0     Yes        Yes      72   
4801-JZAZL  Female              0     Yes        Yes      11   
8361-LTMKD    Male              1     Yes         No       4   
3186-AJIEK    Male              0      No         No      66   

     PhoneService     MultipleLines InternetService OnlineSecurity  ...  \
            No  No phone service             DSL             No  ...   
           Yes                No             DSL            Yes  ...   
           Yes                No             DSL            Yes  ...   
            No  No phone service             DSL            Yes  ...   
           Yes                No     Fiber optic             No  ...   
...           ...               ...             ...            ...  ...   
        Yes               Yes             DSL            Yes  ...   
        Yes               Yes     Fiber optic             No  ...   
         No  No phone service             DSL            Yes  ...   
        Yes               Yes     Fiber optic             No  ...   
        Yes                No     Fiber optic            Yes  ...   
...
  No  
 Yes  
  No  

[7043 rows x 21 columns]>

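Data overview

After reading the data with pandas in Python, we found that pandas reports no missing values in the raw dataset and that most features, including gender, phone service, and payment method, are categorical. Monthly charges and total charges are numeric quantities.

Correlation

After converting all categorical data with label encoding, we ran a pairwise correlation analysis across the features.

The heatmap shows a strong association between the Contract and tenure features. This makes sense, since both reflect how committed a customer is.

MultipleLines, StreamingTV, StreamingMovies, and MonthlyCharges are also strongly associated. We believe this is because people who like watching movies also tend to watch TV. Because streaming TV shows and movies consumes a lot of data, these customers' monthly charges tend to be higher. Customers with multiple phone lines are also likely to have higher bills than those with a single line.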
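
The encode-then-correlate step described above can be sketched roughly as follows. This is an illustration on a small invented frame, not the tutorial's data; the values are made up, and LabelEncoder is assumed as the encoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny invented sample that mimics a few Telco columns (not real data).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year", "One year"],
    "tenure": [1, 24, 60, 3, 72, 30],
    "StreamingTV": ["No", "Yes", "Yes", "No", "Yes", "No"],
    "MonthlyCharges": [29.85, 80.10, 105.50, 35.20, 110.00, 60.75],
})

encoded = toy.copy()
for col in ["Contract", "StreamingTV"]:
    # LabelEncoder maps each category to an integer code (sorted order).
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

corr = encoded.corr()  # pairwise Pearson correlation between all columns
print(corr.round(2))
```

Label encoding assigns integer codes in sorted order, so correlations involving nominal columns with more than two categories should be read with some caution.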

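Data analysis and key findings

The four techniques we chose for the data are simple linear regression, binomial logistic regression, binomial logit regression, and random forest.

Model overview

We start with simple linear regression, our first choice. This model predicts the target as a weighted sum of the input features. Linear regression serves as our accuracy baseline and point of comparison, and its simplicity accounts for most of its strengths and weaknesses.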
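
To make the "weighted sum" idea concrete, here is a minimal sketch using ordinary least squares in NumPy. The data is invented, and thresholding the score at 0.5 is an assumption for turning a regression output into a class label, not a step taken from the tutorial.

```python
import numpy as np

# Invented toy data: columns are [tenure in months, monthly charge].
X = np.array([[1, 70.0], [5, 90.0], [40, 60.0],
              [60, 25.0], [72, 100.0], [2, 80.0]])
y = np.array([1, 1, 0, 0, 0, 1])  # 1 = churned, 0 = stayed

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The prediction is the weighted sum: w0 + w1 * tenure + w2 * charge.
scores = X1 @ weights
labels = (scores >= 0.5).astype(int)  # assumed threshold to get a class
print(weights, labels)
```

The scores of a linear model are not confined to the range 0 to 1, which is one reason it is better suited as a baseline than as the final churn classifier.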

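Random forest, our fourth and final model, is a widely used machine learning model. A random forest is made up of many individual decision trees that work together as an ensemble.

For our case, the advantages are: (1) It usually delivers high accuracy and strikes a good balance between bias and variance. (2) It can be used to visualize feature importance. (3) Outliers have little or no effect on it. (4) It handles both linear and nonlinear relationships. The disadvantages are: (1) It is harder to interpret than the earlier models. (2) It takes longer to run on large datasets.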
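
As a rough illustration of the points above, the sketch below fits scikit-learn's RandomForestClassifier and reads its feature importances. The data is synthetic (generated with make_classification) and only stands in for the encoded churn features; the hyperparameters are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded churn features (not the Telco data).
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances; they are normalized to sum to 1.
importances = forest.feature_importances_
test_accuracy = forest.score(X_test, y_test)
print(importances.round(3), round(test_accuracy, 3))
```

Plotting the importances as a bar chart is one common way to get the feature-importance visualization mentioned in advantage (2).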

Source code

import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import warnings
warnings.filterwarnings("ignore")
#import lux
df=pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()
df.shape
df.columns
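df.describe()  # call the method; without parentheses pandas only shows the bound method
df.info(verbose=True, show_counts=True, memory_usage=True)  # show_counts replaces the removed null_counts argument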

s = df.shape
print(f'The dataset contains {s[0]} rows and {s[1]-1} independent columns and 1 target variable')
# Convert SeniorCitizen to object and TotalCharges to float
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['SeniorCitizen'] = df['SeniorCitizen'].astype(str)

df.dtypes
# Check for missing values
df.isnull().sum()
# Note: errors='coerce' turns any non-numeric TotalCharges entry (such as a blank string) into NaN, so inspect the TotalCharges count here
#TARGET VARIABLE
df['Churn'].value_counts()
df['Churn'].value_counts(normalize=True)

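# normalize=True returns the relative frequency of each unique value, giving proportions instead of raw counts.

Churned customers are far fewer than retained ones: in this dataset, about 26% of customers left the telecom service.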
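
With an imbalance like this, accuracy alone can be misleading. The sketch below, which uses invented labels in roughly the 74/26 proportion described above, shows the accuracy a model would reach by always predicting the majority class.

```python
import pandas as pd

# Invented labels with roughly the 74/26 split described above.
churn = pd.Series(["No"] * 74 + ["Yes"] * 26, name="Churn")

shares = churn.value_counts(normalize=True)
majority_class = shares.idxmax()
# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = shares.max()
print(majority_class, baseline_accuracy)
```

Any classifier worth keeping should beat this baseline, which is why the listings further down also print precision, recall, and F1 through classification_report.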

Now let's visualize each variable individually. The variables are categorical, ordinal, or numerical.

Categorical variables

customerID (an identifier, not used as a feature)

gender
Partner
Dependents
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
StreamingMovies
Contract
PaperlessBilling
PaymentMethod

Ordinal variables

SeniorCitizen (a binary variable; whether it counts as ordinal may depend on the context)

Numerical variables

tenure
MonthlyCharges
TotalCharges

# drop customer because it's just the customer ID
df.drop(['customerID'], axis=1, inplace=True)

Data visualization: independent variables (categorical), checking for outliers

Source code

import matplotlib.pyplot as plt
import seaborn as sns
object_columns = df.select_dtypes(include='object').columns
num_subplots = len(object_columns)
# Create subplots dynamically based on the number of object columns
fig, axes = plt.subplots((num_subplots + 2) // 3, 3, figsize=(20, 20))
axes = axes.flatten()
for i, col in enumerate(object_columns):
    ax = axes[i]
    ax.set_title(f'Subplot {i + 1}')
    # Using Seaborn countplot
    sns.countplot(x=col, data=df, ax=ax)
    # Annotating each bar with the percentage
    total = len(df[col])
    for p in ax.patches:
        height = p.get_height()
        percentage = f'{height / total:.2%}'
        ax.annotate(percentage,
                    xy=(p.get_x() + p.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

# Adjust layout and show the plot
plt.tight_layout()
plt.show()

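Output

Observation: no outliers were found in the categorical columns.

Detecting outliers in categorical columns differs from detecting them in numerical columns. Categorical columns usually have no notion of an "outlier" in the numerical sense. However, you can check for unusual or rare categories that might be treated as outliers because of their frequency.

Here are some approaches:

Value counts: use value_counts() to check the distribution of each category in a categorical column. If one category is markedly less frequent than the others, you might consider it unusual or rare.
Bar charts: visualize the distribution of categories with a bar chart. This helps you quickly spot low-frequency categories.
Rare category aggregation: if some categories have very low frequency, consider merging them into a single category to simplify the analysis.
Check for missing values: missing values in a categorical column can sometimes be treated as a special category. Check for any unexpected missing values.

Keep in mind that the definition of an "outlier" in a categorical column is somewhat subjective and depends on the context of your data. The goal is to identify categories that are rare or show unusual patterns.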
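
The value-count and rare-category ideas above can be sketched like this. The column values are invented, and the 5% cut-off is a hypothetical threshold that would need tuning for real data.

```python
import pandas as pd

# Invented payment-method column with one rare category.
payments = pd.Series(
    ["Electronic check"] * 50 + ["Mailed check"] * 30
    + ["Credit card"] * 19 + ["Money order"] * 1
)

freq = payments.value_counts(normalize=True)
threshold = 0.05  # hypothetical cut-off for "rare"
rare = freq[freq < threshold].index

# Keep frequent categories and replace rare ones with a single "Other" bucket.
grouped = payments.where(~payments.isin(rare), "Other")
print(grouped.value_counts())
```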

Independent variables (numerical), checking for outliers

num_cols=df.select_dtypes(["int", "float"]).columns
num_cols
categorical_cols=df.select_dtypes(["object", "bool"]).columns
fig, axes = plt.subplots(nrows=1, ncols =3, figsize=(25,20))
plt.subplots_adjust(hspace=0.5)
for i , feature in enumerate(num_cols):
    sns.boxplot(data=df, y=feature, ax = axes[i], orient='v')
    axes[i].set_title(f" Distribution for {feature}")
plt.tight_layout()
plt.show()
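
Boxplots conventionally draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same check can be done numerically; the sketch below uses a hypothetical helper, iqr_outliers, on invented tenure values.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Invented tenure values; 500 is planted as an obvious outlier.
tenure = pd.Series([1, 2, 12, 24, 34, 45, 60, 72, 500])
outliers = iqr_outliers(tenure)
print(outliers.tolist())
```

An empty result for a column corresponds to a boxplot with no points drawn beyond the whiskers.
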
# Insights: no outliers found in the numerical columns
# Categorical independent variables vs. target variable
fig, axes = plt.subplots(nrows=5, ncols =4)
plt.subplots_adjust(hspace=0.9)
for i , feature in enumerate(categorical_cols):
    row_index = i//4
    col_index = i%4
    plot=pd.crosstab(df[feature],df['Churn'])
    plot.div(plot.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(12,20), ax = axes[row_index, col_index])
    axes[row_index, col_index].set_title(f" Distribution for {feature}")
plt.tight_layout()
plt.show()
# Numerical independent variables vs. target variable
plt.figure(1)
plt.subplot(1, 2, 1)
a = df.groupby('Churn')['tenure'].median().plot.bar()
plt.bar_label(a.containers[0])
plt.subplot(1, 2, 2)  # second panel of the same figure
a = df.groupby('Churn')['TotalCharges'].median().plot.bar()
plt.bar_label(a.containers[0])
plt.show()
# MonthlyCharges
# Encode the target variable Churn: "Yes" becomes 1, "No" becomes 0
df['Churn'] = df['Churn'].map({"No": 0, "Yes": 1})

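Now let's look at the correlation between all the numerical variables. We'll use a heatmap, which visualizes data through color variation. Darker cells indicate stronger correlation.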

matrix = df[df.select_dtypes(["int","float"]).columns].corr()
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(matrix, vmax=.8, square=True, cmap="BuPu")

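Domain-specific analysis

Given the domain and business context, investigate the reasons behind the positive correlations. Are there specific business practices or causes that explain the relationship? Understanding the context can yield valuable insight.

Predictive modeling

If your goal is to build a predictive model, consider whether using both tenure and TotalCharges as features is redundant given how strongly they are correlated. In some cases you can keep just one of the features or apply a dimensionality reduction technique.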
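
The modeling listings that follow use X_train, X_test, y_train, and y_test, but this excerpt does not show how they were created. A minimal sketch of one way to build them, assuming label encoding (as mentioned in the correlation section) and the stratified 75/25 split used in the later listings; df_toy and its values are invented stand-ins for the cleaned DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Invented stand-in for the cleaned DataFrame (customerID already dropped).
df_toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"] * 10,
    "tenure": [1, 60, 24, 3] * 10,
    "MonthlyCharges": [70.0, 25.0, 55.0, 90.0] * 10,
    "Churn": [1, 0, 0, 1] * 10,
})

# Encode the categorical column as integer codes (assumed encoder: LabelEncoder).
df_toy["Contract"] = LabelEncoder().fit_transform(df_toy["Contract"])

X = df_toy.drop(columns="Churn")  # features
y = df_toy["Churn"]               # target

# stratify=y keeps the churn ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

On the real data, any rows with missing values (for example NaN in TotalCharges after the numeric conversion) would need to be dropped or imputed first for estimators that do not accept NaN.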

## XGBOOST
from xgboost import XGBClassifier
xgb_model = XGBClassifier().fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## XGBoost tuning
xgb = XGBClassifier()
xgb_params = {"n_estimators": [50, 100, 300], "subsample":[0.5,0.8,1], "max_depth":[3,5,7], "learning_rate":[0.1,0.01,0.3]}
xgb_cv_model = GridSearchCV(xgb, xgb_params, cv = 3, n_jobs = -1, verbose = 2).fit(X_train, y_train)

xgb_cv_model.best_params_
# Hard-coded settings; note that n_estimators=450 is not in the grid above,
# so these values do not come straight from best_params_.
xgb_tuned = XGBClassifier(learning_rate= 0.01, max_depth= 5, n_estimators= 450, subsample= 0.5).fit(X_train, y_train)

y_pred = xgb_tuned.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## SVM Support vector classifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
svc_model_sc = SVC().fit(X_train_s, y_train)
y_pred = svc_model_sc.predict(X_test_s)
cnf_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix(y_test, y_pred))
sns.heatmap(cnf_matrix, annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
print(classification_report(y_test, y_pred))
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf', 'linear']}
from sklearn.model_selection import GridSearchCV
svc_tuned = GridSearchCV(SVC(),param_grid, verbose=3, refit=True)
svc_tuned.fit(X_train_s, y_train)
print(svc_tuned.best_params_)
print(svc_tuned.best_estimator_)
y_pred = svc_tuned.predict(X_test_s)
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)
sns.heatmap(cnf_matrix, annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
print(classification_report(y_test,y_pred))
## Logistic Regression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
log_model_sc = LogisticRegression()
log_model_sc.fit(X_train_s, y_train)
y_pred = log_model_sc.predict(X_test_s)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
## KNN
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
knn_model_sc = KNeighborsClassifier(n_neighbors=1).fit(X_train_s, y_train)

y_pred = knn_model_sc.predict(X_test_s)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
error_rate = []
for i in range(1, 40):
    model = KNeighborsClassifier(n_neighbors = i)
    model.fit(X_train_s, y_train)
    y_pred_i = model.predict(X_test_s)
    error_rate.append(np.mean(y_pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
knn_model_sc_tuned = KNeighborsClassifier(n_neighbors=38).fit(X_train_s, y_train)
y_pred = knn_model_sc_tuned.predict(X_test_s)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
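import sweetviz  # third-party package needed for the comparison report
# Both frames are read from the same CSV here, so the report compares the data with itself.
train = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
test = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
my_report = sweetviz.compare([train, "Train"], [test, "Test"], "Churn")
my_report.show_html("Report.html") # Not providing a filename will default to SWEETVIZ_REPORT.html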
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf', 'linear']}
from sklearn.model_selection import GridSearchCV
svc_tuned = GridSearchCV(SVC(),param_grid, verbose=3, refit=True)
svc_tuned.fit(X_train_s, y_train)
print(svc_tuned.best_params_)
print(svc_tuned.best_estimator_)
y_pred = svc_tuned.predict(X_test_s)
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)
sns.heatmap(cnf_matrix, annot=True, cmap="YlGnBu",fmt='d')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
print(classification_report(y_test,y_pred))

Output

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   2.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.8s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.8s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.735 total time=   1.9s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.734 total time=   1.5s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.823 total time=   0.4s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.809 total time=   0.3s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.800 total time=   0.4s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.791 total time=   0.4s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.806 total time=   0.3s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.801 total time=   0.8s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.783 total time=   0.8s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.794 total time=   0.8s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.769 total time=   0.8s
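
The first line of the log, "50 candidates, totalling 250 fits", follows directly from the size of the parameter grid. A short sketch using scikit-learn's ParameterGrid reproduces the arithmetic; the fold count of 5 is GridSearchCV's default when cv is not set, as in the SVC listing.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "linear"],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 5 * 2 combinations
n_folds = 5  # GridSearchCV default when cv is not specified
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 50 250
```

Since gamma has no effect on the linear kernel, the 25 linear candidates cover only 5 distinct models. Passing a list of two grids, one per kernel, would avoid the repeated fits.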

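Limitations

Our models, dataset, and study are subject to the following limitations.
Although the number of observations is reasonable, we could probably learn more from the results with additional attribute columns covering customer location, competitor data, and other relevant information.
More powerful models exist, but we chose ours not only for their complexity and predictive power but, more importantly, for how easy they are to interpret. For example, neural networks or strong gradient boosting methods might perform better and yield higher accuracy.
Our dataset is cross-sectional, meaning it has no time-series component. Our goal is to predict churn across month-to-month, one-year, and two-year contracts. To improve our ability to forecast and to judge where the market is heading, a time-series dataset with up to two years of customer data would be ideal.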
