Customer Segmentation in Machine Learning

June 24, 2025 | 12 min read

Customer Segmentation Using Machine Learning

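Customer segmentation divides a customer base into groups of individuals who are similar in ways that matter for marketing, such as age, gender, interests, and spending habits. It lets companies target specific groups with tailored promotions, products, or services that are most likely to resonate with them. Machine learning has become a popular tool for automating customer segmentation, offering a more efficient way to identify patterns and relationships in customer data.

There are several approaches to customer segmentation with machine learning, including:

Clustering algorithms: These algorithms divide customers into groups based on their characteristics and behavior. For example, k-means clustering partitions a dataset into k clusters, where k is chosen in advance.
Decision trees: These algorithms use a tree-like model to identify the variables that most influence customer behavior. With decision trees, companies can determine which customers are most likely to respond to particular marketing campaigns or products.
Neural networks: These algorithms can model complex relationships between customers and their behavior. Neural networks can pick up patterns in customer data that traditional methods do not easily identify.
Association rule learning: This method looks for relationships between customer attributes and behaviors, such as buying habits and product preferences. Association rule learning can help companies understand which products are frequently bought together and target customers accordingly.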
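As a minimal sketch of the clustering approach, the snippet below runs k-means on a hypothetical toy array. The `customers` values are invented for illustration and do not come from the dataset used later. Note that the number of clusters is supplied by the analyst rather than discovered by the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: [annual income (k$), annual spend (k$)] for six customers
customers = np.array([
    [20, 2], [22, 3], [25, 2],      # lower income, lower spend
    [80, 30], [85, 28], [90, 35],   # higher income, higher spend
])

# k (n_clusters) is chosen by the analyst; k-means does not discover it
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)

print(labels)                    # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # one centroid per cluster
```

The cluster indices themselves are arbitrary; only the grouping carries meaning.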

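Advantages of Machine Learning for Customer Segmentation

A key benefit of using machine learning for customer segmentation is its ability to process very large volumes of data, often in near real time. This lets companies quickly spot new trends and patterns in customer behavior and make better-informed marketing decisions. In addition, machine learning models can be retrained as new data arrives, so the picture of customer behavior becomes more accurate over time.
Another benefit is that machine learning reduces the need for manual data analysis, which can be time-consuming and error-prone, especially with large datasets. Machine learning algorithms can automate much of the analysis and give companies more consistent and reliable results.

Now we will perform unsupervised clustering on customer records from a grocery firm's database.

Importing Libraries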

# Importing the Libraries
import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
np.random.seed(42)

Loading the Data

# Loading the dataset
dataset = pd.read_csv("marketing_campaign.csv", sep="\t")
print("Number of datapoints in the dataset:", len(dataset))
dataset.head()

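Output

Data Cleaning

In this section, we will perform the following tasks:

Data cleaning
Feature engineering

To understand which cleaning steps the dataset needs, let's first look at the information it contains.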

# Here we need to get the information about the features(column name) of the dataset
dataset.info()

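Output

From the output above, we can note the following:

Income has some missing values (there are only 2194 non-null values).
"Dt_Customer", which holds the date a customer was added to the database, is not parsed as a DateTime.
The dataframe contains some categorical features (those with dtype: object), so we will need to encode them into numeric form later.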
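The missing-value observation above can be checked directly rather than read off the `info()` listing. The sketch below uses a hypothetical four-row table (`toy` and its values are invented for illustration) to count missing entries per column and to drop only the rows where "Income" is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the customer table
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58000.0, np.nan, 71000.0, np.nan],
    "Kidhome": [0, 1, 0, 2],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where Income is missing
cleaned = toy.dropna(subset=["Income"])
print(len(cleaned))
```

Passing `subset` restricts the check to the named columns, whereas a bare `dropna()` removes any row with a missing value in any column. The two are equivalent only when "Income" is the sole column with gaps.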

We will start by dropping the rows with missing income values.

# Remove the rows with missing values using .dropna()
dataset = dataset.dropna()
print(f"Number of datapoints after removing rows with missing values: {len(dataset)}")

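Output

The next step is to create a feature from "Dt_Customer" that indicates how long each customer has been registered in the firm's database. To keep things simple, we measure this relative to the most recent customer in the records.

To get these values, we must therefore find the newest and oldest recorded dates.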

# Assumption: the dates in this file are written day-first (for example 21-08-2013),
# so the format is given explicitly to avoid day/month ambiguity
dataset["Dt_Customer"] = pd.to_datetime(dataset["Dt_Customer"], format="%d-%m-%Y")
dates = [i.date() for i in dataset["Dt_Customer"]]
# Dates of the most recent and oldest customer enrollments on record
newest_date = max(dates)
print(f"Date of the most recent customer's enrollment in the records: {newest_date}")
oldest_date = min(dates)
print(f"Date of the oldest customer's enrollment in the records: {oldest_date}")

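Output

Create a feature ("Customer_For_How_Much_Time") that counts the number of days each customer had been shopping with the firm as of the last recorded date.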

# Create the feature "Customer_For_How_Much_Time": the number of days between
# each customer's enrollment and the most recent enrollment on record
d_1 = max(dates)  # enrollment date of the newest customer
days = [(d_1 - i).days for i in dates]
dataset["Customer_For_How_Much_Time"] = days
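The same tenure feature can also be computed without a Python loop. The following is a small sketch on hypothetical dates, assuming they are written day-first; the frame `toy` and the column name `Customer_For_Days` are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical enrollment dates, assumed to be written day-first
toy = pd.DataFrame({"Dt_Customer": ["04-09-2012", "21-08-2013", "29-06-2014"]})

# An explicit format avoids day/month ambiguity
toy["Dt_Customer"] = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")

# Days between each enrollment and the most recent enrollment on record
newest = toy["Dt_Customer"].max()
toy["Customer_For_Days"] = (newest - toy["Dt_Customer"]).dt.days
print(toy)
```

Subtracting two datetime columns gives a timedelta column, and `.dt.days` converts it to whole days. Because the features are standardized later, the unit does not affect the clustering, but it does affect how the axis reads in plots.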

To understand the data better, we will now look at the unique values in the categorical features.

print("Total categories for the Marital Status feature:\n", dataset["Marital_Status"].value_counts(), "\n")
print("Total categories for the feature Education:\n", dataset["Education"].value_counts())

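Output

In the next part, we will carry out the following steps to engineer some new features:

Extract the customer's "Age" from "Year_Birth".
Add a new feature, "Spent", showing the customer's total spending across all categories over the two-year period.
Create "Living_With" from "Marital_Status" to capture whether the customer lives with a partner.
Create "Children", showing the total number of kids and teenagers living in the household.
For a clearer view of the household, add a feature for "Family_Size".
Create "Is_Parent" to indicate whether the customer is a parent.
Finally, simplify "Education" into three categories by reducing its value counts.
Drop some redundant features.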
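The household-related steps listed above can be sketched on a few hypothetical rows. The column names match the dataset, while the values and the frame `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the same column names as the dataset
toy = pd.DataFrame({
    "Year_Birth": [1957, 1984, 1990],
    "Marital_Status": ["Single", "Married", "Together"],
    "Kidhome": [0, 1, 0],
    "Teenhome": [1, 1, 0],
})

toy["Age"] = 2021 - toy["Year_Birth"]                     # age as of 2021
toy["Children"] = toy["Kidhome"] + toy["Teenhome"]        # minors in the household
partnered = toy["Marital_Status"].isin(["Married", "Together"])
toy["Family_Size"] = np.where(partnered, 2, 1) + toy["Children"]  # adults + children
toy["Is_Parent"] = (toy["Children"] > 0).astype(int)      # 1 if any children
print(toy)
```

This sketch treats only "Married" and "Together" as living with a partner, which mirrors the mapping used in the tutorial code.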

# Engineering Features

# Age of the customer as of 2021
dataset["Age"] = 2021 - dataset["Year_Birth"]

# Total spending across all product categories
dataset["Spent"] = dataset["MntWines"] + dataset["MntFruits"] + dataset["MntMeatProducts"] + dataset["MntFishProducts"] + dataset["MntSweetProducts"] + dataset["MntGoldProds"]

# Living situation derived from marital status
dataset["Living_With"] = dataset["Marital_Status"].replace({"Married": "Partner", "Together": "Partner", "Absurd": "Alone", "Widow": "Alone", "YOLO": "Alone", "Divorced": "Alone", "Single": "Alone"})

# Total number of children (kids and teenagers) in the household
dataset["Children"] = dataset["Kidhome"] + dataset["Teenhome"]

# Total number of household members; map() returns a numeric column directly
dataset["Family_Size"] = dataset["Living_With"].map({"Alone": 1, "Partner": 2}) + dataset["Children"]

# Whether the customer is a parent
dataset["Is_Parent"] = np.where(dataset["Children"] > 0, 1, 0)

# Group education levels into three categories
dataset["Education"] = dataset["Education"].replace({"Basic": "Undergraduate", "2n Cycle": "Undergraduate", "Graduation": "Graduate", "Master": "Postgraduate", "PhD": "Postgraduate"})

# Rename the spending columns for clarity
dataset = dataset.rename(columns={"MntWines": "Wines", "MntFruits": "Fruits", "MntMeatProducts": "Meat", "MntFishProducts": "Fish", "MntSweetProducts": "Sweets", "MntGoldProds": "Gold"})

# Drop the features that are no longer needed
to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
dataset = dataset.drop(to_drop, axis=1)

Now that we have some additional features, let's look at the summary statistics of the data.
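The call that produces these statistics is not shown in the text. A common way to get them is `DataFrame.describe()`, sketched here on a hypothetical stand-in table; `toy` and its values are invented and are not the tutorial's data.

```python
import pandas as pd

# Hypothetical stand-in for the engineered customer table
toy = pd.DataFrame({
    "Income": [30000, 52000, 75000, 666666],
    "Age": [34, 45, 52, 128],
    "Spent": [120, 640, 1500, 60],
})

# Transposed so that each feature is a row and each statistic a column
summary = toy.describe().T
print(summary[["mean", "min", "max"]])
```

Comparing the mean with the max column is a quick way to spot features where a few extreme values stand far from the rest.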

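Output

The statistics above show a large gap between the mean and the maximum for both income and age.

Note that the maximum age is 128. Age is computed relative to 2021 and the data is old, so a value this high points to an implausible birth year in the records.

We need a broader view of the data, so we will plot some selected features.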

# Plot a few selected features
# Set the colour preferences
sns.set(rc={"axes.facecolor": "#FFF9ED", "figure.facecolor": "#FFF9ED"})
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
cmap = colors.ListedColormap(pallet)
# Features to plot, with Is_Parent used as the hue
to_be_plotted = ["Income", "Recency", "Customer_For_How_Much_Time", "Age", "Spent", "Is_Parent"]
print("Pair plot of some selected features: a data subset")
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(dataset[to_be_plotted], hue="Is_Parent", palette=["#682F2F", "#F3AB60"])
plt.show()

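Output

Pair plot of some selected features: a data subset

Clearly, the Income and Age features contain a few outliers. We will remove them from the data.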

# Remove the outliers by filtering on age and income
dataset = dataset[dataset["Age"] < 90]
dataset = dataset[dataset["Income"] < 600000]
print(f"Number of datapoints after removing the outliers: {len(dataset)}")

Output

Now let's examine the correlations among the features (excluding the categorical features at this point).

# Correlation matrix (numeric columns only, so the categorical features are excluded)
corrmat = dataset.corr(numeric_only=True)
plt.figure(figsize=(20, 20))
sns.heatmap(corrmat, annot=True, cmap=cmap, center=0)

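Output

The new features are in place and the data is fairly clean. We will move on to the next stage: preparing the data.

Data Preprocessing

In this section, we will preprocess the data so that it is ready for clustering.

The data is preprocessed with the following steps:

Label encoding the categorical features
Scaling the features with the standard scaler
Creating a subset dataframe for dimensionality reduction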
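The first two steps above can be sketched on a hypothetical two-column table. The frame `toy` and its values are invented; `LabelEncoder` and `StandardScaler` are the same scikit-learn classes imported at the start of the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical table with one categorical and one numeric feature
toy = pd.DataFrame({
    "Education": ["Graduate", "Postgraduate", "Undergraduate", "Graduate"],
    "Income": [30000.0, 50000.0, 70000.0, 90000.0],
})

# Label encoding: classes_ lists the original labels in the order of their integer codes
encoder = LabelEncoder()
toy["Education"] = encoder.fit_transform(toy["Education"])
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Standard scaling: each column ends up with mean 0 and unit (population) variance
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Label encoding assigns codes in sorted label order, so the integers do not reflect any meaningful ranking of the categories. That is worth keeping in mind when distance-based methods are applied to the encoded columns.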

# Obtain a list of the category variables
s = (dataset.dtypes == 'object')
object_columns = list(s[s].index)

print("the dataset's categorical variables are:", object_columns)

Output

# Label encode the object-dtype columns
LE = LabelEncoder()
for i in object_columns:
    dataset[i] = LE.fit_transform(dataset[i])

print("Now, all attributes are numerical.")

Output

# making a duplicate of the data
copy_dataset = dataset.copy()
# Removing the features on deals accepted and promotions to create a subset of the dataframe
columns_to_delete = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
copy_dataset = copy_dataset.drop(columns_to_delete, axis=1)
# Scaling
standard_scaler = StandardScaler()
standard_scaler.fit(copy_dataset)
scaled_dataset = pd.DataFrame(standard_scaler.transform(copy_dataset),columns= copy_dataset.columns )
print(" Now, every feature is scaled ")

Output

# Using scaled data to reduce the dimensionality
print("Dataframe to be applied in further modelling:")
scaled_dataset.head()

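Output

Dimensionality Reduction

Dimensionality reduction is a technique used in machine learning and data science to reduce the number of features, or dimensions, in a dataset while retaining as much information as possible. The goal is to simplify the data while preserving its structure and the relationships between variables.

Principal component analysis (PCA) is a statistical technique for analyzing the structure of complex, high-dimensional datasets. It finds the directions along which the data varies most, which can then be used to reduce the dimensionality of the data and make it easier to visualize and interpret.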
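To illustrate how much information a few components can retain, the sketch below applies PCA to synthetic data in which three of the six features are noisy copies of the other three. The data is generated for illustration only; `explained_variance_ratio_` is the scikit-learn attribute that reports the share of variance captured by each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data: 200 samples, 6 features, where the last 3 are noisy copies of the first 3
base = rng.normal(size=(200, 3))
data = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

pca = PCA(n_components=3)
reduced = pca.fit_transform(data)

print(reduced.shape)                          # samples x retained components
print(pca.explained_variance_ratio_)          # share of variance per component
print(pca.explained_variance_ratio_.sum())    # total share retained
```

On real customer data the retained share is usually lower than in this contrived case, so it is worth checking before relying on a three-component projection.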

Steps in this section:

Dimensionality reduction with PCA
Plotting the reduced dataframe

#Initiating PCA to reduce dimensions, aka features, to 3
pca = PCA(n_components=3)
pca.fit(scaled_dataset)
PCA_dataset = pd.DataFrame(pca.transform(scaled_dataset), columns=(["col1","col2", "col3"]))
PCA_dataset.describe().T

Output

#A Reduced Dimensional 3D Data Projection
x =PCA_dataset["col1"]
y =PCA_dataset["col2"]
z =PCA_dataset["col3"]
# To plot
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A Reduced Dimensional 3D Data Projection")
plt.show()

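Output

Clustering

We will now perform the clustering with agglomerative clustering. Agglomerative clustering is a hierarchical clustering technique: it merges samples step by step until the desired number of clusters is reached.

Steps involved in the clustering:

Use the elbow method to determine the number of clusters to form.
Cluster the data with agglomerative clustering.
Examine the resulting clusters in a scatter plot.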

# Quick review of the elbow technique to determine how many clusters to create.
print('The amount of clusters to generate will be determined using the elbow method:')
Elbow_method = KElbowVisualizer(KMeans(), k=10)
Elbow_method.fit(PCA_dataset)
Elbow_method.show()

Output

Based on the elbow plot above, four clusters appear to be the best choice for this dataset. Next, we fit the agglomerative clustering model to obtain the final clusters.
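For readers who want to see what an elbow curve is built from, here is a minimal sketch that computes the within-cluster sum of squares, which k-means exposes as `inertia_`, for several values of k. It uses synthetic blobs rather than the tutorial's data, and it is not a reimplementation of `KElbowVisualizer`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs in 2D
centers = np.array([[0, 0], [10, 10], [0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The elbow is the value of k after which adding another cluster yields only a small further drop. In this synthetic case that happens at k = 3, the number of blobs that were generated.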

# Agglomerative Clustering model launch
aggCluster = AgglomerativeClustering(n_clusters=4)
# model fitting and cluster prediction
yhat_aggCluster = aggCluster.fit_predict(PCA_dataset)
PCA_dataset["Clusters"] = yhat_aggCluster
# The original dataframe is updated with the Clusters feature.
dataset["Clusters"]= yhat_aggCluster
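As a small illustration of how agglomerative clustering behaves, the sketch below clusters six hypothetical points that form three obvious pairs. The `points` array is invented; scikit-learn uses Ward linkage by default, which is also what the model above uses since no linkage is specified.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical points forming three tight pairs
points = np.array([
    [0.0, 0.0], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9],
    [10.0, 0.0], [10.2, 0.1],
])

# Merging proceeds bottom-up until three clusters remain (Ward linkage by default)
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(points)
print(labels)
```

Unlike k-means, this estimator offers `fit_predict` but no separate `predict` method, so assigning clusters to new customers requires refitting or a separate classifier trained on the labels.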

To examine the clusters formed, let's look at their 3D distribution.

# Plotting the clusters
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, s=40, c=PCA_dataset["Clusters"], marker="o", cmap=cmap)
ax.set_title("The Plot Of The Clusters")
plt.show()

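Output

Evaluating the Model

Because this clustering is unsupervised, there is no labeled target against which to evaluate or score the model. The goal of this section is to examine the patterns in the clusters that were formed and determine their nature.

To do so, we will use exploratory data analysis to look at the data in light of the clusters and draw conclusions.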
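Even without labels, an internal metric can complement the visual analysis. One option, which the tutorial itself does not use, is the silhouette score from `sklearn.metrics`. The sketch below computes it on synthetic blobs for several cluster counts; the data and the choice of four blobs are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Synthetic data: four well-separated blobs in 2D
centers = np.array([[0, 0], [8, 8], [0, 8], [8, 0]])
points = np.vstack([c + rng.normal(scale=0.7, size=(40, 2)) for c in centers])

# Silhouette score for each candidate number of clusters (ranges from -1 to 1)
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Higher values indicate tighter, better-separated clusters. On real customer data the scores tend to be much lower than on clean blobs, so they are best read as a relative comparison between candidate cluster counts.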

#Plotting countplot of clusters
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=dataset["Clusters"], palette= pal)
pl.set_title("Arrangement Of The Clusters")
plt.show()

Output

The clusters appear to be fairly evenly distributed.

pl = sns.scatterplot(data = dataset,x=dataset["Spent"], y=dataset["Income"],hue=dataset["Clusters"], palette= pal)
pl.set_title("Cluster's Income and Spending Profile")
plt.legend()
plt.show()

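Output

The income versus spending plot shows the following cluster patterns:

Group 0: high spending and average income
Group 1: high spending and high income
Group 2: low spending and low income
Group 3: high spending and low income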
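Patterns like these can be cross-checked numerically by averaging income and spending within each cluster. The sketch below does this on a hypothetical table; the numbers are invented and are not meant to reproduce the groups described above.

```python
import pandas as pd

# Hypothetical clustered customers (values invented for illustration)
toy = pd.DataFrame({
    "Clusters": [0, 0, 1, 1, 2, 2, 3, 3],
    "Income":   [52000, 56000, 82000, 88000, 28000, 31000, 40000, 43000],
    "Spent":    [900, 1100, 1500, 1700, 60, 90, 300, 350],
})

# Mean income and spending per cluster
profile = toy.groupby("Clusters")[["Income", "Spent"]].mean()
print(profile)
```

A table like this makes it easier to label clusters as high or low on each axis without reading values off a scatter plot.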

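Next, we will look at how spending is distributed across the clusters. Spending covers the products in the data: wines, fruits, meat, fish, sweets, and gold.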

plt.figure()
pl=sns.swarmplot(x=dataset["Clusters"], y=dataset["Spent"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=dataset["Clusters"], y=dataset["Spent"], palette=pal)
plt.show()

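Output

From the plot above, cluster 1 is our biggest customer group in terms of spending, closely followed by cluster 0. We can explore what each cluster spends on to inform targeted marketing strategies.

Next, let's see how our past campaigns performed.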

# Create a feature holding the total number of accepted promotions
dataset["Total_Promos"] = dataset["AcceptedCmp1"]+ dataset["AcceptedCmp2"]+ dataset["AcceptedCmp3"]+ dataset["AcceptedCmp4"]+ dataset["AcceptedCmp5"]
# Plotting the number of accepted campaigns overall.
plt.figure()
pl = sns.countplot(x=dataset["Total_Promos"],hue=dataset["Clusters"], palette= pal)
pl.set_title("Amount Of Accepted Promotions")
pl.set_xlabel("Total Number Of Promotions Accepted")
plt.show()

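Output

The campaigns have not drawn much of a response so far. Very few customers took part overall, and no customer accepted all five campaigns. Better-targeted and better-planned campaigns may be needed to boost sales.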
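The counts behind a plot like the one above can also be tabulated. The sketch below builds a cross-tabulation of accepted promotions by cluster on a hypothetical table with three campaign columns; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical campaign responses (1 = accepted, 0 = not accepted)
toy = pd.DataFrame({
    "Clusters":     [0, 0, 1, 1, 2, 2, 3, 3],
    "AcceptedCmp1": [0, 0, 1, 1, 0, 0, 0, 0],
    "AcceptedCmp2": [0, 0, 0, 1, 0, 0, 0, 0],
    "AcceptedCmp3": [0, 1, 0, 1, 0, 0, 0, 1],
})

# Total promotions accepted per customer
campaign_columns = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3"]
toy["Total_Promos"] = toy[campaign_columns].sum(axis=1)

# Rows: cluster, columns: number of promotions accepted, cells: customer count
counts = pd.crosstab(toy["Clusters"], toy["Total_Promos"])
print(counts)
```

Each cell shows how many customers in a cluster accepted a given number of promotions, which makes low participation easy to quantify.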

#Graphing the number of deals bought
plt.figure()
pl=sns.boxenplot(y=dataset["NumDealsPurchases"],x=dataset["Clusters"], palette= pal)
pl.set_title("Amount of Deals Bought")
plt.show()

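Output

Unlike the campaigns, the deals did well. Clusters 0 and 3 had the best outcomes. However, cluster 1, one of our biggest customer groups, shows little interest in the deals. Nothing seems to attract cluster 2 strongly.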

# For more details on the purchasing style
Places = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases", "NumWebVisitsMonth"]

for i in Places:
    # jointplot creates its own figure, so no plt.figure() call is needed
    sns.jointplot(x=dataset[i], y=dataset["Spent"], hue=dataset["Clusters"], palette=pal)
    plt.show()

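Output

Profiling

Now that the clusters have been formed and their purchasing patterns examined, let's look at who is in each cluster. To determine who our star customers are and who needs more attention from the retail store's marketing team, we will profile the clusters.

To do that, we will plot some features that describe the customers' personal traits in light of the cluster they belong to, and draw conclusions from the results.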

Personal = ["Kidhome", "Teenhome", "Customer_For_How_Much_Time", "Age", "Children", "Family_Size", "Is_Parent", "Education", "Living_With"]

for i in Personal:
    # jointplot creates its own figure, so no plt.figure() call is needed
    sns.jointplot(x=dataset[i], y=dataset["Spent"], hue=dataset["Clusters"], kind="kde", palette=pal)
    plt.show()

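Output:

Cluster 0

Definitely parents
At most four family members and at least two
Single parents are a subset of this group
Most have a teenager at home
Relatively older

Cluster 1

Definitely not parents
At most two family members
Slightly more couples than single people
Spans all ages
A high-income group

Cluster 2

Most of these customers are parents
At most three family members
Mostly have one child (typically not a teenager)
Relatively younger

Cluster 3

Definitely parents
At most five family members and at least two
Most have a teenager at home
Relatively older
A lower-income group

We performed unsupervised clustering using dimensionality reduction followed by agglomerative clustering. This produced four clusters, which we used to profile customers by family structure, income level, and spending habits. These profiles can be used to plan better marketing strategies.

In summary, customer segmentation is a key part of marketing strategy, and machine learning has become an increasingly popular way to automate it. By applying machine learning algorithms to large volumes of customer data, companies can quickly identify new trends and patterns, target specific customer segments with tailored promotions, and make better-informed marketing decisions. Its ability to process data quickly, reduce manual analysis, and improve as more data arrives makes machine learning a powerful tool for customer segmentation.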
