DBSCAN Algorithm in Python

17 Mar 2025 | 9 min read

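In this tutorial, we will learn how to implement and use the DBSCAN algorithm in Python.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that was first proposed in 1996. In 2014, it received the Test of Time Award at the data mining conference KDD. We will not study the DBSCAN algorithm itself in depth here; we only discuss its implementation in Python. To follow the implementation, however, we need at least a basic idea of the algorithm: DBSCAN groups points that lie in dense regions into clusters and labels points in sparse regions as noise. If you do not know what DBSCAN is or how it works, we recommend learning the algorithm and its working principle first.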
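To make the basic idea concrete, here is a minimal sketch in plain Python of the point types that DBSCAN's definition distinguishes. The function name `classify_points` and the toy coordinates are illustrative assumptions, not part of the tutorial's dataset or of any library. A point is treated as a core point when at least `min_samples` points (itself included) lie within distance `eps`, as a border point when it is not core but lies within `eps` of a core point, and as noise otherwise.

```python
from math import dist


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    # Neighbourhood of a point: indices of all points within eps, itself included.
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # A core point has at least min_samples points in its neighbourhood.
    is_core = [len(n) >= min_samples for n in neighbours]

    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            # Not dense enough itself, but reachable from a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Toy data: three points close together, one on the edge, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.22, 0.0), (5.0, 5.0)]
kinds = classify_points(points, eps=0.15, min_samples=3)
print(kinds)  # ['core', 'core', 'core', 'border', 'noise']
```

This brute-force version compares every pair of points, so it is only meant for reading; the implementation steps in this tutorial rely on scikit-learn instead.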

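Implementation of the DBSCAN Algorithm in Python

In this section, we implement the DBSCAN algorithm step by step so that it is easy to follow. We will use one dataset throughout the implementation and perform every operation (including the DBSCAN clustering itself) on it. Before starting, we should meet the prerequisites for implementing the DBSCAN algorithm in a Python program.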

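Prerequisites for Implementing the DBSCAN Algorithm

Before moving on to the implementation, we need to meet the following prerequisites:

1. NumPy library: We should make sure that a recent version of the NumPy library is installed on our system, because we use it to handle the dataset during the implementation. If NumPy is not installed, we can install it by running the following command in the command prompt or terminal: pip install numpy

When we press Enter, pip starts installing the NumPy library.

After a short while, pip reports that NumPy was installed successfully (or that the requirement is already satisfied, if the library was already present on the system).

2. pandas library: Like NumPy, the pandas library is required. If it is not installed, we can install it with pip using the following command: pip install pandas

3. Matplotlib library: This library is also important for the implementation, because its functions let us display the results for the dataset. If Matplotlib is not installed, we can install it with pip using the following command: pip install matplotlib

4. Scikit-learn (sklearn) library: Scikit-learn is the main requirement for this implementation, because we import several of its modules into the program, such as cluster, preprocessing, and decomposition. If it is not installed, we can install it with pip using the following command: pip install scikit-learn

5. Last but not least, we should understand the DBSCAN algorithm (what it is and how it works), as discussed earlier, so that we can easily follow its implementation in Python.

Before we continue, we should make sure that we have met all the prerequisites listed above, so that we do not run into problems while following the implementation steps.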
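One way to confirm the prerequisites is to ask Python which versions are installed. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_versions` and the `REQUIRED` list are illustrative assumptions, and the distribution names are the ones used on PyPI.

```python
from importlib import metadata

# Distribution names as published on PyPI (note: scikit-learn, not sklearn).
REQUIRED = ["numpy", "pandas", "matplotlib", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(REQUIRED)
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can then be installed with pip as described in the list above.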

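Implementation Steps for the DBSCAN Algorithm

Now we will implement the DBSCAN algorithm in Python. As mentioned earlier, we proceed step by step so that the implementation stays simple and easy to understand. To implement the DBSCAN algorithm and its logic in a Python program, we follow the steps below:

Step 1: Import all the required libraries

First of all, we need to import all the libraries that we installed in the prerequisites section, so that we can use their functions while implementing the DBSCAN algorithm.

Here, we import every required library or library module at the top of the program.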

# Importing numpy library as nmp
import numpy as nmp
# Importing pandas library as pds
import pandas as pds
# Importing matplotlib library as pplt
import matplotlib.pyplot as pplt
# Importing DBSCAN from cluster module of Sklearn library
from sklearn.cluster import DBSCAN
# Importing StandardScaler and normalize from preprocessing module of Sklearn library
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# Importing PCA from decomposition module of Sklearn
from sklearn.decomposition import PCA

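Step 2: Load the data

In this step, we load the dataset that the DBSCAN algorithm will process. To load the dataset into the program, we use the **read_csv()** function of the pandas library. We then drop the CUST_ID identifier column, fill the missing values, and print the first rows of the dataset, as shown below: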

# Loading the data inside an initialized variable
M = pds.read_csv('sampleDataset.csv') # Path of dataset file
# Dropping the CUST_ID column from the dataset with drop() function
M = M.drop('CUST_ID', axis = 1)
# Using ffill() to forward-fill missing values
# (fillna(method='ffill') is deprecated in recent pandas versions)
M = M.ffill()
# Printing dataset head in output
print(M.head())

Output

       BALANCE  BALANCE_FREQUENCY  ...  PRC_FULL_PAYMENT  TENURE
0    40.900749           0.818182  ...          0.000000      12
1  3202.467416           0.909091  ...          0.222222      12
2  2495.148862           1.000000  ...          0.000000      12
3  1666.670542           0.636364  ...          0.000000      12
4   817.714335           1.000000  ...          0.000000      12

[5 rows x 17 columns]

When we run the program, it prints the first rows of the dataset shown above. This is the data we will work with in the following steps.
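Because the tutorial's CSV file is not reproduced here, the following sketch uses a small made-up DataFrame (the values and the reduced column set are assumptions) to show what dropping the identifier column and forward-filling do. Counting missing values before and after is a quick sanity check.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's dataset; the values are made up.
df = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [40.9, np.nan, 2495.1, 1666.7],
    "TENURE": [12, 12, np.nan, 12],
})

# Drop the identifier column, which carries no information for clustering.
features = df.drop("CUST_ID", axis=1)

# Count missing values per column before filling.
missing_before = features.isna().sum()

# Forward fill: each missing value takes the last valid value above it.
filled = features.ffill()
missing_after = filled.isna().sum()

print(missing_before)
print(filled)
```

Note that a forward fill cannot fill a missing value in the very first row, since there is nothing above it to copy, so checking `missing_after` is worthwhile on real data.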

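Step 3: Preprocess the data

In this step, we preprocess the dataset with functions from the preprocessing module of the Sklearn library. We first standardize every feature with StandardScaler and then rescale every row to unit length with normalize(), as shown below: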

# Initializing a variable with the StandardScaler() function
scalerFD = StandardScaler()
# Standardizing each column of the dataset to zero mean and unit variance
M_scaled = scalerFD.fit_transform(M)
# Rescaling each row (sample) to unit L2 norm with the normalize() function
# Note: this does not make the data follow a Gaussian distribution
M_normalized = normalize(M_scaled)
# Converting the resulting numpy array into a pandas DataFrame
M_normalized = pds.DataFrame(M_normalized)
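To see what these two preprocessing calls do, the sketch below applies them to synthetic data (the random values and feature scales are assumptions chosen only for illustration) and checks the result: after `StandardScaler`, every column has mean 0 and standard deviation 1, and after `normalize`, every row has Euclidean length 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Synthetic data: three features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0, -5.0], scale=[1.0, 50.0, 0.1], size=(100, 3))

# Column-wise standardization: zero mean, unit variance per feature.
X_scaled = StandardScaler().fit_transform(X)

# Row-wise normalization: each sample is rescaled to unit L2 norm.
X_normalized = normalize(X_scaled)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
row_norms = np.linalg.norm(X_normalized, axis=1)

print("column means:", np.round(col_means, 6))
print("column stds: ", np.round(col_stds, 6))
print("row norms (first 5):", np.round(row_norms[:5], 6))
```

Standardization keeps features with large ranges from dominating the distance computation, which matters for DBSCAN because its `eps` parameter is a distance threshold.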

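Step 4: Reduce the dimensionality of the data

In this step, we reduce the dimensionality of the scaled and normalized data so that it can be visualized easily in a two-dimensional plot. We use PCA to transform the data and reduce it to two components, as shown below: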

# Initializing a variable with the PCA() function
pcaFD = PCA(n_components = 2) # components of data
# Transforming the normalized data with PCA
M_principal = pcaFD.fit_transform(M_normalized)
# Making dataframes from the transformed data
M_principal = pds.DataFrame(M_principal)
# Creating two columns in the transformed data
M_principal.columns = ['C1', 'C2']
# Printing the head of the transformed data
print(M_principal.head())

Output

         C1        C2
0 -0.489949 -0.679976
1 -0.519099  0.544828
2  0.330633  0.268877
3 -0.481656 -0.097610
4 -0.563512 -0.482506

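As we can see in the output, PCA transformed the normalized data into two components, which appear as the two columns C1 and C2. We used the **DataFrame()** constructor of the pandas library to turn the transformed data into a data frame.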
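Reducing to two components discards some information, so it can be useful to check how much variance the two components retain. The sketch below does this on synthetic data (the data generation is an assumption for illustration) using the `explained_variance_ratio_` attribute of a fitted `PCA` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 5-feature data whose variance lies mostly in two latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance kept by the two components.
retained = pca.explained_variance_ratio_.sum()

print("shape after PCA:", X_2d.shape)
print("variance retained:", round(float(retained), 4))
```

If the retained fraction is low on a real dataset, the two-dimensional plot may hide structure that exists in the original feature space, and clusters found in the reduced data should be interpreted with that in mind.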

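Step 5: Build the clustering model

This is the most important step of the implementation, because here we build the clustering model for the data we have prepared. We use the DBSCAN class of the Sklearn library, as shown below: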

# Creating clustering model of the data using the DBSCAN function and providing parameters in it
db_default = DBSCAN(eps = 0.0375, min_samples = 3).fit(M_principal)
# Labelling the clusters we have created in the dataset
labeling = db_default.labels_
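The `labels_` array holds one cluster index per data point, with -1 marking noise. A small sketch on synthetic data (the blobs, outliers, and parameter values are assumptions for illustration, not the tutorial's dataset) shows how to count the clusters and noise points a fitted model produced:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two tight blobs plus two far-away outliers.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=(1.0, 1.0), scale=0.05, size=(50, 2))
outliers = np.array([[3.0, 3.0], [-3.0, 3.0]])
X = np.vstack([blob_a, blob_b, outliers])

# Fit DBSCAN; eps and min_samples are chosen for this toy data only.
labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_

# Noise points carry the label -1 and do not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:  ", n_noise)
```

Running the same two counting lines on `labeling` from the step above tells us how many colours the visualization in the next step needs.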

Step 6: Visualize the clustering model

# Visualization of clustering model by giving different colours
# This mapping assumes DBSCAN produced only the labels 0, 1, 2 and -1 (noise)
colours = {}
# First colour in visualization is green
colours[0] = 'g'
# Second colour in visualization is black
colours[1] = 'k'
# Third colour in visualization is red
colours[2] = 'r'
# Last colour in visualization is blue (noise points, label -1)
colours[-1] = 'b'
# Creating a colour vector for each data point in the dataset cluster
cvec = [colours[label] for label in labeling]
# Fitting the size of the figure with figure function
# (created first, so that everything below is drawn on this figure)
pplt.figure(figsize =(9, 9))
# Scattering the data points, with C1 on the X-Axis and C2 on the Y-Axis
pplt.scatter(M_principal['C1'], M_principal['C2'], c = cvec)
# Construction of the legend: empty scatter plots act as colour handles,
# so no extra points are drawn over the data
g = pplt.scatter([], [], color ='g')
k = pplt.scatter([], [], color ='k')
r = pplt.scatter([], [], color ='r')
b = pplt.scatter([], [], color ='b')
# Building the legend with the coloured handles and their labels
pplt.legend((g, k, r, b), ('Label M.0', 'Label M.1', 'Label M.2', 'Label M.-1'))
# Showing Visualization in the output
pplt.show()

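Output

As we can see in the output, we plotted the data points of the dataset and visualized the clusters by marking the points of each cluster with a different colour.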
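The colour dictionary in this step is written for a fixed set of labels, so it raises a KeyError if DBSCAN returns a label that is not listed. As an alternative, here is a sketch of a plotting helper that handles any number of clusters. The function name `plot_clusters`, the toy points, and the choice of the "tab10" colormap are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(points, labels):
    """Scatter 2-D points with one colour per label; noise (-1) is black."""
    points = np.asarray(points)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(9, 9))
    cmap = plt.get_cmap("tab10")
    for label in sorted(set(labels.tolist())):
        mask = labels == label
        # Colours repeat after ten clusters because tab10 has ten entries.
        colour = "k" if label == -1 else cmap(label % 10)
        name = "Noise" if label == -1 else f"Cluster {label}"
        ax.scatter(points[mask, 0], points[mask, 1], color=colour, label=name)
    ax.set_xlabel("C1")
    ax.set_ylabel("C2")
    ax.legend(loc="upper left")
    return fig, ax


# Toy example: two small clusters and one noise point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [3.0, 3.0]])
lbls = np.array([0, 0, 1, 1, -1])
fig, ax = plot_clusters(pts, lbls)
legend_texts = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_texts)  # ['Noise', 'Cluster 0', 'Cluster 1']
```

With the tutorial's variables, the call would be `plot_clusters(M_principal[['C1', 'C2']].to_numpy(), labeling)`. Remove the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure interactively.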

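Step 7: Tune the parameters

In this step, we tune the model by changing the parameters that we passed to DBSCAN earlier. Here min_samples is raised from 3 to 50 while eps stays at 0.0375, as shown below: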

# Tuning the parameters of the model inside the DBSCAN function
dts = DBSCAN(eps = 0.0375, min_samples = 50).fit(M_principal)
# Labelling the clusters of data points
labeling = dts.labels_
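To get a feeling for what raising `min_samples` does, the sketch below fits DBSCAN twice on synthetic data containing one large and one small group (the data, the `eps` value, and the helper name `summarize` are assumptions for illustration). With a high `min_samples`, a group that has fewer points than the threshold can no longer form a cluster and is reported as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def summarize(labels):
    """Return (number of clusters, number of noise points) for a label array."""
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Synthetic data: one large dense group and one small group.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.05, size=(200, 2)),  # 200 points
    rng.normal(loc=(1.0, 1.0), scale=0.05, size=(20, 2)),   # 20 points
])

# Same eps, two different density thresholds.
results = {}
for min_samples in (3, 50):
    labels = DBSCAN(eps=0.3, min_samples=min_samples).fit(X).labels_
    results[min_samples] = summarize(labels)
    print(f"min_samples={min_samples}: clusters, noise = {results[min_samples]}")
```

On this toy data the small group forms its own cluster at `min_samples = 3` and becomes noise at `min_samples = 50`; how the tutorial's dataset reacts depends on its own density structure.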

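Step 8: Visualize the changes

After tuning the parameters of the clustering model, we visualize the resulting changes in the clusters by marking the data points with different colours, just as we did before.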

# Labelling with different colours
# This mapping assumes the tuned model produced only the labels 0 to 5 and -1 (noise)
colours1 = {}
# labelling with Red colour
colours1[0] = 'r'
# labelling with Green colour
colours1[1] = 'g'
# labelling with Blue colour
colours1[2] = 'b'
# labelling with Cyan colour
colours1[3] = 'c'
# labelling with Yellow colour
colours1[4] = 'y'
# Magenta colour
colours1[5] = 'm'
# labelling with Black colour (noise points, label -1)
colours1[-1] = 'k'
# Labelling the data points with the colour variable we have defined
cvec = [colours1[label] for label in labeling]
# Defining all colours that we will use
colors = ['r', 'g', 'b', 'c', 'y', 'm', 'k']
# Fitting the size of the figure with figure function
# (created first, so that everything below is drawn on this figure)
pplt.figure(figsize =(9, 9))
# Scattering column 1 into X-axis and column 2 into y-axis
pplt.scatter(M_principal['C1'], M_principal['C2'], c = cvec)
# Empty scatter plots, one per colour, act as legend handles
# so no extra points are drawn over the data
r = pplt.scatter([], [], marker ='o', color = colors[0])
g = pplt.scatter([], [], marker ='o', color = colors[1])
b = pplt.scatter([], [], marker ='o', color = colors[2])
c = pplt.scatter([], [], marker ='o', color = colors[3])
y = pplt.scatter([], [], marker ='o', color = colors[4])
m = pplt.scatter([], [], marker ='o', color = colors[5])
k = pplt.scatter([], [], marker ='o', color = colors[6])
# Constructing a legend with the colours we have defined
pplt.legend((r, g, b, c, y, m, k),
           ('Label M.0', 'Label M.1', 'Label M.2', 'Label M.3', 'Label M.4','Label M.5', 'Label M.-1'), # Using different labels for data points
           scatterpoints = 1, # Number of marker points per legend entry
           loc ='upper left', # Location of the legend
           ncol = 3, # Number of columns
           fontsize = 10) # Size of the font
# Displaying the visualisation of changes in cluster scattering
pplt.show()

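Output

By looking at the output, we can clearly observe how the cluster scatter plot changed after tuning the parameters passed to DBSCAN. Observing these changes also helps us understand how the DBSCAN algorithm works and how it helps visualize the clusters present in a dataset.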
