Handling Missing Data in Python Pandas

January 5, 2025 | 3 min read

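Missing data is common in real-world datasets, and handling it well is essential for data analysis and machine learning. In Python, the Pandas library provides tools for cleaning, manipulating, and analyzing datasets that contain missing values.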

Introduction to Missing Data

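Data can go missing for many reasons, such as data entry errors, equipment failures, or intentional omission. Pandas marks missing values with NaN (Not a Number) in numeric columns, with None or NaN in object columns, and with NaT in datetime columns. Each marker means that the value is absent or undefined.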

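Identify and handle missing data before any analysis or modeling. Pandas provides methods to detect missing values, to drop them, and to fill them with replacement values chosen by a rule you specify.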
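To make the NaN representation concrete, the following minimal sketch shows how Pandas stores a None placed in a numeric column and why NaN needs a dedicated check. The variable name `numbers` is chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# None in a numeric column is stored as NaN, so the column becomes float64.
numbers = pd.Series([1, 2, None])
print(numbers.dtype)

# NaN never compares equal to itself, so an == test cannot find it.
print(np.nan == np.nan)

# pd.isna() is the reliable way to test for a missing value.
print(pd.isna(numbers).tolist())
```

This is why the detection methods in the next section exist: an ordinary equality comparison silently misses NaN.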

Detecting Missing Data

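The first step is to find out whether the dataset contains missing data. Pandas provides the `isnull()` and `notnull()` methods (also available under the names `isna()` and `notna()`) to detect missing values. Each returns a boolean mask with the same shape as the DataFrame or Series, indicating whether each value is missing.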

import pandas as pd

# Sample DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': ['a', None, 'c', 'd', 'e']}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())

Output

       A      B
0  False  False
1  False   True
2   True  False
3  False  False
4  False  False

`df.isnull()` returns a DataFrame that holds True where a value is missing and False where a value is present.
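A boolean mask is rarely the end goal. As a small sketch built on the same sample data, the mask can be reduced to a count of missing values per column or used to select the affected rows. The names `missing_per_column` and `rows_with_missing` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': ['a', None, 'c', 'd', 'e']})

# Summing a boolean mask counts its True values, i.e. the missing cells.
missing_per_column = df.isna().sum()
print(missing_per_column)

# any(axis=1) flags every row that contains at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing.index.tolist())
```

Here each column has one missing value, and the affected rows are those with index 1 and 2.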

Handling Missing Data

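Once missing data has been detected, there are several strategies for handling it. A common one is to drop the rows or columns that contain missing values with the `dropna()` method.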

# Remove rows with missing values
cleaned_df = df.dropna()
print(cleaned_df)

Output

     A  B
0  1.0  a
3  4.0  d
4  5.0  e

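By default, `dropna()` removes every row that contains at least one missing value. Set the `axis` parameter to 1 to drop columns that contain missing values instead. In this example both columns contain a missing value, so dropping columns leaves an empty DataFrame.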

# Remove columns with missing values
cleaned_df = df.dropna(axis=1)
print(cleaned_df)

Output

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
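Dropping every row or column with any gap can discard more data than intended, as the empty result above shows. The sketch below uses three other `dropna()` parameters, `subset`, `how`, and `thresh`, on the same sample data. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': ['a', None, 'c', 'd', 'e']})

# Drop a row only when column A is missing; gaps in B are tolerated.
by_subset = df.dropna(subset=['A'])
print(by_subset.index.tolist())

# how='all' drops a row only if every value in it is missing.
all_missing = df.dropna(how='all')
print(len(all_missing))

# thresh=2 keeps rows that have at least two non-missing values.
at_least_two = df.dropna(thresh=2)
print(at_least_two.index.tolist())
```

With this data, `subset=['A']` removes only row 2, `how='all'` removes nothing because no row is entirely empty, and `thresh=2` keeps rows 0, 3, and 4.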

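Another approach is to replace missing values with the `fillna()` method. For example, fill missing numeric data with the column mean:

# Fill missing numeric values with the column mean
# numeric_only=True skips the non-numeric column B, which has no mean
mean_fill_df = df.fillna(df.mean(numeric_only=True))
print(mean_fill_df)

Output

     A     B
0  1.0     a
1  2.0  None
2  3.0     c
3  4.0     d
4  5.0     e

Column B is not numeric, so its missing value is left unchanged.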

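Fill missing categorical data with the most frequent value in the column:

# Fill missing categorical values with the most frequent value
mode_fill_df = df.fillna(df.mode().iloc[0])
print(mode_fill_df)

Output

     A  B
0  1.0  a
1  2.0  a
2  1.0  c
3  4.0  d
4  5.0  e

Every value in this sample occurs exactly once, so `mode()` returns all of them in sorted order and `.iloc[0]` picks the smallest: 1.0 for column A and 'a' for column B. Note that this call fills the numeric column as well as the categorical one.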
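The two fills above apply one rule to the whole DataFrame. As a sketch of an alternative, `fillna()` also accepts a dict that gives each column its own replacement, and `ffill()` copies the previous valid value forward. The placeholder string 'unknown' is an assumption made for this example, not a Pandas convention.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': ['a', None, 'c', 'd', 'e']})

# A dict maps each column name to its own replacement value.
filled = df.fillna({'A': df['A'].mean(), 'B': 'unknown'})
print(filled)

# ffill() carries the last valid observation forward into each gap.
forward_filled = df.ffill()
print(forward_filled)
```

Forward filling suits ordered data such as time series, where the previous observation is a plausible stand-in. For unordered rows a per-column statistic is usually the safer choice.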

Imputing Missing Data

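In some cases it is more appropriate to impute missing values with a statistic learned from the data than to fill them with a fixed value. The scikit-learn library provides `SimpleImputer`, which imputes with the column mean, median, or most frequent value and accepts Pandas DataFrames as input.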

For example, impute missing numeric data with the column mean:

# Impute missing numeric values with the mean
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2D array of shape (n_rows, 1); flatten it before assigning
imputed_data = imputer.fit_transform(df[['A']])
df['A'] = imputed_data.ravel()
print(df)

Output

     A     B
0  1.0     a
1  2.0  None
2  3.0     c
3  4.0     d
4  5.0     e


Impute missing categorical data with the most frequent value in the column:

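# Impute missing categorical values with the most frequent value
from sklearn.impute import SimpleImputer

# Column B marks its missing entry with None rather than np.nan,
# so name None explicitly as the value to replace
imputer = SimpleImputer(missing_values=None, strategy='most_frequent')
imputed_data = imputer.fit_transform(df[['B']])
df['B'] = imputed_data.ravel()
print(df)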

Output

     A  B
0  1.0  a
1  2.0  a
2  3.0  c
3  4.0  d
4  5.0  e
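This section names the median as a third imputation statistic without showing it. The sketch below uses Pandas alone to do the equivalent of a median strategy, on a hypothetical column with one outlier (the value 100 is invented for this example) to show why the median can be the better choice.

```python
import pandas as pd

# Hypothetical sample: one missing value and one outlier.
df = pd.DataFrame({'A': [1, 2, None, 4, 100]})

# Both statistics skip missing values by default.
mean_value = df['A'].mean()
median_value = df['A'].median()
print(mean_value, median_value)

# Impute with the median, which the outlier does not distort.
df['A'] = df['A'].fillna(median_value)
print(df['A'].tolist())
```

The mean is pulled up to 26.75 by the single outlier, while the median stays at 3.0, a value closer to the bulk of the column.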

Conclusion

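Handling missing data is an essential part of data cleaning and preprocessing with Python Pandas. With `isnull()`, `dropna()`, and `fillna()`, along with imputation tools such as scikit-learn's `SimpleImputer`, you can manage missing values so that your analyses and machine learning models rest on reliable, complete data.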
