How to Read Multiple Text Files from a Folder in Python?

January 5, 2025 | 4 min read

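Introduction

Python is a versatile, powerful programming language with a large collection of tools and libraries for many kinds of tasks. A common scenario in data processing and analysis is reading multiple text files from a folder. Whether you are working on natural language processing, data cleaning, or any other data-centric task, handling a collection of text files efficiently matters. This article explores different methods and techniques for reading multiple text files from a folder in Python.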

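Understanding the Basics

Before diving into the code, it helps to understand the basic components involved in reading multiple text files. Python's built-in modules os and glob, together with the open() function, do most of the work.

The os module provides access to operating-system-dependent functionality, including file and directory operations. The glob module finds path names that match a specified pattern, following the rules used by the Unix shell.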

import os
import glob

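Using os.listdir()

os.listdir() returns a list containing the names of the entries in the specified directory. We can use it to get the names in a folder, then loop over that list and read each file.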

folder_path = '/path/to/folder'
files = os.listdir(folder_path)

for file_name in files:
    if file_name.endswith('.txt'):
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'r') as file:
            content = file.read()
            # Process the content as needed

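This approach is straightforward and works well for small tasks. However, os.listdir() builds the complete list of names in memory before the loop starts, so it may not be the most efficient choice for folders that contain a very large number of files.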
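To make the loop above reusable, it can be wrapped in a function. The following is a minimal sketch, not a definitive implementation: the function name read_text_files, the utf-8 default encoding, and the temporary demo folder are all assumptions introduced for illustration.

```python
import os
import tempfile


def read_text_files(folder_path, encoding="utf-8"):
    """Return a dict mapping each .txt file name in folder_path to its contents."""
    contents = {}
    # sorted() gives a deterministic order; os.listdir() returns names in arbitrary order.
    for file_name in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, file_name)
        # Skip subdirectories and anything that is not a .txt file.
        if file_name.endswith(".txt") and os.path.isfile(file_path):
            with open(file_path, "r", encoding=encoding) as file:
                contents[file_name] = file.read()
    return contents


# Demonstration on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_folder:
    for name, text in [("a.txt", "alpha"), ("b.txt", "beta"), ("notes.md", "ignored")]:
        with open(os.path.join(demo_folder, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_contents = read_text_files(demo_folder)

print(demo_contents)  # {'a.txt': 'alpha', 'b.txt': 'beta'}
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.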

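Using glob.glob()

glob.glob() simplifies getting a list of files that match a specified pattern. It is particularly useful when you only want files of one type, such as text files.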

folder_path = '/path/to/folder'
files = glob.glob(os.path.join(folder_path, '*.txt'))

for file_path in files:
    with open(file_path, 'r') as file:
        content = file.read()
        # Process the content as needed

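By combining the *.txt pattern with glob.glob(), we ensure that the list contains only paths ending in .txt, which makes this approach more selective and removes the need for a separate endswith() check.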
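If the text files are spread across subfolders, the same pattern idea can be extended with the ** wildcard and recursive=True. This is a hedged sketch: the helper name find_text_files and the demo folder layout are assumptions, not part of the original example.

```python
import glob
import os
import tempfile


def find_text_files(folder_path, recursive=False):
    """Return a sorted list of .txt paths under folder_path."""
    if recursive:
        # With recursive=True, '**' matches zero or more nested directories.
        pattern = os.path.join(folder_path, "**", "*.txt")
    else:
        pattern = os.path.join(folder_path, "*.txt")
    return sorted(glob.glob(pattern, recursive=recursive))


# Demonstration: one file at the top level, one inside a subfolder.
with tempfile.TemporaryDirectory() as demo_folder:
    os.mkdir(os.path.join(demo_folder, "sub"))
    for relative_path in ["top.txt", os.path.join("sub", "nested.txt")]:
        with open(os.path.join(demo_folder, relative_path), "w", encoding="utf-8") as f:
            f.write("sample text")
    flat_paths = find_text_files(demo_folder)
    all_paths = find_text_files(demo_folder, recursive=True)

print([os.path.basename(p) for p in flat_paths])
print(sorted(os.path.basename(p) for p in all_paths))
```

The non-recursive call only sees top.txt, while the recursive call also finds nested.txt inside the subfolder.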

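Using os.scandir() for Better Performance

os.scandir(), available since Python 3.5, offers a more efficient way to iterate over the contents of a directory than os.listdir(). It returns an iterator of os.DirEntry objects, each representing one entry in the directory. Using it in a with statement, as in the example here, requires Python 3.6 or later.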

folder_path = '/path/to/folder'

with os.scandir(folder_path) as entries:
    for entry in entries:
        if entry.is_file() and entry.name.endswith('.txt'):
            file_path = entry.path
            with open(file_path, 'r') as file:
                content = file.read()
                # Process the content as needed

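This method is recommended for folders with many entries: it yields entries one at a time instead of building a full list, and entry.is_file() can usually answer from information gathered during the directory scan, without an extra system call.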
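The memory benefit is easiest to keep when the files themselves are also read lazily. The sketch below pairs os.scandir() with line-by-line reading; the names iter_text_files and count_lines, the utf-8 encoding, and the demo files are assumptions made for illustration.

```python
import os
import tempfile


def iter_text_files(folder_path):
    """Yield the path of each regular .txt file in folder_path, one at a time."""
    with os.scandir(folder_path) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".txt"):
                yield entry.path


def count_lines(folder_path, encoding="utf-8"):
    """Count the lines in each file without holding a whole file in memory."""
    counts = {}
    for file_path in iter_text_files(folder_path):
        with open(file_path, "r", encoding=encoding) as file:
            # Iterating over the file object reads one line at a time.
            counts[os.path.basename(file_path)] = sum(1 for _ in file)
    return counts


# Demonstration: two text files plus a directory whose name ends in .txt.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "one.txt"), "w", encoding="utf-8") as f:
        f.write("a\n")
    with open(os.path.join(demo_folder, "three.txt"), "w", encoding="utf-8") as f:
        f.write("a\nb\nc\n")
    os.mkdir(os.path.join(demo_folder, "folder.txt"))  # skipped by is_file()
    demo_counts = count_lines(demo_folder)

print(demo_counts)
```

The directory named folder.txt shows why the entry.is_file() check matters: a name ending in .txt is not necessarily a file.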

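Using concurrent.futures for Parallel Processing

As datasets grow, processing files one after another can become a bottleneck. Python's concurrent.futures module provides a high-level interface for executing callables asynchronously. It can be used to read and process several files at the same time, which can improve overall throughput.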

import concurrent.futures

folder_path = '/path/to/folder'
files = glob.glob(os.path.join(folder_path, '*.txt'))

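def process_file(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
    # Process the content as needed, then return the result
    return content

with concurrent.futures.ThreadPoolExecutor() as executor:
    # list() consumes the lazy iterator returned by map(), so an exception
    # raised inside process_file is re-raised here instead of being lost
    contents = list(executor.map(process_file, files))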

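ThreadPoolExecutor runs process_file across the files concurrently. Be careful with multithreading in Python: because of the Global Interpreter Lock (GIL), threads may bring little performance benefit for CPU-bound tasks. For I/O-bound tasks such as reading files, however, multithreading can be very effective.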
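When individual files may fail to open, it can help to submit each file separately and collect results and errors per path. The following is one possible sketch using executor.submit() and concurrent.futures.as_completed(); the function names, the max_workers value of 4, and the demo files are assumptions chosen for illustration.

```python
import concurrent.futures
import os
import tempfile


def read_file(file_path, encoding="utf-8"):
    """Read one file and return its contents."""
    with open(file_path, "r", encoding=encoding) as file:
        return file.read()


def read_files_concurrently(file_paths, max_workers=4):
    """Read files in a thread pool; return (contents by path, errors by path)."""
    results = {}
    errors = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which path each future belongs to.
        future_to_path = {executor.submit(read_file, p): p for p in file_paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except OSError as exc:
                # One unreadable file does not stop the others.
                errors[path] = exc
    return results, errors


# Demonstration: three readable files and one path that does not exist.
with tempfile.TemporaryDirectory() as demo_folder:
    demo_paths = []
    for i in range(3):
        path = os.path.join(demo_folder, f"file_{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"contents of file {i}")
        demo_paths.append(path)
    missing_path = os.path.join(demo_folder, "missing.txt")
    demo_results, demo_errors = read_files_concurrently(demo_paths + [missing_path])

print(len(demo_results), "files read,", len(demo_errors), "failed")
```

Because as_completed() yields futures in completion order, the results are stored in a dict keyed by path rather than in a list whose order would vary between runs.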

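Handling Exceptions and Errors

When performing file I/O, handling exceptions properly is essential. FileNotFoundError and PermissionError are common problems you may run into while working with files.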

for file_path in files:
    try:
        with open(file_path, 'r') as file:
            content = file.read()
            # Process the content as needed
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except PermissionError:
        print(f"Permission error for file: {file_path}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

Sample output

File not found: /path/to/folder/nonexistent_file.txt
Permission error for file: /path/to/folder/protected_file.txt
An unexpected error occurred: [specific error message]

This keeps your program from crashing when something unexpected happens and provides meaningful error messages for debugging.
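Text files can also fail to decode, which raises UnicodeDecodeError rather than an OSError subclass. The sketch below returns an error message instead of printing it, so the caller decides what to do; the helper name safe_read, the utf-8 default, and the demo files are assumptions for illustration.

```python
import os
import tempfile


def safe_read(file_path, encoding="utf-8"):
    """Return (content, None) on success or (None, error message) on failure."""
    try:
        with open(file_path, "r", encoding=encoding) as file:
            return file.read(), None
    except FileNotFoundError:
        return None, f"File not found: {file_path}"
    except PermissionError:
        return None, f"Permission error for file: {file_path}"
    except UnicodeDecodeError:
        # Raised when the bytes on disk are not valid in the chosen encoding.
        return None, f"Could not decode as {encoding}: {file_path}"


# Demonstration: a readable file, a missing file, and a file with invalid UTF-8 bytes.
with tempfile.TemporaryDirectory() as demo_folder:
    good_path = os.path.join(demo_folder, "good.txt")
    with open(good_path, "w", encoding="utf-8") as f:
        f.write("hello")
    bad_path = os.path.join(demo_folder, "bad.txt")
    with open(bad_path, "wb") as f:
        f.write(b"\xff\xfe\x00")  # 0xff never appears in valid UTF-8
    missing_path = os.path.join(demo_folder, "missing.txt")

    good_result = safe_read(good_path)
    bad_result = safe_read(bad_path)
    missing_result = safe_read(missing_path)

print(good_result)
print(bad_result[1])
print(missing_result[1])
```

Catching specific exception types first keeps the messages precise; a broad except Exception clause is best kept as a last resort.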

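Conclusion

Reading multiple text files from a folder is a common task in Python, and there are several ways to do it. Depending on the size of the dataset and your specific requirements, you can choose the method that fits best. The os and glob modules, combined with proper exception handling, let you process and analyze large numbers of text files efficiently, making Python a strong choice for data-centric tasks.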
