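How to Read the Contents of a PDF Using OCR (Optical Character Recognition) in Python

November 16, 2024 | 5 min read

Python is one of the most popular programming languages today. It is widely used for data analysis, but data is not always available in the format we need. In such cases, we can convert files from formats such as PDF or JPG to plain text (.txt) so that the data is easier to analyze. Many libraries are available for this kind of task.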

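We can use Python's PyPDF2 module to convert a .pdf file to text. The main drawback of this approach is the encoding scheme. A PDF document can contain text in various encodings, such as ASCII, UTF-8, and other Unicode encodings, so extracting the text directly can lose data.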
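For comparison, direct extraction might look like the sketch below. It assumes a PyPDF2 release that provides `PdfReader` and `extract_text()`. The helper names `extract_text_directly` and `needs_ocr`, and the 20-character threshold, are hypothetical and only illustrate when falling back to OCR could make sense.

```python
def extract_text_directly(pdf_path):
    """Extract the embedded text of every page without OCR."""
    # Imported lazily so needs_ocr() works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string, e.g. for scanned pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def needs_ocr(extracted_text, min_chars=20):
    """Hypothetical heuristic: very little text suggests scanned pages."""
    return len(extracted_text.strip()) < min_chars
```

If `needs_ocr(extract_text_directly("exp.pdf"))` returns True, the PDF probably has little or no usable embedded text, and an image-based approach is worth trying.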

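In this tutorial, we will learn how to read the contents of a PDF file using optical character recognition (OCR) and store them in a text (.txt) file.

First, we convert the pages of the PDF document to images. Then we use OCR to read the text from those images and write it to a text (.txt) file.

Required Modules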

The following modules need to be installed for this tutorial:

PIL (Pillow)

pytesseract

pdf2image

tesseract-ocr

(For this, the user needs Microsoft Visual C++ 14.0, which is available through "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/)
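Before running the tutorial code, it can help to check that everything is in place. The sketch below is one possible check, not part of the original tutorial. It assumes the Python packages are importable as `PIL`, `pytesseract`, and `pdf2image` (commonly installed from pip as Pillow, pytesseract, and pdf2image), that the Tesseract engine is on the PATH as `tesseract`, and that pdf2image can find Poppler's `pdftoppm` program. The function name `missing_requirements` is hypothetical.

```python
import importlib.util
import shutil


def missing_requirements(modules=("PIL", "pytesseract", "pdf2image"),
                         binaries=("tesseract", "pdftoppm")):
    """Return the names of required modules and programs that are not found."""
    # find_spec() returns None when a top-level module cannot be imported.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which() returns None when a program is not on the PATH.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


print("Missing requirements:", missing_requirements() or "none")
```

An empty result only means the names were found. It does not verify versions or language data files.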

Part 1

The first part converts the pages of the PDF into image files. Each page of the PDF file is stored as a separate image file, named as follows:

PDF page no. 1: page_no_1.jpg
PDF page no. 2: page_no_2.jpg
PDF page no. 3: page_no_3.jpg
PDF page no. 4: page_no_4.jpg
.
.
PDF page no. n: page_no_n.jpg
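The naming scheme above can be sketched as a small helper. The function names and the `out_dir` and `dpi` parameters are illustrative assumptions. `convert_from_path` is the pdf2image function that renders PDF pages as PIL images.

```python
import os


def page_image_name(page_number):
    """File name for a 1-based page number, e.g. page_no_3.jpg."""
    return "page_no_" + str(page_number) + ".jpg"


def save_pages_as_images(pdf_path, out_dir=".", dpi=300):
    """Save every page of the PDF as a JPEG and return the file paths."""
    # Imported lazily so page_image_name() can be used on its own.
    from pdf2image import convert_from_path

    paths = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        path = os.path.join(out_dir, page_image_name(number))
        page.save(path, "JPEG")
        paths.append(path)
    return paths
```

Keeping the name in one function means Part 1 and Part 2 cannot drift apart on spelling or spacing of the file names.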

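Part 2

The second part recognizes the text in the image files and writes it to a text file in ".txt" format. Here we process each image file to convert it into text. Once we have the text as a string variable, we can process it before writing it to the text (.txt) file. For example, in many PDF files, when the last word of a line does not fit on that line, a hyphen is added and the word continues on the next line. For example: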

This is an example to show the above explanation of the wo-
rd which cannot be written entirely in the same line and is conti-
nued in the next line. 

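For such words, we apply basic preprocessing that removes the hyphen and the line break so that the two parts form one complete word. After preprocessing, the text is written to a separate text file.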
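A minimal sketch of this preprocessing step follows. It uses a regular expression that joins the two halves only when letters sit on both sides of the hyphen, which is a slightly stricter variant than a plain string replacement. The function name `join_hyphenated_words` is hypothetical.

```python
import re


def join_hyphenated_words(text):
    """Remove a hyphen plus line break that splits a word across two lines."""
    # Only join when letters sit on both sides, so a dash followed by a
    # line break elsewhere in the text is left untouched.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[A-Za-z])", "", text)


sample = ("This is an example to show the above explanation of the wo-\n"
          "rd which cannot be written entirely in the same line and is conti-\n"
          "nued in the next line.")
print(join_hyphenated_words(sample))
```

Note that a genuine compound split at its hyphen (for example "well-" at a line end followed by "known") would also be merged, so the result may still need review.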

Code

from PIL import Image
import pytesseract as PT
from pdf2image import convert_from_path as CFP

'''
Part #1 - Convert each page of the PDF file to an image file
'''
# Path to the input PDF file
PDF_file_1 = "exp.pdf"

# Render all pages of the PDF as images. The second argument is the DPI;
# 300 is assumed here as a reasonable resolution for OCR.
pages_1 = CFP(PDF_file_1, 300)

# Counter used to number the image file of each page
image_counter1 = 1

# Iterate through all the pages of the PDF file
for page in pages_1:

    # File name for each page of the PDF file, saved as JPG:
    # PDF page no. 1: page_no_1.jpg
    # PDF page no. 2: page_no_2.jpg
    # ... and so on ...
    # PDF page no. n: page_no_n.jpg
    filename1 = "page_no_" + str(image_counter1) + ".jpg"

    # Save the image of the page to disk
    page.save(filename1, 'JPEG')

    # Increase the counter for the next file name
    image_counter1 = image_counter1 + 1

'''
Part #2 - Recognize the text content from the image files by using OCR
'''
# Total number of pages
filelimit1 = image_counter1 - 1

# Text file that receives the output
out_file1 = "output_text.txt"

# Open the output file in append mode so that the contents of all
# images are added to the same output file. The with block closes
# the file after all the text has been written.
with open(out_file1, "a", encoding="utf-8") as f_1:

    # Iterate from 1 to the total number of pages
    for K in range(1, filelimit1 + 1):

        # File name of the page image saved in Part #1
        filename1 = "page_no_" + str(K) + ".jpg"

        # Recognize the text in the image as a string using pytesseract
        text1 = str(PT.image_to_string(Image.open(filename1)))

        # Any string processing may be applied to the recognized text.
        # Basic formatting: join words split by a hyphen at a line end.
        text1 = text1.replace('-\n', '')

        # Write the processed text to the output file
        f_1.write(text1)
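The listing above writes every page image to disk before reading it back. As a hedged alternative, the sketch below passes the page images to pytesseract directly, relying on `image_to_string` accepting a PIL image object. The function names `join_pages` and `ocr_pdf_in_memory` and the `dpi` default are assumptions for illustration.

```python
def join_pages(page_texts):
    """Concatenate per-page OCR text after removing end-of-line hyphens."""
    return "".join(text.replace("-\n", "") for text in page_texts)


def ocr_pdf_in_memory(pdf_path, dpi=300):
    """OCR every page of a PDF without writing intermediate image files."""
    # Imported lazily so join_pages() can be tried without these packages.
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)
    # image_to_string accepts a PIL image object as well as a file path.
    return join_pages(pytesseract.image_to_string(page) for page in pages)
```

This avoids the intermediate JPEG files, at the cost of holding all rendered pages in memory at once.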

Output

Input PDF file

Output text file

As we can see, the pages of the PDF file were converted to images. The images were then read, and their contents were written to the text file.

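Advantages of Using the OCR Method

Users avoid text-based conversion, which can lose data because of the encoding scheme.
OCR can also recognize handwritten content in a PDF file.
The code can be modified to recognize only specific pages of the PDF.
Because the text is obtained as a variable, any amount of preprocessing can be applied to it.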
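Recognizing only specific pages could be sketched as follows, using the `first_page` and `last_page` parameters of pdf2image's `convert_from_path`. The function name `ocr_page_range` and the argument check are illustrative assumptions.

```python
def ocr_page_range(pdf_path, first_page, last_page, dpi=300):
    """Return OCR text for pages first_page..last_page (1-based, inclusive)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")

    # Imported lazily so the argument check works without these packages.
    import pytesseract
    from pdf2image import convert_from_path

    # Only the requested pages are rendered, which also saves time.
    pages = convert_from_path(pdf_path, dpi=dpi,
                              first_page=first_page, last_page=last_page)
    return [pytesseract.image_to_string(page) for page in pages]
```

For example, `ocr_page_range("exp.pdf", 2, 3)` would return a list with the recognized text of pages 2 and 3.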

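Disadvantages of Using the OCR Method

Disk space is needed to store the image files on the local system, although these files are usually small.
OCR does not guarantee 100% accuracy, although computer-generated PDF documents yield very high accuracy.
OCR can recognize handwritten content, but the accuracy depends on many factors, such as the color of the page and the clarity of the handwriting.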
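Since accuracy varies, one way to spot unreliable pages is to look at the confidence values that Tesseract reports. The sketch below assumes pytesseract's `image_to_data` with `Output.DICT`, whose `conf` list holds one value per detected element and uses -1 for elements without recognized text. The function names are hypothetical, and what counts as a "low" score is left to the reader.

```python
def mean_confidence(conf_values):
    """Average the confidences, ignoring -1 entries (no recognized text)."""
    # Values may arrive as ints, floats, or strings depending on the version.
    scores = [float(c) for c in conf_values]
    scores = [s for s in scores if s >= 0]
    return sum(scores) / len(scores) if scores else 0.0


def page_confidence(image):
    """Mean word confidence (0 to 100) for one PIL image."""
    # Imported lazily so mean_confidence() works without pytesseract.
    import pytesseract

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return mean_confidence(data["conf"])
```

Pages with a noticeably lower mean than the rest of the document are good candidates for manual review or for re-rendering at a higher DPI.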

Conclusion

In this tutorial, we discussed how to read the contents of a PDF file using OCR in Python.
