PDFBox Reading Text

17 Mar 2025 | 2 min read

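One of the main features of the PDFBox library is fast and accurate extraction of text from existing PDF documents. In this section, we will learn how to read text from an existing document using the PDFBox library in a Java program. A PDF document can contain text, animations, and images as its content; here we are interested only in the text. We can extract it with the getText() method of the PDFTextStripper class.

Follow these steps to read text from an existing PDF document: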

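Load the PDF Document

We can load an existing PDF document with the static load() method of the PDDocument class. The method accepts a File object as a parameter and, because it is static, is invoked on the class name. Note that this is the PDFBox 2.x API; in PDFBox 3.x, loading moved to Loader.loadPDF(file).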

File file = new File("Path of Document"); 
PDDocument doc = PDDocument.load(file); 
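
Before handing a file to load(), it can help to confirm that the path points to something that at least looks like a PDF. The following is a minimal sketch that uses only the Java standard library, not PDFBox. The class name PdfHeaderCheck and its methods are hypothetical helpers introduced for illustration. The check is deliberately strict: it requires the "%PDF-" marker at the very start of the file, whereas some lenient parsers also accept files with a few leading bytes before the marker.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class PdfHeaderCheck {

    // The marker that opens a PDF file header, e.g. "%PDF-1.4".
    static final byte[] MARKER = {'%', 'P', 'D', 'F', '-'};

    // Returns true if the given bytes start with the "%PDF-" marker.
    public static boolean hasPdfHeader(byte[] head) {
        if (head == null || head.length < MARKER.length) {
            return false;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (head[i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first bytes of a regular file and checks them for the marker.
    public static boolean looksLikePdf(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return false;
        }
        byte[] head = new byte[MARKER.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // file is shorter than the marker
                }
                filled += n;
            }
        }
        return filled == head.length && hasPdfHeader(head);
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny stand-in file so the sketch runs without a real PDF.
        Path sample = Files.createTempFile("sample", ".pdf");
        Files.write(sample, new byte[] {'%', 'P', 'D', 'F', '-', '1', '.', '4', '\n'});
        System.out.println(looksLikePdf(sample));
        Files.delete(sample);
    }
}
```

A passing check does not guarantee that the document is well formed; load() can still throw an IOException for a damaged file, so the check is only a cheap early filter.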

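Instantiate the PDFTextStripper Class

The PDFTextStripper class is used to retrieve text from a PDF document. We can instantiate this class with its no-argument constructor, as follows:

PDFTextStripper pdfStripper = new PDFTextStripper();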

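Retrieve the Text

The getText() method reads the text content of a PDF document. We pass the document object to it as a parameter, and it returns the text as a String object.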
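
Because getText() returns a single String, any further processing is ordinary string handling. As a minimal sketch, the following splits such a string into trimmed, non-empty lines. It uses only the Java standard library; the class TextLines, the method nonEmptyLines, and the sample text are hypothetical and stand in for the value returned by getText().

```java
import java.util.ArrayList;
import java.util.List;

public class TextLines {

    // Splits text on any line break and keeps the trimmed, non-empty lines.
    public static List<String> nonEmptyLines(String text) {
        List<String> lines = new ArrayList<>();
        // "\\R" matches any line-break sequence, including "\r\n" and "\n".
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfStripper.getText(doc).
        String text = "Title\r\n\r\n  First paragraph.  \nSecond paragraph.\n";
        for (String line : nonEmptyLines(text)) {
            System.out.println(line);
        }
    }
}
```

Working with a list of clean lines makes later steps, such as searching the text for a pattern, simpler than scanning the raw string.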

Close the Document

After completing the task, we need to close the PDDocument object with the close() method.
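
A manual close() call is skipped if an earlier statement throws an exception. PDDocument implements java.io.Closeable, so a try-with-resources statement can call close() for us in every case. The sketch below shows the pattern with a hypothetical stand-in class, FakeDocument, so that it runs without PDFBox on the classpath; in a real program the resource would be the PDDocument itself.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseDemo {

    // Hypothetical stand-in for PDDocument; records whether close() ran.
    static class FakeDocument implements Closeable {
        boolean closed = false;

        String read(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("simulated read failure");
            }
            return "some text";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Uses the document inside try-with-resources and reports whether
    // close() was called, including when read() throws.
    public static boolean closedAfterUse(boolean fail) {
        FakeDocument doc = new FakeDocument();
        try (FakeDocument d = doc) {
            d.read(fail);
        } catch (IOException e) {
            // The resource has already been closed when this block runs.
        }
        return doc.closed;
    }

    public static void main(String[] args) {
        System.out.println(closedAfterUse(false));
        System.out.println(closedAfterUse(true));
    }
}
```

With PDFBox 2.x, the equivalent form would be try (PDDocument doc = PDDocument.load(file)) { ... }, with the extraction code placed inside the block.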

Example

Suppose we have a PDF document from which we want to extract the text content using the PDFBox library in a Java program.

Java Program

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {

	public static void main(String[] args) throws IOException {

		// Loading an existing document
		File file = new File("/eclipse-workspace/blank.pdf");
		PDDocument doc = PDDocument.load(file);

		try {
			// Instantiate PDFTextStripper class
			PDFTextStripper pdfStripper = new PDFTextStripper();

			// Retrieving text from PDF document
			String text = pdfStripper.getText(doc);
			System.out.println("Text in PDF\n---------------------------------");
			System.out.println(text);
		} finally {
			// Closing the document, even if text extraction fails
			doc.close();
		}
	}
}

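Output

After successful execution, the above program retrieves the text from the PDF document and prints it to the console beneath the "Text in PDF" header line.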
