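PDFBox Extracting Phone Numbers

March 17, 2025 | 3 min read

The PDFBox library offers a range of features, including text extraction, which makes it possible to pull phone numbers out of an existing PDF document. In this section, we will learn how to read phone numbers from an existing document using the PDFBox library in a Java program. A PDF document may also contain other content such as text, animations, and images.

Follow these steps to extract phone numbers from an existing PDF document:

Load the PDF Document

We can load an existing PDF document with the static load() method, which accepts a File object as its parameter. Because the method is static, we call it on the PDDocument class itself. Note that PDDocument.load() belongs to PDFBox 2.x; PDFBox 3.x replaces it with Loader.loadPDF().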

File file = new File("Path of Document"); 
PDDocument doc = PDDocument.load(file); 

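Instantiate the StringBuilder and PDFTextStripper Classes

The StringBuilder and PDFTextStripper classes are used to retrieve text from the PDF document: PDFTextStripper extracts the text through its getText() method, and StringBuilder holds the result. We can instantiate them as follows: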

StringBuilder sb = new StringBuilder();			
PDFTextStripper stripper = new PDFTextStripper();

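Set the Pattern for the Phone Number

The Pattern describes the format of the phone numbers we are looking for. In our example, we are looking for 10-digit numbers with a whitespace character on each side. The pattern is created with Pattern.compile(), as shown in the full program in the Example section.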
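
A minimal sketch of this step, using only java.util.regex. The class name PhonePatternSketch, the helper containsPhone(), and the sample strings are illustrative assumptions rather than part of the tutorial; the expression matches the same text as the one in the full program, written with the {10} quantifier instead of ten repeated \\d tokens.

```java
import java.util.regex.Pattern;

class PhonePatternSketch {
    // One whitespace character, exactly ten digits, one whitespace character.
    static final Pattern PHONE = Pattern.compile("\\s\\d{10}\\s");

    // Returns true if the text contains at least one whitespace-delimited 10-digit number.
    static boolean containsPhone(CharSequence text) {
        return PHONE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsPhone("Call 9876543210 today"));     // true
        System.out.println(containsPhone("Order 98765432101 shipped")); // false: 11 digits
        System.out.println(containsPhone("9876543210"));                // false: no surrounding whitespace
    }
}
```

The last call shows a consequence of requiring whitespace on both sides: a number at the very start or end of the text is not matched.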

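Retrieve the Phone Numbers

We retrieve the phone numbers with a Matcher, which applies the pattern to the actual text being searched. Each call to find() advances to the next match, and group() returns the text of that match, which we print.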

Matcher m = p.matcher(sb);
while (m.find()) {
    System.out.println(m.group());
}
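
A pattern that consumes the whitespace on both sides returns each number with that whitespace attached, and when two numbers are separated by a single space the second one is skipped, because the shared space has already been consumed. The sketch below is one possible alternative, not the tutorial's method: it uses lookbehind and lookahead so the boundaries are checked without being consumed. The class name PhoneMatcherSketch, the method findPhones(), and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneMatcherSketch {
    // Ten digits that are neither preceded nor followed by a non-whitespace character.
    // (?<!\S) and (?!\S) also succeed at the very start and end of the text.
    static final Pattern PHONE = Pattern.compile("(?<!\\S)\\d{10}(?!\\S)");

    // Collects every match instead of printing it, so the caller decides what to do next.
    static List<String> findPhones(CharSequence text) {
        List<String> phones = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            phones.add(m.group());
        }
        return phones;
    }

    public static void main(String[] args) {
        System.out.println(findPhones("9876543210 9123456780\nref 12345"));
        // prints [9876543210, 9123456780]
    }
}
```

Returning a list rather than printing inside the loop also makes the matching logic easy to check without a PDF file.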

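Close the Document

After completing the task, we need to close the PDDocument object using the close() method.

Example

Consider a PDF document that contains text along with phone numbers, from which we want to extract only the phone numbers. Here we assume that each phone number is 10 digits long. We can do this using the PDFBox library in a Java program.

Java Program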

import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractPhone {

    public static void main(String[] args) throws IOException {

        // PDF file from which the phone numbers are extracted
        File fileName = new File("/eclipse-workspace/phone.pdf");

        // StringBuilder to store the extracted text
        StringBuilder sb = new StringBuilder();

        // try-with-resources calls doc.close() automatically, even if extraction fails
        try (PDDocument doc = PDDocument.load(fileName)) {
            PDFTextStripper stripper = new PDFTextStripper();

            // Add text to the StringBuilder from the PDF
            sb.append(stripper.getText(doc));
        }

        // Regex: the Pattern describes the format we are looking for. In our example,
        // that is a 10-digit number with a whitespace character on each side.
        Pattern p = Pattern.compile("\\s\\d{10}\\s");

        // Matcher applies the pattern to the extracted text
        Matcher m = p.matcher(sb);
        while (m.find()) {
            // group() returns the text matched by the most recent find() call
            System.out.println(m.group());
        }

        System.out.println("\nPhone Number is extracted");
    }
}

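Output

After successful execution of the above program, each matched phone number is printed to the console, followed by a confirmation message.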
