Java Web Crawler

August 5, 2025 | 7 min read

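A web crawler is a program that navigates the web to find new or updated pages for indexing. The crawler starts from a set of seed websites or popular URLs (the frontier) and searches in depth and breadth to extract hyperlinks.

A web crawler should be both polite and robust. Polite means that it obeys the rules set by robots.txt and avoids visiting a website too frequently. Robust means that it can avoid spider traps and other malicious behavior.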
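
As a rough illustration of the politeness rule, the sketch below checks a path against the Disallow rules of a robots.txt file. It is a simplified reading of the format under stated assumptions: the class name RobotsRules and its methods are hypothetical, only groups that name `User-agent: *` and plain `Disallow:` path prefixes are handled, and `Allow`, wildcards, and `Crawl-delay` are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a deliberately simplified robots.txt reader.
class RobotsRules {
    // Disallow prefixes that apply to every crawler ("User-agent: *")
    private final List<String> disallowed = new ArrayList<>();

    RobotsRules(String robotsTxt) {
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\n")) {
            // drop comments and surrounding whitespace
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                // a User-agent line after a rule line starts a new group
                if (!lastWasUserAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                // an empty Disallow value means "nothing is disallowed"
                if (field.equalsIgnoreCase("disallow") && inWildcardGroup && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    // A path is allowed unless it starts with one of the collected prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch /robots.txt once per host, build one such object from its text, and call isAllowed() on the path of each URL before requesting it.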

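These are the steps to create a web crawler:

First, pick a URL from the frontier (the queue of URLs waiting to be crawled).
Fetch the HTML code of that URL.
Parse the HTML code to get the links to other URLs.
Check whether the URL has already been crawled, and whether the same content has been seen before. If neither is the case, add the page to the index.
For each extracted URL, verify that it may be crawled (robots.txt rules, crawl frequency).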
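
The steps above can be sketched as a loop over a queue. To keep the sketch runnable without network access, fetching and parsing are replaced by a lookup in an in-memory link graph. The class name FrontierLoopSketch, the crawl method, and the link graph are assumptions made for illustration; in a real crawler the lookup would be an HTTP fetch plus HTML parsing, and the robots.txt and crawl-frequency checks would run before a link is queued.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the crawl loop; linkGraph stands in for "fetch and parse".
class FrontierLoopSketch {
    // Returns the URLs in the order in which they were indexed.
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> index = new ArrayList<>();

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            // pick the next URL from the frontier
            String url = frontier.remove();
            index.add(url);
            // "fetch and parse": look up the outgoing links of this page
            List<String> links = linkGraph.getOrDefault(url, Collections.emptyList());
            for (String link : links) {
                // add() returns false for URLs that were already queued or crawled
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return index;
    }
}
```

For a graph in which a links to b and c, and b and c both link to d, crawl("a", graph) indexes a, b, c, d in that order: breadth-first, each page once.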

We use jsoup, a Java HTML parsing library, by adding the following dependency to our pom.xml file.

<dependency> 
            <groupId>org.jsoup</groupId> 
            <artifactId>jsoup</artifactId> 
            <version>1.10.2</version> 
</dependency> 

Let's start with the basic code of a web crawler to understand how it works.

WebCrawlerExample.java

// package used by this tutorial's examples
package javaTpoint.javacodes;

// jsoup classes for fetching and parsing HTML
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// exception and collection classes
import java.io.IOException;
import java.util.HashSet;

// WebCrawlerExample shows how a basic crawler works and how to implement it in Java
public class WebCrawlerExample {
    // set that stores the URLs that have already been visited
    private final HashSet<String> urlLink;

    // initialize the set in the constructor
    public WebCrawlerExample() {
        urlLink = new HashSet<String>();
    }

    // getPageLinks() prints the given URL and recursively follows every link found on its page
    public void getPageLinks(String URL) {
        // skip empty or non-HTTP links (for example mailto: or javascript:), which cannot be fetched as pages
        if (!URL.startsWith("http://") && !URL.startsWith("https://")) {
            return;
        }
        // add() returns false if the URL is already in the set, so each URL is crawled only once
        if (urlLink.add(URL)) {
            System.out.println(URL);
            try {
                // fetch the HTML of the URL with connect() and get(), and store the result in a Document
                Document doc = Jsoup.connect(URL).get();
                // select() extracts every anchor element that has an href attribute
                Elements availableLinksOnPage = doc.select("a[href]");
                // repeat the process for each extracted link
                for (Element ele : availableLinksOnPage) {
                    // "abs:href" resolves relative links against the page URL
                    getPageLinks(ele.attr("abs:href"));
                }
            } catch (IOException e) {
                // report the failure and continue with the remaining links
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    // main() method start
    public static void main(String[] args) {
        WebCrawlerExample obj = new WebCrawlerExample();

        // pick a URL from the frontier and call the getPageLinks() method
        obj.getPageLinks("https://tpointtech.cn/digital-electronics");
    }
}

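Output

The program prints each URL the first time it is visited, one per line. The exact list depends on the live site.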
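
The visited-set check in the example compares URL strings exactly, so two spellings of the same page (with and without a fragment or a trailing slash, for instance) count as different URLs. One possible mitigation, sketched below with a hypothetical UrlNormalizer helper, is to normalize each URL before adding it to the set. The specific rules (lower-casing the scheme and host, dropping the fragment, stripping a trailing slash) are assumptions chosen for illustration; some sites treat such variants as distinct pages.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: reduces equivalent URL spellings to one canonical string.
class UrlNormalizer {
    // Returns the canonical form, or null if the input cannot be parsed
    // or is not an absolute URL with a host (for example mailto: links).
    static String normalize(String url) {
        try {
            // normalize() removes "." and ".." path segments
            URI uri = new URI(url.trim()).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) {
                return null;
            }
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            String path = uri.getRawPath() == null ? "" : uri.getRawPath();
            // treat "/docs/" and "/docs" (and "/" and "") as the same page
            if (path.endsWith("/")) {
                path = path.substring(0, path.length() - 1);
            }
            StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
            if (uri.getPort() != -1) {
                sb.append(':').append(uri.getPort());
            }
            sb.append(path);
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            // the fragment (#...) and any user info are intentionally left out
            return sb.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With such a helper, the crawler would store normalize(URL) in the set instead of the raw string and skip URLs for which it returns null.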

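Let's modify the code above to limit the depth of link extraction. The only difference between the previous code and this version is that it crawls URLs only up to a specified depth. The getPageLinks() method takes an additional integer argument that represents the depth of the link.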

WebCrawlerExampleWithDepth.java

// package used by this tutorial's examples
package javaTpoint.javacodes;

// jsoup classes for fetching and parsing HTML
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// exception and collection classes
import java.io.IOException;
import java.util.HashSet;

public class WebCrawlerExampleWithDepth {
    // pages at depth 0 up to MAX_DEPTH - 1 are crawled
    private static final int MAX_DEPTH = 2;
    // set that stores the URLs that have already been visited
    private final HashSet<String> urlLinks;

    // initialize the set in the constructor
    public WebCrawlerExampleWithDepth() {
        urlLinks = new HashSet<>();
    }

    // getPageLinks() visits the given URL and follows its links while depth is below MAX_DEPTH
    public void getPageLinks(String URL, int depth) {
        // skip empty or non-HTTP links (for example mailto: or javascript:), which cannot be fetched as pages
        if (!URL.startsWith("http://") && !URL.startsWith("https://")) {
            return;
        }
        // crawl only URLs that have not been visited yet and that lie within the depth limit
        if (!urlLinks.contains(URL) && depth < MAX_DEPTH) {
            System.out.println(">> Depth: " + depth + " [" + URL + "]");
            try {
                // remember the URL so that it is not crawled again
                urlLinks.add(URL);
                // fetch the HTML of the URL with connect() and get(), and store the result in a Document
                Document doc = Jsoup.connect(URL).get();
                // select() extracts every anchor element that has an href attribute
                Elements availableLinksOnPage = doc.select("a[href]");
                // links found on this page are one level deeper
                depth++;
                // repeat the process for each extracted link
                for (Element page : availableLinksOnPage) {
                    // "abs:href" resolves relative links against the page URL
                    getPageLinks(page.attr("abs:href"), depth);
                }
            } catch (IOException e) {
                // report the failure and continue with the remaining links
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    // main() method start
    public static void main(String[] args) {
        // create an instance of the WebCrawlerExampleWithDepth class
        WebCrawlerExampleWithDepth obj = new WebCrawlerExampleWithDepth();

        // pick a URL from the frontier and call getPageLinks() with 0 as the starting depth
        obj.getPageLinks("https://tpointtech.cn/digital-electronics/", 0);
    }
}

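Output

Each visited URL is printed on its own line in the form ">> Depth: N [URL]", where N is 0 for the start page and 1 for the pages it links to. The exact list depends on the live site.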
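
With a larger MAX_DEPTH, the recursive version can first reach a page along a long path, mark it as visited at the deepest allowed level, and then never expand its links even though a shorter path to it exists. A breadth-first variant that records the depth at which each URL is first reached avoids this, and it also avoids deep recursion. The sketch below is one way to write it under stated assumptions: the in-memory link graph stands in for fetching and parsing, and the names DepthLimitedCrawl and crawl are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: breadth-first crawl that visits pages at depth < maxDepth.
class DepthLimitedCrawl {
    // Returns each visited URL mapped to the depth at which it was first reached,
    // in visiting order.
    static Map<String, Integer> crawl(String seed, int maxDepth,
                                      Map<String, List<String>> linkGraph) {
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        if (maxDepth <= 0) {
            return depthOf; // nothing may be visited
        }
        depthOf.put(seed, 0);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            int nextDepth = depthOf.get(url) + 1;
            if (nextDepth >= maxDepth) {
                continue; // links of this page would exceed the limit
            }
            for (String link : linkGraph.getOrDefault(url, Collections.emptyList())) {
                // breadth-first order guarantees the first visit uses the shortest path
                if (!depthOf.containsKey(link)) {
                    depthOf.put(link, nextDepth);
                    frontier.add(link);
                }
            }
        }
        return depthOf;
    }
}
```

For a graph A -> B, C; B -> C; C -> D and a limit of 3, this visits D at depth 2 through C, whereas a depth-first walk that reaches C through B first would stop at C.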

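Difference between Data Crawling and Data Scraping

Data crawling and data scraping are both important concepts in data processing.

Data crawling means dealing with large data sets, where we develop our own crawler that crawls down to the deepest pages of the web.

Data scraping means retrieving data or information from any source.

Data Scraping	Data Crawling
Data scraping can extract data not only from the web but from any source.	Data crawling extracts data only from the web.
In data scraping, deduplication is not necessarily required.	In data crawling, deduplication is an essential part.
It can be done at any scale, small or large.	It is mostly done at large scale.
It requires both a crawl agent and a parser.	It requires only a crawl agent.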
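
Deduplication by content, as opposed to by URL, is commonly done by storing a fingerprint of each page. The sketch below hashes the page text with SHA-256 from the Java standard library and keeps the digests in a set. The ContentDeduplicator class and its isNew method are hypothetical names, and exact hashing is an assumption made for simplicity: it detects only identical text, not near-duplicates.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a fingerprint of every page text seen so far.
class ContentDeduplicator {
    private final Set<String> fingerprints = new HashSet<>();

    // Returns true the first time a given text is seen, false for exact repeats.
    boolean isNew(String pageText) {
        return fingerprints.add(sha256Hex(pageText));
    }

    // Computes the SHA-256 digest of the text as 64 lowercase hex digits.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            // %064x pads with leading zeros so that every fingerprint has the same length
            return String.format("%064x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            // every Java platform is required to provide SHA-256
            throw new IllegalStateException(e);
        }
    }
}
```

A crawler could call isNew() on the fetched page text and add the page to the index only when it returns true.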

Let's take another example that uses a Java web crawler to collect articles.

ExtractArticlesExample.java

// package used by this tutorial's examples
package javaTpoint.javacodes;

// jsoup classes for fetching and parsing HTML
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// exception, FileWriter and collection classes
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// ExtractArticlesExample shows how a crawler can collect article links
public class ExtractArticlesExample {
    // pages at depth 0 up to MAX_DEPTH - 1 are crawled
    private static final int MAX_DEPTH = 2;
    // stop collecting page links once this many have been stored
    private static final int MAX_PAGES = 50;
    // only URLs that start with this prefix are crawled
    private static final String SITE_PREFIX = "https://tpointtech.cn";

    // set of crawled page links and nested list of [title, link] pairs
    private final HashSet<String> urlLinks;
    private final List<List<String>> articles;

    // initialize the set and the list
    public ExtractArticlesExample() {
        urlLinks = new HashSet<>();
        articles = new ArrayList<>();
    }

    // collect all URLs that start with SITE_PREFIX and add them to the set
    public void getPageLinks(String URL, int depth) {
        // crawl only unvisited URLs of the target site, within the page and depth limits
        if (urlLinks.size() < MAX_PAGES && !urlLinks.contains(URL)
                && depth < MAX_DEPTH && URL.startsWith(SITE_PREFIX)) {
            System.out.println(">> Depth: " + depth + " [" + URL + "]");
            try {
                // remember the URL so that it is not crawled again
                urlLinks.add(URL);
                // fetch the HTML of the URL with connect() and get(), and store the result in a Document
                Document doc = Jsoup.connect(URL).get();
                // select() extracts every anchor element that has an href attribute
                Elements availableLinksOnPage = doc.select("a[href]");
                // links found on this page are one level deeper
                depth++;
                // repeat the process for each extracted link on the same site
                for (Element ele : availableLinksOnPage) {
                    String link = ele.attr("abs:href");
                    if (link.startsWith(SITE_PREFIX)) {
                        getPageLinks(link, depth);
                    }
                }
            } catch (IOException e) {
                // report the failure and continue with the remaining links
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    // connect to each collected page link and find the article links on that page
    public void getArticles() {
        for (String pageLink : urlLinks) {
            try {
                Document doc = Jsoup.connect(pageLink).get();
                Elements availableArticleLinks = doc.select("a[href]");

                for (Element ele : availableArticleLinks) {
                    // keep only links whose text contains "python" (contains() is case-sensitive)
                    if (ele.text().contains("python")) {
                        System.out.println(ele.text());
                        // temp holds one article as [title, link]
                        ArrayList<String> temp = new ArrayList<>();
                        temp.add(ele.text());           // title of the article
                        temp.add(ele.attr("abs:href")); // URL of the article
                        // add the pair to the nested article list
                        articles.add(temp);
                    }
                }
            } catch (IOException e) {
                // show the error message and continue with the next page
                System.err.println(e.getMessage());
            }
        }
    }

    // writeToFile() writes the collected articles to the given file
    public void writeToFile(String fName) {
        // try-with-resources closes the FileWriter even if a write fails
        try (FileWriter wr = new FileWriter(fName)) {
            for (List<String> entry : articles) {
                String article = "- Title: " + entry.get(0) + " (link: " + entry.get(1) + ")\n";
                // show the article and save it to the specified file
                System.out.println(article);
                wr.write(article);
            }
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }

    // main() method start
    public static void main(String[] args) {
        // create an instance of the ExtractArticlesExample class
        ExtractArticlesExample obj = new ExtractArticlesExample();

        // collect the page links reachable from the start URL
        obj.getPageLinks("https://tpointtech.cn", 0);

        // find the article links on the collected pages
        obj.getArticles();

        // write all the articles to the specified file
        obj.writeToFile("Web Crawler Example");
    }
}

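Output

The program first prints the crawled pages as ">> Depth: N [URL]" lines, then the text of each matching link, and finally one "- Title: ... (link: ...)" line per article, which is also written to the file "Web Crawler Example". The exact content depends on the live site.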
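
Because String.contains() is case-sensitive, the filter in getArticles() matches "python" but not "Python". If a case-insensitive match is wanted, one option is to lower-case both sides before comparing. The helper below is a hypothetical sketch (ArticleFilter, matchesKeyword, and filterTitles are not part of the example program) and works on plain strings so that it can be tried without jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: case-insensitive keyword matching for link texts.
class ArticleFilter {
    // Returns true if linkText contains keyword, ignoring upper and lower case.
    static boolean matchesKeyword(String linkText, String keyword) {
        // Locale.ROOT keeps the result independent of the machine's default locale
        return linkText.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    // Returns the titles that contain the keyword, in their original order.
    static List<String> filterTitles(List<String> titles, String keyword) {
        List<String> matches = new ArrayList<>();
        for (String title : titles) {
            if (matchesKeyword(title, keyword)) {
                matches.add(title);
            }
        }
        return matches;
    }
}
```

In getArticles(), the condition would then read ArticleFilter.matchesKeyword(ele.text(), "python").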
