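Crawler in Node.js

February 27, 2025 | 5 min read

A web crawler is a program that browses the internet automatically: given a website URL, it fetches the page content and the links on that page. Also known as a spider or bot, it is useful for collecting data from many websites, and it is used for search engine indexing, data mining, and similar tasks. A crawler starts by visiting a seed URL, which serves as the starting point of the crawl. Depending on the use case, a crawler can be simple or complex. Crawlers can also be used for malicious purposes, such as scraping website data or carrying out DDoS attacks.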

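Key Points of a Web Crawler

Start from a seed URL.
Fetch the content of the page.
Parse the content.
Extract the links and follow them to further pages.
Store the extracted data.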
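
The steps above can be sketched as a small loop. The example below is only an illustration: it crawls a made-up in-memory "site" instead of the network, and the names crawlSite, pages, and getLinks are assumptions that do not appear in the application built later in this article.

```javascript
// A toy "web": each URL maps to the links found on that page.
// These URLs and pages are invented for illustration.
const pages = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/b', 'https://example.com/c'],
    'https://example.com/b': [],
    'https://example.com/c': ['https://example.com/'],
};

// Breadth-first crawl that starts at the seed URL and stops after maxDepth levels.
// getLinks(url) stands in for "fetch the page, parse it, extract its links".
function crawlSite(seedUrl, maxDepth, getLinks) {
    const visited = new Set();   // URLs that were already fetched
    const stored = [];           // "storage": one record per fetched page
    let frontier = [seedUrl];    // URLs to fetch at the current depth

    for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
        const next = [];
        for (const url of frontier) {
            if (visited.has(url)) continue;   // never fetch the same page twice
            visited.add(url);
            const links = getLinks(url);
            stored.push({ url, depth, links });
            next.push(...links);              // follow these links on the next level
        }
        frontier = next;
    }
    return stored;
}

const result = crawlSite('https://example.com/', 2, (url) => pages[url] || []);
console.log(result.map((page) => `${page.depth}: ${page.url}`));
```

The visited set is what keeps the crawler from looping forever when pages link back to each other, as the last page of the toy site does.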

Creating a Web Crawler Application with Node.js

Directory structure of the application

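Step 1: First, create a folder for the project, named 'web_crawler' here, and change into that directory.

Step 2: Next, run 'npm init' to create a package.json file and set options such as the application's author and entry file.

Step 3: Then install the required packages, for example with 'npm install axios body-parser cheerio ejs express'.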

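Here, the axios package sends requests to a server and receives its responses. It supports request interceptors, request cancellation, and automatic JSON transformation.
Body parser is middleware for Express that parses request bodies in different formats. It is mainly used to handle POST request data. The server code below uses Express's built-in express.urlencoded() and express.json() parsers instead, so this package is optional.
Cheerio implements the core of jQuery and is designed for manipulating HTML and XML documents on the server side.
EJS is used to embed JavaScript code in HTML templates.
Express is a web application framework for Node.js.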

Step 4: Next, create two files in the same directory: 'server.js' for the server and 'crawler.js' for the crawler code.

Step 5: Write the following code in the server.js file.

 
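// server.js: Express server that serves the form and runs the crawler.
const express = require('express');
const path = require('path');
const crawl = require('./crawler');

const app = express();
const port = 3000;

// Render result pages with EJS templates stored in ./views.
// Express loads the ejs package itself once it is set as the view engine.
app.set('view engine', 'ejs');
app.set('views', path.join(__dirname, 'views'));

// Parse URL-encoded form bodies and JSON bodies.
app.use(express.urlencoded({ extended: true }));
app.use(express.json());

// Serve static assets such as styles.css from ./public.
app.use(express.static(path.join(__dirname, 'public')));

// Home page: the form for the URL and the maximum depth.
app.get('/', (req, res) => {
    res.sendFile(path.join(__dirname, 'views', 'index.html'));
});

// Crawl the submitted URL and render the collected links and text.
app.post('/crawl', async (req, res) => {
    const url = req.body.url;
    const maxDepth = parseInt(req.body.maxDepth, 10) || 1;
    const result = await crawl(url, 1, maxDepth);
    // crawl() returns an empty result when the fetch fails, so check for data.
    if (result && (result.links.length > 0 || result.content)) {
        res.render('results', {
            url: url,
            links: result.links,
            content: result.content
        });
    } else {
        res.send(`<p>Error crawling the URL. Please try again.</p><a href="/">Back</a>`);
    }
});

app.listen(port, () => {
    console.log(`Server is running on http://localhost:${port}`);
});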
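
The /crawl route passes whatever the form submits straight to the crawler. As a hedged sketch, the input could be checked first. The helper below is hypothetical and not part of the original server.js: the name parseCrawlRequest and the depth cap of 3 are assumptions chosen for illustration. It uses only the built-in URL class.

```javascript
// Hypothetical helper: validates the form fields before crawling.
// The upper limit of 3 for maxDepth is an arbitrary assumption.
function parseCrawlRequest(body) {
    let parsed;
    try {
        // new URL() throws a TypeError for strings that are not absolute URLs.
        parsed = new URL(String(body.url || '').trim());
    } catch (error) {
        return { ok: false, reason: 'The URL is not valid.' };
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        return { ok: false, reason: 'Only http and https URLs are supported.' };
    }
    // Fall back to 1 when maxDepth is missing or not a number, and clamp it to 1..3.
    const depth = Number.parseInt(body.maxDepth, 10);
    const maxDepth = Number.isInteger(depth) ? Math.min(Math.max(depth, 1), 3) : 1;
    return { ok: true, url: parsed.href, maxDepth };
}

console.log(parseCrawlRequest({ url: 'https://example.com', maxDepth: '2' }));
console.log(parseCrawlRequest({ url: 'not a url' }));
```

Inside the route handler, a result with ok set to false could be answered with the error message instead of calling crawl(). Capping the depth matters because the number of pages fetched can grow very quickly with each extra level.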

Step 6: Now write the following code in crawler.js.

 
// crawler.js: fetches a page, extracts its text and links, and follows links up to maxDepth.
const axios = require('axios');
const cheerio = require('cheerio');
const { URL } = require('url');

async function crawl(baseUrl, depth = 1, maxDepth = 1, visited = new Set()) {
    // Stop when the depth limit is exceeded or the page was already crawled.
    if (depth > maxDepth || visited.has(baseUrl)) return { links: [], titles: [], content: '' };
    visited.add(baseUrl);
    try {
        // Download the page and load its HTML into Cheerio.
        const response = await axios.get(baseUrl);
        const html = response.data;
        const $ = cheerio.load(html);
        const data = {
            links: [],
            titles: [],
            content: $('body').text().trim(),
        };
        // Collect every anchor, resolving relative hrefs against the page URL.
        $('a').each((index, element) => {
            const href = $(element).attr('href');
            if (href) {
                try {
                    const absoluteUrl = new URL(href, baseUrl).href;
                    data.links.push({ link: absoluteUrl, title: $(element).text().trim() });
                } catch (error) {
                    console.error(`Invalid URL: ${href}`);
                }
            }
        });
        // Iterate over a copy: data.links grows inside the loop, and looping over the
        // live array would also crawl links added by nested calls, going deeper than maxDepth.
        const pageLinks = [...data.links];
        for (const link of pageLinks) {
            const nestedData = await crawl(link.link, depth + 1, maxDepth, visited);
            data.links.push(...nestedData.links);
        }
        return data;
    } catch (error) {
        console.error(`Error fetching ${baseUrl}:`, error.message);
        return { links: [], titles: [], content: '' };
    }
}

module.exports = crawl;
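
The link-extraction step in crawler.js resolves each href against the page URL and keeps everything it finds, including duplicates and non-web links such as mailto: addresses. The sketch below shows one possible way to tidy that list with the built-in URL class. The function name normalizeLinks and the choice to drop #fragments are assumptions for illustration, not part of the original crawler.

```javascript
// Hypothetical helper: resolves hrefs against a base URL, keeps only
// http/https links, removes #fragments, and drops duplicates.
function normalizeLinks(hrefs, baseUrl) {
    const seen = new Set();
    const result = [];
    for (const href of hrefs) {
        let resolved;
        try {
            resolved = new URL(href, baseUrl);   // handles relative and absolute hrefs
        } catch (error) {
            continue;                            // skip hrefs that cannot be parsed
        }
        if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') continue;
        resolved.hash = '';                      // "page#a" and "page#b" are the same page
        if (!seen.has(resolved.href)) {
            seen.add(resolved.href);
            result.push(resolved.href);
        }
    }
    return result;
}

const links = normalizeLinks(
    ['/about', 'contact.html#team', 'mailto:hi@example.com', 'https://example.com/about'],
    'https://example.com/docs/index.html'
);
console.log(links);
```

Filtering before the recursive step would spare the crawler requests that are bound to fail (axios cannot fetch a mailto: link) and repeated entries in the results table.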

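Step 7: Next, create another folder named 'views' for the HTML and EJS files. In it, create 'index.html', which contains a form where the user enters the website URL and the maximum depth, and 'results.ejs', which displays the results: a table of the links found on the site and the extracted text.

Step 8: Now use the following code in the 'index.html' file.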

 
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Simple Web Crawler</title>
    <link rel="stylesheet" href="/styles.css">
</head>
<body>
    <div class="container">
        <h1>Simple Web Crawler</h1>
        <form method="POST" action="/crawl" class="form">
            <label for="url">Enter URL to Crawl:</label>
            <input type="text" id="url" name="url" placeholder="https://example.com" required>
            <label for="maxDepth">Max Depth:</label>
            <input type="number" id="maxDepth" name="maxDepth" min="1" value="1">
            <button type="submit">Crawl</button>
        </form>
    </div>
</body>
</html>   

Step 9: Next, write the following code in the 'results.ejs' file.

 
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Crawler Results</title>
    <link rel="stylesheet" href="/styles.css">
</head>
<body>
    <h2>Crawled Data from <%= url %></h2>
    <h3>Following Links:</h3>
    <table>
        <thead>
            <tr>
                <th>Title</th>
                <th>Link</th>
            </tr>
        </thead>
        <tbody>
            <% links.forEach(link => { %>
                <tr>
                    <td><%= link.title %></td>
                    <td><a href="<%= link.link %>" target="_blank">Go to Link</a></td>
                </tr>
            <% }); %>
        </tbody>
    </table>
    <h3>Extracted Data:</h3>
    <textarea class="data1" rows="10" cols="80" readonly><%= content %></textarea>
    <br>
    <a class="button1" href="/">Crawl Another URL</a>
</body>
</html>   

Step 10: Now create a folder named 'public'. In it, create a file named 'styles.css' and use the following code to style the pages.

 
body {
    font-family: Arial, sans-serif;
    margin: 20px;
}
h2 {
    color: #333;
}
table {
    width: 100%;
    border-collapse: collapse;
    margin-bottom: 20px;
}
table, th, td {
    border: 1px solid #ddd;
}
th, td {
    padding: 10px;
    text-align: left;
}
th {
    background-color: #877878;
}
textarea {
    width: 100%;
    box-sizing: border-box;
}
a {
    color: #0066cc;
}
a:hover {
    text-decoration: underline;
}
.button1{
    background-color: #7ce180;
    border: none;
    color: white;
    padding: 15px 32px;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    font-size: 16px;
    margin: 4px 2px;
    cursor: pointer;
}
.data1{
    border: #0066cc solid 5px;
}
.container {
    max-width: 600px;
    margin: 0 auto;
    padding: 20px;
    background: #fff;
    border-radius: 8px;
    box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}
h1 {
    font-size: 24px;
    margin-bottom: 20px;
    color: #333;
    text-align: center;
}
.form {
    display: flex;
    flex-direction: column;
}
.form label {
    margin-bottom: 8px;
    font-weight: bold;
}
.form input[type="text"],
.form input[type="number"] {
    padding: 10px;
    border: 1px solid #ddd;
    border-radius: 4px;
    margin-bottom: 20px;
    font-size: 16px;
}
.form button {
    padding: 10px 15px;
    background-color: #007bff;
    color: white;
    border: none;
    border-radius: 4px;
    cursor: pointer;
    font-size: 16px;
}
.form button:hover {
    background-color: #0056b3;
}   

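Step 11: To run the application, use the command 'node server.js'.

Step 12: Finally, open 'http://localhost:3000/', enter the website URL and the maximum depth, and click the Crawl button. This takes you to the results page, which shows the links and text found on the site.

User Interface

Form for entering the website URL and maximum depth

Results page

Conclusion

In summary, this article introduced web crawlers and what they are useful for, and showed how to build one with Node.js, HTML, and Embedded JavaScript (EJS).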
