
Node.js Crawler System

article 2025/2/24 22:28:31

Course outline

Introduction to crawlers and the robots protocol
Setting up the crawler development environment
Crawler in practice

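Introduction to crawlers and the robots protocol

A crawler is a program that automatically fetches web page content. Crawlers are a core component of search engines, so search engine optimization is largely optimization aimed at crawlers.

robots.txt is a plain-text file placed at the root of a site. It is a convention rather than an enforceable command, and it is the first file a well-behaved crawler should look at. The file tells crawlers which files on the server may be accessed, and a search robot determines its crawl scope from its contents.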
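To make the idea concrete, here is a minimal sketch of how a crawler might read robots.txt rules before visiting a path. It is a deliberate simplification: it only understands User-agent and Disallow lines with plain path prefixes, and it ignores Allow lines, wildcards, and the rule that a specific user-agent group overrides the `*` group. The function names `parseDisallowRules` and `isAllowed`, the sample file, and the `my-spider` agent name are all hypothetical and chosen for illustration.

```javascript
// Simplified robots.txt handling: a sketch, not a full implementation.
// Collects the Disallow prefixes that apply to the given user agent.
function parseDisallowRules(robotsTxt, userAgent) {
    var rules = [];
    var applies = false; // true while we are inside a group that matches us
    robotsTxt.split(/\r?\n/).forEach(function(line) {
        line = line.replace(/#.*$/, '').trim(); // strip comments
        var idx = line.indexOf(':');
        if (idx === -1) return; // skip blank or malformed lines
        var field = line.slice(0, idx).trim().toLowerCase();
        var value = line.slice(idx + 1).trim();
        if (field === 'user-agent') {
            applies = (value === '*' ||
                value.toLowerCase() === userAgent.toLowerCase());
        } else if (field === 'disallow' && applies && value !== '') {
            rules.push(value);
        }
    });
    return rules;
}

// A path is allowed when it starts with none of the disallowed prefixes.
function isAllowed(path, rules) {
    return !rules.some(function(prefix) {
        return path.indexOf(prefix) === 0;
    });
}

// Hypothetical robots.txt content used only for this example.
var sample = [
    'User-agent: *',
    'Disallow: /admin/',
    'Disallow: /tmp/'
].join('\n');

var rules = parseDisallowRules(sample, 'my-spider');
console.log(isAllowed('/index.html', rules));  // true
console.log(isAllowed('/admin/users', rules)); // false
```

In a real crawler the file content would come from fetching `/robots.txt` on the target host first, and a maintained parser would likely be a safer choice than this hand-written one.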

Setting up the crawler development environment

Node modules required:

Express
Request
Cheerio

This article uses express to create the project:

mkdir spider
cd spider
npm init
npm install express request cheerio

# Or create the project with the express command (provided by the express-generator package)
express spider
cd spider
npm install
npm install request cheerio

Crawler in practice

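var express = require('express');
var request = require('request'); // note: the request package is deprecated, but it still works
var cheerio = require('cheerio');

var app = express();

app.get('/', function(req, res) {
    // Fetch the target page; the callback receives the error, the response, and the HTML body
    request('http://www.google.com', function(error, response, body) {
        if (error || response.statusCode !== 200) {
            // Always answer the client, otherwise the incoming request hangs
            return res.status(502).send('failed to fetch page');
        }
        console.log(body);
        // $ is now a jQuery-like selector loaded with the whole fetched document
        var $ = cheerio.load(body);
        res.send('hello world');
    });
});

app.listen(3000);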