Posted on Thursday, April 28, 2016 by Ionică Bizău
Sometimes we need to collect information from different web pages automagically. Obviously, a human is not needed for that. A smart script can do the job pretty well, especially if it's something repetitive. 💫
When there is no web-based API to share the data with our app, and we still want to extract some data from that website, we have to fall back to scraping. 💥
This means:

1. loading the page containing the data (a GET request is often enough)
2. parsing the HTML
3. extracting the data we need

In Node.js, all these three steps are quite easy because the functionality is already made for us in different modules, by different developers.
Because I often scrape random websites, I created yet another scraper: scrape-it, a Node.js scraper for humans. It's designed to be really simple to use while still being quite minimalist. ⚡️
Here is how I did it:
To load the web page, we need to use a library that makes HTTP(S) requests. There are a lot of modules that do that. As always, I recommend choosing simple/small modules; I wrote a tiny package that does it: tinyreq.
Using this module you can easily get the HTML rendered by the server from a web page:
const request = require("tinyreq");

request("http://ionicabizau.net/", function (err, body) {
    console.log(err || body); // Print out the HTML
});
tinyreq is actually a friendlier wrapper around the built-in http.request.
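For comparison, here is roughly what the same request looks like using only the built-in http module (a minimal sketch of the plumbing tinyreq hides for us):

const http = require("http");

// Collect the response chunks into a string, then print it
http.get("http://ionicabizau.net/", res => {
    let body = "";
    res.on("data", chunk => { body += chunk; });
    res.on("end", () => console.log(body));
}).on("error", err => console.error(err));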
Once we have a piece of HTML, we need to parse it. How to do that? 💭
Well, that's a complex thing, but other people already solved it for us. I very much like the cheerio module. It provides a jQuery-like interface to interact with a piece of HTML you already have.
const cheerio = require("cheerio");

// Parse the HTML
let $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Take the h2.title element and show its text
console.log($("h2.title").text());
// => Hello world
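Since the interface is jQuery-like, you can also read attributes and iterate over matches. A tiny sketch (the markup here is made up for illustration):

const cheerio = require("cheerio");

let $ = cheerio.load(
    '<ul class="links">'
  + '<li><a href="/a">First</a></li>'
  + '<li><a href="/b">Second</a></li>'
  + '</ul>'
);

// Iterate over the matched elements, jQuery-style
$(".links a").each((i, el) => {
    console.log($(el).text() + " -> " + $(el).attr("href"));
});
// => First -> /a
// => Second -> /b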
Because I like to modularize all the things, I created cheerio-req, which is basically tinyreq combined with cheerio (the previous two steps put together):
const cheerioReq = require("cheerio-req");

cheerioReq("http://ionicabizau.net", (err, $) => {
    console.log($(".header h1").text());
    // => Ionică Bizău
});
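Note that the snippet above ignores err for brevity; in real code you would check it before touching $. A minimal sketch:

const cheerioReq = require("cheerio-req");

cheerioReq("http://ionicabizau.net", (err, $) => {
    if (err) { return console.error(err); }
    console.log($(".header h1").text());
});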
Since we already know how to request and parse the HTML, the next step is to build a nice public interface that we can export as a module. ✨
Putting the previous steps together, we have this (follow the inline comments):
"use strict"
// Import the dependencies
const cheerio = require("cheerio")
, req = require("tinyreq")
;
// Define the scrape function
function scrape(url, data, cb) {
// 1. Create the request
req(url, (err, body) => {
if (err) { return cb(err); }
// 2. Parse the HTML
let $ = cheerio.load(body)
, pageData = {}
;
// 3. Extract the data
Object.keys(data).forEach(k => {
pageData[k] = $(data[k]).text();
});
// Send the data in the callback
cb(null, pageData);
});
}
// Extract some data from my website
scrape("http://ionicabizau.net", {
// Get the website title (from the top header)
title: ".header h1"
// ...and the description
, description: ".header h2"
}, (err, data) => {
console.log(err || data);
});
When running this code, we get the following output in the console:
{ title: 'Ionică Bizău',
  description: 'Web Developer, Linux geek and Musician' }
Hey! That's my website information, so it's working. We now have a small function that can get the text from any element on the page.
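For instance, we can point the same function at any other selector. A hypothetical reuse of the scrape function from above (the selector is just an assumption about my page's markup):

// Reuse the generic scrape function with a different selector
scrape("http://ionicabizau.net", {
    firstArticleTitle: ".article a.article-title"
}, (err, data) => {
    console.log(err || data);
});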
In the module I wrote, I made it possible to scrape lists of things (e.g. articles, pages etc.). So, to get the latest three articles from my blog, you can do:
const scrapeIt = require("scrape-it");

// Fetch the articles on the page (list)
scrapeIt("http://ionicabizau.net", {
    listItem: ".article"
  , name: "articles"
  , data: {
        createdAt: {
            selector: ".date"
          , convert: x => new Date(x)
        }
      , title: "a.article-title"
      , tags: {
            selector: ".tags"
          , convert: x => x.split("|").map(c => c.trim()).slice(1)
        }
      , content: {
            selector: ".article-content"
          , how: "html"
        }
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ] }
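Conceptually, the list support boils down to iterating over the listItem matches and applying the data selectors inside each one. Here is a rough sketch of the idea (this is not scrape-it's actual source, and the helper name is made up):

// Conceptual sketch of how listItem scraping works (not the real internals)
function scrapeList($, opts) {
    let results = [];
    $(opts.listItem).each((i, el) => {
        let $item = $(el)
          , entry = {};
        Object.keys(opts.data).forEach(k => {
            let field = opts.data[k]
              , selector = typeof field === "string" ? field : field.selector
              , $el = $item.find(selector)
              , value = field.how === "html" ? $el.html() : $el.text();
            entry[k] = field.convert ? field.convert(value) : value;
        });
        results.push(entry);
    });
    return results;
}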
Happy scraping! 😁
Have any questions? Feel free to ping me on Twitter.