Ionică Bizău

Web Developer, Linux geek and Musician

How to write a web scraper in Node.js

Sometimes we need to collect information from different web pages automagically. Obviously, a human is not needed for that. A smart script can do the job pretty well, especially if it's something repetitive. :dizzy:

scrape-it

When there is no web-based API to share the data with our app, and we still want to extract some data from that website, we have to fall back to scraping. :boom:

This means:

  1. We load the page (a GET request is often enough)
  2. Parse the HTML result
  3. Extract the needed data

In Node.js, all three of these steps are quite easy because the functionality has already been written for us in different modules, by different developers.

Because I often scrape random websites, I created yet another scraper: scrape-it, a Node.js scraper for humans. It's designed to be really simple to use and is still quite minimalist. :zap:

Here is how I did it:

1. Load the page

To load the web page, we need a library that makes HTTP(S) requests. There are a lot of modules that do that. As always, I recommend choosing simple, small modules, so I wrote a tiny package that does it: tinyreq.

Using this module you can easily get the HTML rendered by the server from a web page:

const request = require("tinyreq");

request("http://ionicabizau.net/", function (err, body) {
    console.log(err || body); // Print out the HTML
});

tinyreq is actually a friendlier wrapper around the native http.request built-in solution.
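
For the curious, a wrapper of this kind could be sketched roughly as follows. This is not tinyreq's actual source: the `fetchHtml` name is made up for the illustration, and the sketch assumes a plain GET request, does not follow redirects, and treats any non-2xx status as an error.

```javascript
const http = require("http");
const https = require("https");

// Hypothetical helper: GET a URL and pass the response body to a callback
function fetchHtml(url, cb) {
    // Pick the built-in client that matches the protocol
    const client = url.startsWith("https:") ? https : http;

    client.get(url, res => {
        // Treat anything outside 2xx as a failure (an assumption of this sketch)
        if (res.statusCode < 200 || res.statusCode >= 300) {
            res.resume(); // Discard the body so the socket is released
            return cb(new Error("Request failed with status " + res.statusCode));
        }

        // Collect the body chunks as UTF-8 text
        res.setEncoding("utf8");
        let body = "";
        res.on("data", chunk => { body += chunk; });
        res.on("end", () => cb(null, body));
    }).on("error", cb);
}

// Example usage (performs a real network request):
// fetchHtml("http://ionicabizau.net/", (err, body) => {
//     console.log(err || body);
// });
```

The callback has the same `(err, body)` shape as in the tinyreq example above, so the two are interchangeable for simple pages.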

Once we have a piece of HTML, we need to parse it. How to do that? :thought_balloon:

2. Parsing the HTML

Well, that's a complex thing, but other people have already done it for us. I like the cheerio module very much. It provides a jQuery-like interface to interact with a piece of HTML you already have.

const cheerio = require("cheerio");

// Parse the HTML (single quotes outside, so the attribute quotes don't end the string)
let $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Take the h2.title element and show the text
console.log($("h2.title").text());
// => Hello world

Because I like to modularize all the things, I created cheerio-req, which is tinyreq combined with cheerio (basically the previous two steps put together):

const cheerioReq = require("cheerio-req");

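cheerioReq("http://ionicabizau.net", (err, $) => {
    // Stop here if the request failed ($ is not available then)
    if (err) { return console.error(err); }
    console.log($(".header h1").text());
    // => Ionică Bizău
});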

Since we already know how to parse the HTML, the next step is to build a nice public interface we can export into a module. :sparkles:

3. Extract the needed data

Putting the previous steps together, we have this (follow the inline comments):

"use strict"

// Import the dependencies
const cheerio = require("cheerio")
    , req = require("tinyreq")
    ;

// Define the scrape function
function scrape(url, data, cb) {
    // 1. Create the request
    req(url, (err, body) => {
        if (err) { return cb(err); }

        // 2. Parse the HTML
        let $ = cheerio.load(body)
          , pageData = {}
          ;

        // 3. Extract the data
        Object.keys(data).forEach(k => {
            pageData[k] = $(data[k]).text();
        });

        // Send the data in the callback
        cb(null, pageData);
    });
}

// Extract some data from my website
scrape("http://ionicabizau.net", {
    // Get the website title (from the top header)
    title: ".header h1"
    // ...and the description
  , description: ".header h2"
}, (err, data) => {
    console.log(err || data);
});

When running this code, we get the following output in the console:

{ title: 'Ionică Bizău',
  description: 'Web Developer,  Linux geek and  Musician' }

Hey! That's my website information, so it's working. We now have a small function that can get the text from any element on the page.


In the module I have written, I made it possible to scrape lists of things (e.g. articles, pages, etc.).

So, to get the latest 3 articles on my blog, you can do:

const scrapeIt = require("scrape-it");

// Fetch the articles on the page (list)
scrapeIt("http://ionicabizau.net", {
    listItem: ".article"
  , name: "articles"
  , data: {
        createdAt: {
            selector: ".date"
          , convert: x => new Date(x)
        }
      , title: "a.article-title"
      , tags: {
            selector: ".tags"
          , convert: x => x.split("|").map(c => c.trim()).slice(1)
        }
      , content: {
            selector: ".article-content"
          , how: "html"
        }
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ] }
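
The output above only shows `tags: [Object]`, so here is a small sketch of what the `tags` converter does when run on its own. The input string is made up for the illustration (the real text of the `.tags` element may look different); the function body is the same as in the example above.

```javascript
// Same converter as in the scrape-it options above
const toTags = x => x.split("|").map(c => c.trim()).slice(1);

// Hypothetical text content of a `.tags` element
const sample = "Tags | nodejs | scraping";

// Split on "|", trim each piece, then drop the first piece (the label)
console.log(toTags(sample));
// => [ 'nodejs', 'scraping' ]
```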

Happy scraping! :grin:


Pi Day, Raspberry Pi and Command Line

Everyone knows (or should know) what the number Pi is. Its value is defined as the ratio of a circle's circumference to its diameter. That's Pi! The nice thing about it is its irrational nature: its decimal expansion is infinite and never repeats.

Usually, Pi is approximated as 3.14. And today is the 14th of March! So, happy Pi Day! :)

I have a Raspberry Pi computer around, and I was thinking of using two of my libraries to create something nice: displaying the Raspberry Pi logo and the number Pi in the command line. It's funny how that basically stands for Raspberry Pi (the raspberry image and the number Pi). I wanted one more thing: using the decimals of Pi as the output characters. Here's the result (see below how I did it):

How I did it

To display the images in the terminal, I used image-to-ascii. To pass different image urls, I decided to use command line arguments.

To get the first n decimals of the pi number, I used another module I created this time last year: pi. This module returns a good approximation of pi:

const pi = require("pi");

console.log(pi(10));
// => '3.141592653'
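
To give an idea of how many decimals of Pi can be produced in plain JavaScript, here is a minimal sketch using `BigInt` and Machin's formula (pi = 16·arctan(1/5) − 4·arctan(1/239)). This is not how the pi module is implemented (its internals are not covered here); the `piDigits` and `arctanInverse` names are made up for the illustration. Note that `piDigits(n)` counts the decimals after the point and truncates rather than rounds.

```javascript
// arctan(1/x) scaled by `scale`, computed with the Taylor series
// 1/x - 1/(3x^3) + 1/(5x^5) - ... using integer arithmetic only
function arctanInverse(x, scale) {
    const bx = BigInt(x);
    const x2 = bx * bx;
    let term = scale / bx; // scale / x^1
    let sum = term;
    let n = 1n;
    let sign = -1n;
    while (term !== 0n) {
        term = term / x2;  // next odd power of 1/x
        n += 2n;
        sum += sign * (term / n);
        sign = -sign;
    }
    return sum;
}

// Returns Pi as a string with `decimals` digits after the point (truncated)
function piDigits(decimals) {
    const guard = 10; // extra digits that absorb the truncation errors
    const scale = 10n ** BigInt(decimals + guard);
    const scaledPi = 16n * arctanInverse(5, scale) - 4n * arctanInverse(239, scale);
    const digits = (scaledPi / 10n ** BigInt(guard)).toString();
    return decimals > 0 ? digits[0] + "." + digits.slice(1) : digits;
}

console.log(piDigits(10));
// => '3.1415926535'
```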

So, I created a file named index.js and I wrote the following stuff in it (follow the inline comments):

// Require the needed dependencies
// `pi` will be used to return the first `n` decimals of pi
const pi = require("pi")

      // image-to-ascii for displaying the images in the terminal
    , img = require("image-to-ascii")

      // We use this module to stringify the pixel matrix after
      // modifying the internal data (basically, the characters)
    , stringify = require("asciify-pixel-matrix")
    ;

// Take the image url/path from the command line arguments
img(process.argv[2], {
    // We turn off the stringifying, since we really want to do
    // some changes before displaying the images
    stringify: false
  , concat: false
}, (err, converted) => {
    // Handle the errors
    if (err) { return console.error(err); }

    // `converted` is an array of arrays (in fact, a matrix of pixels)
    // We use the `converted` matrix to know how many decimals we
    // need: width x height
    // `piNumber` will be a string in this format: "3.14...." (with a
    // lot of decimals)
    var piNumber = pi(converted.length * converted[0].length);

    // We will use this `i` variable to get the current index
    var i = -1;

    // For each row in the matrix
    converted.forEach(cRow => {
        // ...and for each pixel in the row
        cRow.forEach(px => {
            // ...update the character using a pi decimal, in order
            px.char = piNumber[i = ++i % piNumber.length];
        });
    });

    // Finally, stringify everything and display the result! Yay!
    console.log(stringify.stringifyMatrix(converted));
});
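
To see the index trick in isolation, here is a small sketch that runs the same loop on a hand-made 2x3 matrix instead of real image-to-ascii output. The `fillWithDigits` name and the sample matrix are made up for the illustration. As in the script above, the decimal point is used as a character too, and the index wraps around when the digits run out.

```javascript
// Replace the `char` of every pixel with the next character of `digits`,
// starting again from the beginning when the string is exhausted
function fillWithDigits(matrix, digits) {
    let i = -1;
    matrix.forEach(row => {
        row.forEach(px => {
            i = (i + 1) % digits.length;
            px.char = digits[i];
        });
    });
    return matrix;
}

// Hypothetical pixel matrix: 2 rows x 3 columns
const matrix = [
    [{ char: "#" }, { char: "#" }, { char: "#" }],
    [{ char: "#" }, { char: "#" }, { char: "#" }]
];

// A deliberately short digit string, to show the wrap-around
fillWithDigits(matrix, "3.14");

const rows = matrix.map(row => row.map(px => px.char).join(""));
console.log(rows);
// => [ '3.1', '43.' ]
```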

The requirements to run this script are:

  • Node.js (I installed it on my Raspberry Pi using nvm)
  • graphicsmagick: sudo apt-get install graphicsmagick (this is optional, kind of: if it's not available image-to-ascii will compile some C/C++ stuff, but it will probably take a long time)
  • ...and of course the npm dependencies: npm i image-to-ascii pi asciify-pixel-matrix

Happy Pi Day!

PS: I posted this article using my Raspberry Pi, connected to the Internet and using a 7" display. Just perfect. :)


How I ported Memory Blocks to modern web

Playing computer games is a lot of fun. Playing games that improve brain function and performance is even better. The first computer game I ever played was one of these. :zap:

It happened when I was 6 years old. I visited a friend in the neighboring village. There were not many computers in those days, but my friend had one. He asked me to play a computer game he thought I would enjoy. And I did! It was my favorite computer game at the time (back in 2001).

The game I played was Memory Blocks, which is part of the Symantec Game Pack, created by Charles Timmerman—founder of Funster and author of over 90 puzzle books.

Since then I have done a lot of things. One of them is that I became a web developer. So, I thought: it would be interesting to bring this Windows game back to life so that everyone can play it in their browsers (on any platform). Many people love oldies. Not long ago I created a COBOL bridge for Node.js and the feedback was amazing.

And I did it! You can play it by clicking here:

Below is how I did it:

:dizzy: Step 1: Modularize all the things!

Instead of having a monolithic project doing one thing I preferred to separate the core part into a separate library. That's how match.js appeared. Using match.js programmers can develop similar memory blocks games.

OctiMatch is a good example: it has the same rules, but it uses the GitHub Octicons as images.

:fire: Step 2: Steal the images from the original game

I installed Windows in a virtual machine and took screenshots and cropped the images that should be displayed in the yellow blocks. There are 30 distinct images.

:sparkles: Step 3: Use match.js to build the actual game

I used the library I created to build this game: basically passing the paths to the images and some options. This was probably the simplest part. I got an initial working version. The next step was to make it look exactly like in the original game.

:tophat: Step 4: CSS Magic

I worked together with @tonkec on this. She made the amazing 3D block spinning animations. It was tricky to do that but we did it. Btw, she has some nice CodePens! :art:

:bar_chart: Step 5: High Scores and other features

I implemented the high scores functionality and UI. The data is kept in the browser local storage and is rendered in retro tables, like in the original game. :joy:
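
As a rough illustration of this kind of storage, here is a minimal sketch. It is not the game's actual code: the storage key, the `name` and `time` fields, the ten-entry limit and the function names are all made up for the example. The functions take the storage object as a parameter, so in a browser you would pass `window.localStorage`, while the small in-memory stand-in below lets the sketch run in Node.js as well.

```javascript
const STORAGE_KEY = "memory-blocks-high-scores"; // hypothetical key name

// Read the saved list, falling back to an empty one if nothing valid is stored
function loadScores(storage) {
    try {
        return JSON.parse(storage.getItem(STORAGE_KEY)) || [];
    } catch (e) {
        return [];
    }
}

// Add an entry, keep the list sorted (lowest time first) and trimmed
function saveScore(storage, entry, limit = 10) {
    const scores = loadScores(storage);
    scores.push(entry);
    scores.sort((a, b) => a.time - b.time);
    const top = scores.slice(0, limit);
    storage.setItem(STORAGE_KEY, JSON.stringify(top));
    return top;
}

// In-memory stand-in with the two localStorage methods used above
function createMemoryStorage() {
    const items = new Map();
    return {
        getItem: key => (items.has(key) ? items.get(key) : null),
        setItem: (key, value) => { items.set(key, String(value)); }
    };
}

const storage = createMemoryStorage();
saveScore(storage, { name: "Alice", time: 95 });
saveScore(storage, { name: "Bob", time: 72 });
console.log(loadScores(storage).map(s => s.name));
// => [ 'Bob', 'Alice' ]
```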

Also, I implemented the retro windows (e.g. About & How to play), made it possible to switch between the grayscale and colored modes and choose the difficulty (little or big board).

:sunglasses: Step 6: Do it right!

Before making the game public, I wanted to find out who the original game author was and where I could find him. After a quick investigation, I found Charles Timmerman (funster.com) and I was 90% sure this was the guy I wanted to talk to.

I kindly emailed him and asked for permission to use the original game images in my clone. I was wondering if he would reply to me. :email:

After two days he sent me an enthusiastic reply. "Amazing! That looks exactly my game!" he said.

"I like how the high score table looks the same and even the black & white option is preserved. In those days, there were Windows computers that were in black & white! You might be interested to know that I wrote all the games in assembly language- my choice in those days over the usual C."

And he agreed to let me use the designs and images in my game clone. Yay! :tada:

:crystal_ball: Step 7: Open-source all the things

I open-sourced the repository on GitHub and tweeted it.

:sparkling_heart: Feedback

One of the things I like to do is to teach people to code. I usually share with them the news I have (e.g. projects I build, something interesting I do etc).

It was a nice surprise that one of my mentees sent me three books containing Bible puzzles by the author (because, yes, I like the Bible). Thanks, Nuvi!!! :grin: :cake:

Then, Timmerman sent me two signed books with so many great puzzles! You cannot know how thankful and happy I am because he allowed me to clone his game. :smile:

So now, when taking a break, I know what to do! Solving puzzles! :joy:

:rocket: Conclusion

Sometimes it is good to invest time in bringing old things back to life. Not only will you find people feeling nostalgic when using them, but such things will make them happy, and obviously they will make you happy too! And you may get shiny surprises as well! :blush:

To read the source code, report bugs, contribute, and play the game, check out the game repository on GitHub. :octocat:
