Web scraping in NodeJS with Puppeteer


Last updated: September 27, 2022.

Web scraping is a popular way to get data currently existing on a web page.

For example, you may want to ‘scrape’ the current price of some products online, and use these to display as the current prices on your website.

The Puppeteer library makes this possible using JavaScript running in the NodeJS environment.


When to use a web scraper?

Web scraping is not always the best solution.

Most large, modern websites offer an API endpoint to which you can make requests to get data. Using an API service is more efficient and reliable than using a web scraper.

But sometimes this is not possible, especially for smaller websites. In these cases, web scraping is an alternative.
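To see why an API is usually the easier route: if a site exposed a (hypothetical) JSON endpoint, the response could be parsed directly, with no browser or DOM queries involved. A minimal sketch, with an invented payload shape for illustration:

```javascript
// Hypothetical JSON payload, as a site's API endpoint might return it
const responseBody = '{"products": [{"name": "Widget", "price": 9.99}]}';

// Parsing is all that is needed - no browser, no DOM queries
const data = JSON.parse(responseBody);
const prices = data.products.map((p) => p.price);

console.log(prices); // [ 9.99 ]
```

Compare this with the scraping examples below, where a full browser must load and render the page before any data can be extracted.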

Installing Puppeteer

To install Puppeteer, it is necessary to have NodeJS installed on your system.

To check, type the following from the command line (e.g. PowerShell in Windows, Terminal in Mac):

node -v

This should return a version number, such as v16.14.2. If not, you need to install Node first and then run this command again.

Assuming NodeJS is installed, create a new project folder anywhere on your system. Then, set the current directory to this folder from the command line. For example:

cd C:\Users\OpenJavaScript\Desktop\web-scraping

Now, in the new folder, initiate a new NodeJS project:

npm init --yes

The --yes (or -y) flag accepts the default settings for a new NodeJS project, which is suitable for testing purposes. Running this will create a package.json file in the root directory of the project folder.

Now, it is time to install Puppeteer!

npm install puppeteer --save

The --save flag specifies that Puppeteer should be saved as a dependency for this project, listed in package.json. (Since npm 5 this is the default behavior, so the flag can be omitted.)

Using Puppeteer

First steps: app setup

First, create a new JavaScript file (e.g. index.js) in the root directory of the project folder where Puppeteer has been installed.

Inside this file, we will build an app that will scrape data from a web page using Puppeteer.

So at the head of your file, include the following to import Puppeteer to the project:

/* Import Puppeteer to your script */

const puppeteer = require("puppeteer");

Below, create a self-executing (immediately invoked) async function. We will load Puppeteer inside this function, using the await keyword to wait for asynchronous processes to complete.

To begin with, load Puppeteer with the launch method available on puppeteer and pass in an object argument with a headless property set to false:

/* Launching Puppeteer and creating a new browser instance */

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
})();

Now, from the command line while still in the project folder, run the app:

node index

Notice anything unusual?

Puppeteer downloads its own bundled Chromium browser when installed, so you do not need Chrome: an empty browser page should have opened.

That’s how Puppeteer works: it drives a real browser, loading each page and then scraping it once it has loaded.

Remove { headless: false } and this will happen silently in the background. But, especially at first, it is helpful to also see what Puppeteer is seeing.

Navigating to a page

After launching Puppeteer, the next step is to open a new page (tab). You can do this using the newPage method available on the newly created browser object.

Now, to navigate to a page, you can use the goto method on the new page object.

/* Navigate to a page with Puppeteer */

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch( { headless: false } )
  const page = await browser.newPage()
  await page.goto("https://bbc.com")
})()

This will navigate to the BBC homepage. If you use { headless: false }, you will see this happening live!

Scraping data from a page

Now, for the exciting part.

You can now use the evaluate method on the page object, which is currently set to the page you want to scrape.

Evaluate accepts a callback function. Inside this function, you can run JavaScript on the page in the browser!

/*Use page.evaluate() to run JavaScript in the browser */

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch( { headless: false } )
  const page = await browser.newPage()
  await page.goto("https://bbc.com")

  const data = await page.evaluate(() => {
  // Enter JavaScript to run on the page here!
  })
})()

So all the methods you would normally use when running JavaScript in the browser are available to you to query the page and get data.

Set whatever data you scrape to be the return value of the callback function. This will be stored in the data variable.
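One caveat: the return value of page.evaluate() has to be serialized to travel from the browser process back to Node, so only JSON-serializable data survives the trip; functions and DOM nodes do not. The effect is roughly a JSON round trip, sketched here with plain Node (no Puppeteer needed to see it):

```javascript
// Roughly how a value returned from page.evaluate() reaches Node:
// it must survive serialization, much like a JSON round trip
const roundTrip = (value) => JSON.parse(JSON.stringify(value));

const scraped = {
  event: "Food Festival", // strings survive
  count: 3,               // numbers survive
  node: undefined,        // undefined properties are dropped
};

const received = roundTrip(scraped);
console.log(received); // { event: 'Food Festival', count: 3 }
```

This is why the examples below return plain strings extracted with textContent or getAttribute(), rather than the DOM elements themselves.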

Here is an example that scrapes data about upcoming events in a town from a local tourism website:

/* Scraping data from a page */

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch( { headless: false } )
  const page = await browser.newPage()
  await page.goto("https://www.visit1066country.com/destinations/hastings/whats-on")


  const data = await page.evaluate(() => {
     // Create array to store objects:
    const list = [];
    // Get each product container:
    const items = document.querySelectorAll(".productList > li");

    // Each iteration of loop then pushes an object to the array:
    for (let i = 0; i < items.length; i++) {
      list.push({
        event: items[i].querySelector(".ProductName a").textContent,
        dates: items[i].querySelector(".dates").innerHTML,
        link: 'https://www.visit1066country.com'+items[i].querySelector(".ProductDetail").getAttribute('href'),
      })
    }
    // Return the array of objects (console.log here will print in the browser!)
    return list
  })

  console.log(data); // See the scraped data in Node

  await browser.close();
})()

In the example above, an empty array is created and data is pushed into the array using a loop.

The result of this is then made the return value of page.evaluate(), and the output can be seen by calling console.log() after the function.

Scraping data from multiple pages

Scraping data from more than one page is just a repetition of the steps for scraping a single page.

Use page.goto() to go to a different page and then call page.evaluate() again, passing in a callback function that will return the result of the scraping.

In the example below, page 1 and then page 2 of the local events are scraped:

/* Scraping data from two pages */

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch( { headless: false } )
  const page = await browser.newPage()
  await page.goto("https://www.visit1066country.com/destinations/hastings/whats-on")

  const data_p1 = await page.evaluate(() => {
    const list = []
    const items = document.querySelectorAll(".productList > li");

    for (let i = 0; i < items.length; i++) {
      list.push({
        event: items[i].querySelector(".ProductName a").textContent,
        dates: items[i].querySelector(".dates").innerHTML,
        link: 'https://www.visit1066country.com'+items[i].querySelector(".ProductDetail").getAttribute('href'),
      })
    }
    
    return list
  })

  console.log(data_p1); // Show page 1 data

  await page.goto("https://www.visit1066country.com/destinations/hastings/whats-on/?p=2")

  const data_p2 = await page.evaluate(() => {
    const list = []
    const items = document.querySelectorAll(".productList > li");

    for (let i = 0; i < items.length; i++) {
      list.push({
        event: items[i].querySelector(".ProductName a").textContent,
        dates: items[i].querySelector(".dates").innerHTML,
        link: 'https://www.visit1066country.com'+items[i].querySelector(".ProductDetail").getAttribute('href'),
      })
    }
    
    return list
  })

  console.log(data_p2); // Show page 2 data

  await browser.close();
})()
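Since the two blocks above are identical apart from the URL, the repetition can be factored out. A small helper can build the URL for each page number (the ?p=2 query-string pattern is taken from this particular site; other sites will use different pagination schemes):

```javascript
const BASE = "https://www.visit1066country.com/destinations/hastings/whats-on";

// Build the URL for a given page number (page 1 has no query string)
function pageUrl(n) {
  return n === 1 ? BASE : `${BASE}/?p=${n}`;
}

console.log(pageUrl(1)); // https://www.visit1066country.com/destinations/hastings/whats-on
console.log(pageUrl(2)); // https://www.visit1066country.com/destinations/hastings/whats-on/?p=2
```

Inside the async function you could then loop, e.g. `for (let p = 1; p <= 2; p++) { await page.goto(pageUrl(p)); results.push(await page.evaluate(scrapeList)); }`, where scrapeList is the same callback used in the examples above.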

Ethical issues

There are ethical issues to take into consideration before web scraping.

On the one hand, scraping publicly available information is just another means of accessing a site and taking note of the information – only it is automated.

But even for freely available public information, frequent scraping of a site can place a heavy load on servers. This can cause problems for webmasters. Therefore, try to avoid scraping too often.
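One simple courtesy, assuming the site publishes no guidance of its own, is to pause between page loads. A promise-based delay fits naturally with the await style used throughout this article:

```javascript
// Resolve after the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside an async function, between page loads:
//   await page.goto(url1);
//   await delay(2000); // wait 2 seconds before the next request
//   await page.goto(url2);
```

Checking the site's robots.txt before scraping is also good practice.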

And, of course, copyright laws still apply to scraped content.

Summary

Web scraping is a popular way to get data from a website. The Puppeteer library makes this possible using JavaScript running in the NodeJS environment.

Note that before using a web scraper, it is a good idea to check if a website provides the data you are looking for via an API.
