Enhanced Web Scraping With Screenshots

By Dirk Hoekstra on July 10, 2020


Web scraping is great for extracting data from websites.

However, sometimes you want to verify that the scraper works correctly.

Verifying this by reading the raw HTML is tough, and from experience I'd advise against it.

You could manually load the page, but this is hard to automate and can be time-consuming.

The solution: you can let the scraper take an automated screenshot of the web pages it crawls.

This is awesome because:

  • You can easily verify that the right data is extracted.
  • If the scraper breaks down, you can look at the screenshot to see what happened.

So, hopefully, I've convinced you why it's worth setting up a web scraper with automated screenshots.

In this article, I'll show you how you can do it!


Setting up a scraper

In this example, I'm going to create a web scraper using Node.js. The scraper will scrape product data from Amazon.

To create a new Node project I run the following command.

npm init

Next, I add the required dependencies.

npm install screenshotapi.net cheerio axios

Then I create an index.js file and add some Amazon URLs to scrape.

I loop over the URLs and print a log message (for now).

const urlsToScrape = [
    "https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36",
    "https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC",
    "https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y",
    "https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7",
];

for(const url of urlsToScrape) {
    console.log(`Scraping: ${url}`);
}

The result when running this is:

node index.js
Scraping: https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36
Scraping: https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC
Scraping: https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y
Scraping: https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7

So far everything works, nice! 🔥

Getting the product titles

Now I need to send a GET request for each URL. I'm going to use axios to do this.

To extract the title from the web pages I will use cheerio.

const axios = require("axios");
const cheerio = require('cheerio');

const urlsToScrape = [
    "https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36",
    "https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC",
    "https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y",
    "https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7",
];

for(const url of urlsToScrape) {
    axios.get(url)
    .then(result => {
        const $ = cheerio.load(result.data);
        const title = $("#productTitle").text().trim();

        console.log(`Title: ${title}`);
    })
    .catch(error => {
        console.error("Something went wrong!");
        console.error(error);
    })
}

When running this, it displays the correct titles!

Title: NSQTBA Womens Short Sleeve V Neck T Shirts Loose Casual Summer Tops Tees with Pocket
Title: California Design Den Americana Plaid Bedding King Size - 400 Thread Count Pure Cotton, Soft Sateen 4 Piece Checkered Sheet Set, Elasticized Deep Pocket Fits Low Profile Foam and Tall Mattresses
Title: Acer Aspire 5 Slim Laptop, 15.6 inches Full HD IPS Display, AMD Ryzen 3 3200U, Vega 3 Graphics, 4GB DDR4, 128GB SSD, Backlit Keyboard, Windows 10 in S Mode, A515-43-R19L,Silver
Title: Fisher-Price DoodlePro Slim, Aqua
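
Notice that the titles print in the order the requests complete, not the order of urlsToScrape, since each axios.get resolves independently. If you want results in input order, Promise.all preserves it. Here's a minimal sketch of that idea, with fetchTitle as a stand-in for the axios + cheerio step (the names and the simulated delay are mine, not part of the scraper above):

```javascript
// Promise.all resolves to results in the same order as the input array,
// no matter which request finishes first.
function fetchTitle(url) {
    // Simulate variable network latency instead of a real HTTP request.
    const delay = Math.random() * 50;
    return new Promise(resolve =>
        setTimeout(() => resolve(`Title for ${url}`), delay)
    );
}

async function scrapeAll(urls) {
    // Kick off all requests at once; await them together, in input order.
    return Promise.all(urls.map(fetchTitle));
}

scrapeAll(["https://example.com/a", "https://example.com/b"])
    .then(titles => console.log(titles));
```

In the real scraper, fetchTitle would be the axios.get + cheerio.load code from the loop above.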

Capturing the screenshots

Now let's do the fun part, getting the screenshots!

First I create the screenshotApiClient:

const screenshotApiClient = require('screenshotapi.net')('YOUR_API_TOKEN');

Then I use the following code to save a screenshot of a page to a file.

screenshotApiClient.saveScreenshotToImage(`${title}.png`, {
    url: url,
    width: 1920,
    height: 1080,
})
.catch((error) => {
    console.error("Error while getting screenshot.");
    console.dir(error);
})
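
One thing to watch: the screenshot is saved as `${title}.png`, and Amazon titles contain commas, spaces, and other characters that make clumsy (or on some filesystems invalid) filenames. A small helper could sanitize the title first. This helper is my addition, not part of the original code:

```javascript
// Hypothetical helper: strip anything outside a safe character set and
// collapse whitespace, so the title becomes a reasonable filename.
function toSafeFilename(title, maxLength = 100) {
    return title
        .replace(/[^a-zA-Z0-9 _-]/g, "")  // drop unsafe characters
        .trim()
        .replace(/\s+/g, "_")             // spaces -> underscores
        .slice(0, maxLength);             // keep the name short
}

console.log(toSafeFilename("Fisher-Price DoodlePro Slim, Aqua"));
// Fisher-Price_DoodlePro_Slim_Aqua
```

You would then save to `${toSafeFilename(title)}.png` instead of `${title}.png`.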

Putting it all together it looks like this.

const axios = require("axios");
const cheerio = require('cheerio');
const screenshotApiClient = require('screenshotapi.net')('YOUR_API_TOKEN');

const urlsToScrape = [
    "https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36",
    "https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC",
    "https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y",
    "https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7",
];

for(const url of urlsToScrape) {
    axios.get(url)
    .then(result => {
        const $ = cheerio.load(result.data);
        const title = $("#productTitle").text().trim();
        console.log(`${title}.png`);

        screenshotApiClient.saveScreenshotToImage(`${title}.png`, {
            url: url,
            width: 1920,
            height: 1080,
        })
        .catch((error) => {
            console.error("Error while getting screenshot.");
            console.dir(error);
        })
    })
    .catch(error => {
        console.error("Something went wrong!");
        console.error(error);
    })
}

And when I run the program a screenshot is saved for each Amazon web page. 🙌

Conclusion

In this example, I've shown how you can capture screenshots in a web scraper.

Note that right now it just captures every Amazon page. Instead, you could take a screenshot only when scraping a page fails, for example inside the .catch handler (or a try/catch block if you use async/await).

This way you'll have way more insight into why your web scraper fails!
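
A sketch of that error-only idea (the names here are illustrative, not from the scraper above). scrapeProduct receives the extraction and screenshot steps as functions, so the real axios + cheerio code and the real screenshotApiClient call can be plugged in:

```javascript
// Only capture a screenshot when extraction fails, then re-throw so the
// caller still sees the error.
async function scrapeProduct(url, extractTitle, takeScreenshot) {
    try {
        const title = await extractTitle(url);
        if (!title) throw new Error(`No title found on ${url}`);
        return title;
    } catch (error) {
        // Something went wrong: capture the page so it can be inspected later.
        await takeScreenshot(url);
        throw error;
    }
}
```

Plugging in the real pieces, extractTitle would be the axios + cheerio step, and takeScreenshot would call screenshotApiClient.saveScreenshotToImage.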

Happy coding!