Enhanced Web Scraping With Screenshots

By Dirk Hoekstra on July 10, 2020


Web scraping is great for extracting data from websites.

However, suppose you want to capture a screenshot of the website you're scraping.

This could be useful to validate that the scraper extracts the right data from the website.

Web scrapers also often break down. This can be because of several reasons. But, usually it is either because of a change in the web page or because of an automated bot check.

You can add some logging to the scraper for when it breaks, but how much nicer would it be to see exactly what the scraper saw when the error occurred?

In this article, I'll set up such a system with the help of ScreenshotAPI.net.


Setting up a scraper

In this example, I'm going to create a web scraper using Node.js. The scraper will extract product data from Amazon.

To create a new Node project, I run the following command:

npm init

Next, I add the required dependencies.

npm install screenshotapi.net cheerio axios

Then I create an index.js file and add some Amazon URLs to scrape.

I loop over the URLs and print a log message (for now).

const urlsToScrape = [
    "https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36",
    "https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC",
    "https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y",
    "https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7",
];

for(const url of urlsToScrape) {
    console.log(`Scraping: ${url}`);
}

The result when running this is:

node index.js
Scraping: https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36
Scraping: https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC
Scraping: https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y
Scraping: https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7

So far everything works, nice! 🔥

Getting the product titles

Now I need to send a GET request for each URL. I'm going to use axios to do this.

To extract the title from each web page, I'll use cheerio.

const axios = require("axios");
const cheerio = require('cheerio');

const urlsToScrape = [
    "https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36",
    "https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC",
    "https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y",
    "https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7",
];

for(const url of urlsToScrape) {
    axios.get(url)
    .then(result => {
        const $ = cheerio.load(result.data);
        const title = $("#productTitle").text().trim();

        console.log(`Title: ${title}`);
    })
    .catch(error => {
        console.error("Something went wrong!");
        console.error(error);
    })
}

When running this, it displays the correct titles! (They print in the order the responses arrive, since the requests run concurrently.)

Title: NSQTBA Womens Short Sleeve V Neck T Shirts Loose Casual Summer Tops Tees with Pocket
Title: California Design Den Americana Plaid Bedding King Size - 400 Thread Count Pure Cotton, Soft Sateen 4 Piece Checkered Sheet Set, Elasticized Deep Pocket Fits Low Profile Foam and Tall Mattresses
Title: Acer Aspire 5 Slim Laptop, 15.6 inches Full HD IPS Display, AMD Ryzen 3 3200U, Vega 3 Graphics, 4GB DDR4, 128GB SSD, Backlit Keyboard, Windows 10 in S Mode, A515-43-R19L,Silver
Title: Fisher-Price DoodlePro Slim, Aqua
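
If you'd rather have the titles printed in the same order as urlsToScrape, here is a minimal sketch using async/await and Promise.all. It reuses the same axios and cheerio setup from above; no new dependencies are needed.

// A sketch: fetch the pages concurrently, but print the titles in input order.
async function scrapeTitlesInOrder() {
    const titles = await Promise.all(
        urlsToScrape.map(async (url) => {
            const result = await axios.get(url);
            const $ = cheerio.load(result.data);
            return $("#productTitle").text().trim();
        })
    );

    for (const title of titles) {
        console.log(`Title: ${title}`);
    }
}

scrapeTitlesInOrder();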

Capturing the screenshots

Now let's do the fun part: getting the screenshots!

First I create the screenshotApiClient:

const screenshotApiClient = require('screenshotapi.net')('YOUR_API_TOKEN');

Then I use the following code to save a screenshot of a website to a file.

screenshotApiClient.saveScreenshotToImage(`${title}.png`, {
    url: url,
    width: 1920,
    height: 1080,
})
.catch((error) => {
    console.error("Error while getting screenshot.");
    console.dir(error);
})

Putting it all together, it looks like this.

const axios = require("axios");
const cheerio = require('cheerio');
const screenshotApiClient = require('screenshotapi.net')('YOUR_API_TOKEN');

const urlsToScrape = [
    "https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36",
    "https://www.amazon.com/Fisher-Price-GJT93-Doodle-Stamper-Multicolor/dp/B07MDYVVSC",
    "https://www.amazon.com/NSQTBA-Womens-Summer-Sleeve-Casual/dp/B087R7JR3Y",
    "https://www.amazon.com/California-Design-Den-Long-Staple-Bedsheets/dp/B077N6W2G7",
];

for(const url of urlsToScrape) {
    axios.get(url)
    .then(result => {
        const $ = cheerio.load(result.data);
        const title = $("#productTitle").text().trim();
        console.log(`${title}.png`);

        screenshotApiClient.saveScreenshotToImage(`${title}.png`, {
            url: url,
            width: 1920,
            height: 1080,
        })
        .catch((error) => {
            console.error("Error while getting screenshot.");
            console.dir(error);
        })
    })
    .catch(error => {
        console.error("Something went wrong!");
        console.error(error);
    })
}

And when I run the script, a screenshot is saved for each Amazon web page. 🙌

Conclusion

In this example, I've shown how you can capture screenshots in a web scraper.

Note that right now it captures a screenshot of every Amazon page. You could catch errors instead and only take a screenshot of a page when scraping it fails, as sketched below.
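
Here is a minimal sketch of that idea. It reuses axios, cheerio, and the screenshotApiClient from above; the empty-title check and the error-file naming are just assumptions for illustration.

// A sketch: only capture a screenshot when scraping a page fails.
async function scrapeWithScreenshotOnError(url) {
    try {
        const result = await axios.get(url);
        const $ = cheerio.load(result.data);
        const title = $("#productTitle").text().trim();

        if (!title) {
            // An empty title usually means the page changed or a bot check was served.
            throw new Error("Could not find #productTitle on the page");
        }

        console.log(`Title: ${title}`);
    } catch (error) {
        console.error(`Scraping failed for ${url}: ${error.message}`);

        // Capture what the scraper saw, so the failure can be inspected later.
        await screenshotApiClient.saveScreenshotToImage(`error-${Date.now()}.png`, {
            url: url,
            width: 1920,
            height: 1080,
        });
    }
}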

This way you'll have way more insight into why your web scraper fails!

Happy coding!