Businesses today are savvier than ever, employing streamlined data-gathering methods to outmaneuver the competition. Two of the most popular web scraping tools that highly adaptive enterprises commonly use are Node.js and Puppeteer.
Specifically, this article is a Puppeteer tutorial on using this time-saving tool together with Node.js to scrape websites and extract data for a wide variety of applications.
Technical Terms Defined
Before delving deeper into the step-by-step process, you'll need to know the following technical terms, which are explained below and used throughout this tutorial.
1. Web Scraping
Web scraping, sometimes called web data extraction or web harvesting, refers to an organized method of collecting data from the web in an automated manner.
It applies to lead generation, market research, news monitoring, and price intelligence, among other use cases. Web scraping is one of the most commonly used data scraping techniques today.
2. Node.js
As the .js in its name implies, Node.js is an open-source JavaScript runtime environment with back-end support. It executes JavaScript code outside a web browser and runs on the V8 JavaScript engine.
Node.js allows developers to use the JavaScript programming language to write command-line tools and run server-side scripts that produce dynamic web page content.
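To give a feel for what that looks like in practice, here is a minimal, hedged sketch of a server-side Node.js script; the server.js filename, port, and greeting text are assumptions for illustration only.
const http = require('http');

// A minimal sketch (not from this tutorial): a tiny HTTP server that Node.js
// runs outside any web browser. Save it as server.js (assumed name) and start
// it with: node server.js
const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('Hello from Node.js\n'); // dynamic page content could be produced here
});

server.listen(3000, () => console.log('Listening on http://localhost:3000'));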
3. Puppeteer
Puppeteer is a software library for Node.js that provides a high-level application programming interface (API).
Its API primarily controls "headless" Google Chrome or Chromium browsers, that is, browsers running without a graphical user interface (GUI), over the Chrome DevTools (web development tools) protocol.
In other words, hobbyists and professionals use Puppeteer to automate the Chrome browser from scripts and the command line rather than by hand. Moreover, as a Node.js library, Puppeteer can also be configured to drive full (non-headless) Chrome or Chromium browsers. For more information, this link explains what Puppeteer is in more depth.
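As a rough sketch of what controlling a headless browser looks like (not part of the setup guide below), the following few lines launch Chromium, open a page, and print its title; the example.com URL is just a placeholder.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true }); // no GUI
    const page = await browser.newPage();
    await page.goto('https://example.com'); // placeholder URL
    console.log(await page.title());        // prints the page title
    await browser.close();
})();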
Importance of Web Scraping With Node.js and Puppeteer
Generally, web scraping is crucial for businesses because it allows for the fast, efficient extraction of data, often news and other publicly available content, from multiple sources.
As a result, web scraping gives businesses access to vital information, such as contact details (names, email addresses, websites, etc.), price comparison data, and customer reviews, which they can then analyze thoroughly and use to improve their daily operations.
In particular, pairing web scraping with Node.js and Puppeteer allows for the ethical scraping of dynamic websites.
Beyond this responsible way of extracting data, combining Node.js and Puppeteer for web scraping makes it easier to collate the latest news in your area, analyze product and service prices, or collect the newest back-end development job openings, among other specific tasks.
Advantages of Using Node.js and Puppeteer for Web Scraping
The use of Node.js and Puppeteer for web scraping yields a reasonable number of benefits for businesses. Some of these advantages include the following:
1. Execution Performance and Speed
Although Puppeteer currently works only with Chrome or Chromium, there's no denying that this Node.js library provides fine-grained control over the browser.
Furthermore, the library's default headless mode is impressively fast, which translates to faster web scraping overall.
2. Support for Taking Screenshots
Puppeteer fully supports taking screenshots and generating Portable Document Format (PDF) files of web pages. This screenshot support also makes it useful for user interface (UI) testing.
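As a quick illustration, and assuming output filenames of page.png and page.pdf, a sketch of this support might look like the following.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com'); // placeholder URL
    await page.screenshot({ path: 'page.png', fullPage: true }); // full-page screenshot
    await page.pdf({ path: 'page.pdf', format: 'A4' });          // PDF of the same page
    await browser.close();
})();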
3. Testing Platform Support
Perhaps one of the biggest boons of using Puppeteer as an automation tool and go-to software library for web scraping is its compatibility with testing platforms. As a result, performing unit-level tests for virtually any web application becomes possible.
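For instance, a minimal sketch of a Puppeteer-driven test might look like the following; Jest is assumed here as the test runner, and the expected title is based on the books.toscrape.com demo site used later in this tutorial.
const puppeteer = require('puppeteer');

// A hedged sketch of a unit-level UI test driven by Puppeteer (Jest assumed as the runner).
describe('books.toscrape.com', () => {
    let browser, page;

    beforeAll(async () => {
        browser = await puppeteer.launch({ headless: true });
        page = await browser.newPage();
    });

    afterAll(async () => {
        await browser.close();
    });

    test('home page has the expected title', async () => {
        await page.goto('http://books.toscrape.com');
        expect(await page.title()).toContain('Books to Scrape');
    });
});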
4. Relatively Straightforward Installation and Setup
The installation and setup for Puppeteer are user-friendly. Nearly everything is easy to set up with just one command.
Setup Guide for Scraping a Website With Puppeteer
This section will focus only on the most basic aspects of this Puppeteer tutorial: setting up a web scraper with Puppeteer, setting up a web browser instance, scraping data from one web page, and saving that data as a JSON file.
Anything beyond the scope of this setup guide, such as installing Node.js as a prerequisite, won't be covered here.
With that said, once you have completed the Node.js installation, you may proceed to the first step.
Step 1 – Web Scraper and Puppeteer Setup
- Create a folder to serve as the project's root directory and navigate into it. The required dependency will be installed with npm (the default Node.js package manager) in a later step.
$ mkdir web-scraper
$ cd web-scraper
- Initialize npm for this project.
$ npm init
- For this Puppeteer tutorial walkthrough, simply press "Enter" at each prompt. When you're prompted to type and enter "yes," please do so.
- Install Puppeteer using npm.
$ npm install --save puppeteer
- After npm finishes installing Puppeteer and its dependencies, open the file named package.json for one last configuration step.
$ nano package.json
- Look for the section named "scripts": and update it to match the configuration below.
{
  . . .
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "node index.js"
  },
  . . .
  "dependencies": {
    "puppeteer": "^5.2.1"
  }
}
Step 2 – Browser Instance Setup
- Create a JavaScript file named browser.js within your project root directory and open it in a text editor.
$ nano browser.js
- Create an async function named startBrowser(), add the following code, and save and close browser.js.
const puppeteer = require('puppeteer');

async function startBrowser(){
    let browser;
    try {
        console.log("Opening the browser...");
        browser = await puppeteer.launch({
            headless: true,
            args: ["--disable-setuid-sandbox"],
            'ignoreHTTPSErrors': true
        });
    } catch (err) {
        console.log("Could not create a browser instance > : ", err);
    }
    return browser;
}

module.exports = {
    startBrowser
};
- Create another JavaScript file named: index.js
$ nano index.js
- Require both browser.js and pageController.js, call the startBrowser() function, relay the new browser instance to the page controller, and save and close the file.
const browserObject = require('./browser');
const scraperController = require('./pageController');

// Starts the web browser and creates a browser instance
let browserInstance = browserObject.startBrowser();

// Passes the web browser instance to the web scraper controller
scraperController(browserInstance)
- Create a third JavaScript file named: pageController.js
$ nano pageController.js
- The pageController.js file manages the scraping process. It uses the web browser instance to control another JavaScript file (pageScraper.js), where all the web scraping scripts are executed.
- For now, use the code below to have Chromium navigate to a web page, then save and close the pageController.js file.
const pageScraper = require('./pageScraper');

async function scrapeAll(browserInstance){
    let browser;
    try{
        browser = await browserInstance;
        await pageScraper.scraper(browser);
    }
    catch(err){
        console.log("Couldn't resolve browser instance > ", err);
    }
}

module.exports = (browserInstance) => scrapeAll(browserInstance)
- Create the fourth and last JavaScript file named: pageScraper.js
$ nano pageScraper.js
- Open pageScraper.js, add the code below, then save and close this file.
const scraperObject = {
    url: 'http://books.toscrape.com',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
    }
}

module.exports = scraperObject;
- Finally, type the following command, hit the "Enter" key, and watch your scraper application run.
$ npm run start
- The executed command opens a Chromium browser instance, creates a new page, and navigates to books.toscrape.com automatically. (Because the browser launches in headless mode, you won't see a window; the navigation happens in the background.)
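- If you would rather watch the browser window while the scraper runs, an optional tweak to the launch options in browser.js (sketched below) disables headless mode; switch it back to headless: true afterward for faster scraping.
// Optional: in browser.js, launch a visible (non-headless) browser to watch the
// navigation happen. Remember to switch back to headless: true for speed.
browser = await puppeteer.launch({
    headless: false,                      // show the browser window
    args: ["--disable-setuid-sandbox"],
    'ignoreHTTPSErrors': true
});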
Step 3 – Single-Page Data Scraping
- Open Google Chrome, manually navigate to books.toscrape.com, and browse the site to get familiar with how its data is structured.
- Open the pageScraper.js file.
$ nano pageScraper.js
- Update pageScraper.js with the following code, then save and close the file.
const scraperObject = {
    url: 'http://books.toscrape.com',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
        // Waits for the required DOM to be rendered
        await page.waitForSelector('.page_inner');
        // Gets the links to all the required books
        let urls = await page.$$eval('section ol > li', links => {
            // Makes sure the book about to be scraped is in stock
            links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
            // Extracts the links from the data
            links = links.map(el => el.querySelector('h3 > a').href)
            return links;
        });
        console.log(urls);
    }
}

module.exports = scraperObject;
- Type and enter the following command to re-run your application.
$ npm run start
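- The steps above only log the scraped URLs to the console. To make good on saving that data as a JSON file, a small helper like the sketch below could be called from pageScraper.js right after console.log(urls); the saveToJson name and the scraped-books.json output path are assumptions for illustration only.
const fs = require('fs');

// A hedged sketch (not part of the steps above) for persisting scraped data as JSON.
async function saveToJson(data, filePath = './scraped-books.json') {
    // Serialize with indentation so the output file stays human-readable
    await fs.promises.writeFile(filePath, JSON.stringify(data, null, 2));
    console.log(`Saved ${data.length} records to ${filePath}`);
}

module.exports = { saveToJson };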
Conclusion
After setting up a Puppeteer-powered web scraper and browser instance, you'll need to keep tinkering with data scraping techniques to get the hang of it.
Once you become accustomed to web scraping using the previous steps, you'll undoubtedly find it helpful for data analysis and research.