PROJECT
Headless Web Scraping Using Puppeteer
In this project, we’ll learn to scrape text, images, and URLs from a web page. We’ll also fetch data in the form of HTML elements using multiple Puppeteer commands. Lastly, we’ll automate the scraping using schedulers.
You will learn to:
Scrape text data from web pages.
Scrape HTML data to create PDFs.
Scrape images from web pages.
Schedule the scraping.
Skills
Web Scraping
Data Collection
Task Automation
Prerequisites
Intermediate understanding of JavaScript
Basic understanding of Node.js
Basic understanding of cron
Technologies
Node.js
Puppeteer
JavaScript
Project Description
Puppeteer is a Node.js library that controls browsers through an API. It was initially designed to work only with Chromium-based browsers, but it now also supports Firefox. It runs in headless mode by default, but it can also be configured to run with a visible browser window (non-headless mode).
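As a rough illustration of the headless default, the sketch below shows one way the mode might be chosen at launch. The helper names `launchOptions` and `openPage` are hypothetical and not part of the project; only `puppeteer.launch`, `browser.newPage`, and `page.goto` are Puppeteer calls. It assumes the `puppeteer` package is installed.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
function launchOptions(showBrowser = false) {
  // `headless: true` is Puppeteer's default; `false` opens a visible window.
  return { headless: !showBrowser };
}

// Hypothetical helper: launches a browser and opens `url` in a new tab.
async function openPage(url, showBrowser = false) {
  // Loaded lazily so launchOptions() can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(launchOptions(showBrowser));
  const page = await browser.newPage();
  await page.goto(url);
  // The caller is responsible for calling browser.close() when done.
  return { browser, page };
}
```

Calling `openPage(url, true)` would open a visible window, which can help while debugging selectors. The default call stays headless.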
In this project, we’ll build a Node application to scrape data from a web-based e-library application using Puppeteer and a headless Chromium browser. Throughout this project, we’ll use multiple Puppeteer functions to fetch HTML elements using CSS class names and HTML tags.
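To make the idea of fetching elements by class name and by tag more concrete, here is a minimal sketch. It assumes an already opened Puppeteer `page`. The class name `.book-description` and the function name `extractData` are made up for illustration, and the real e-library markup may differ.

```javascript
// Hypothetical helper: reads one element by CSS class and many by HTML tag.
async function extractData(page) {
  // page.$eval runs the callback on the first match and throws if none exists.
  const description = await page.$eval(
    '.book-description', // assumed class name, not taken from the project
    (el) => el.textContent.trim()
  );

  // page.$$eval runs the callback on an array of all matching elements.
  const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
  const images = await page.$$eval('img', (imgs) => imgs.map((img) => img.src));

  return { description, links, images };
}
```

Both callbacks execute inside the browser page, so they can only return serializable values such as strings and arrays.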
Furthermore, we’ll use Node functions to automate the scraping of this website on a schedule.
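One possible shape for this automation is sketched below, first with Node's built-in timers and then with the `node-cron` package. The helper names `scheduleScrape` and `scheduleWithCron` and the hourly cron expression are assumptions for illustration, and the project's own tasks may structure this differently.

```javascript
// Hypothetical helper: runs `task` immediately, then every `intervalMs` milliseconds.
// Returns a function that stops the schedule.
function scheduleScrape(task, intervalMs) {
  const run = () =>
    Promise.resolve()
      .then(task)
      // Log failures so one bad run does not stop later runs.
      .catch((err) => console.error('Scrape failed:', err));

  run();
  const timer = setInterval(run, intervalMs);
  return () => clearInterval(timer);
}

// Hypothetical helper: the same idea with node-cron (assumes it is installed).
// '0 * * * *' means "at minute 0 of every hour".
function scheduleWithCron(task, expression = '0 * * * *') {
  const cron = require('node-cron'); // loaded lazily
  return cron.schedule(expression, task);
}
```

The timer version needs no extra dependency, while a cron expression is easier to read for calendar-style schedules such as "every day at midnight".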
Project Tasks
1
Introduction
Task 0: Run the Next.js Application
Task 1: Access the Web Page
Task 2: Take a Web Page Screenshot
2
Extract Data
Task 3: Extract the Description from the Text
Task 4: Extract the Links from the Screen
Task 5: Extract Images from the Web Page
Task 6: Save the Extracted Images
Task 7: Create a PDF File from the Collected Data
3
Schedule
Task 8: Automate the Scraping
Task 9: Use node-cron to Automate Scraping
Congratulations!
Relevant Course
Use the following content to review prerequisites or explore specific concepts in detail.