Puppeteer, created by Google, is a Node.js library offering a high-level API for controlling both headless and headful browsers via the DevTools Protocol.
Retrieving a page’s HTML is useful whenever we need to work with the raw markup, whether it’s for web scraping, data extraction, or other tasks that involve manipulating or analyzing the page’s structure.
To get the HTML content of the current page, we use Puppeteer's page.content() method. It returns a Promise that resolves to the full HTML string of the page, including the doctype.
await page.content();
The await keyword pauses the execution of the enclosing async function until the Promise returned by the expression that follows it is settled. Code outside that function keeps running in the meantime.
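The behavior of await can be seen without a browser. The following is a minimal sketch in plain Node.js; delay, fetchLater, and events are hypothetical names used only for illustration, and the timer stands in for a slow operation such as page.content().

```javascript
// Records the order in which things happen.
const events = [];

// Hypothetical helper: returns a Promise that resolves with `value` after `ms` milliseconds.
function delay(ms, value) {
  return new Promise((resolve) => setTimeout(resolve, ms, value));
}

async function fetchLater() {
  events.push('start');
  // fetchLater pauses here until the Promise resolves.
  const value = await delay(20, 'done');
  events.push(value);
  return value;
}

// Calling the async function returns a pending Promise right away.
const pending = fetchLater();

// This line runs while fetchLater is still paused at its await.
events.push('caller continues');

pending.then(() => {
  console.log(events); // logs [ 'start', 'caller continues', 'done' ]
});
```

Only the async function is suspended at the await; the caller continues, which is why the Puppeteer script wraps all of its steps in a single async function.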
Run the following code to see the HTML content of the opened page logged to the terminal.
const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the desired URL
  await page.goto('https://www.scrapethissite.com/login/');
  // Get the HTML content of the page
  const html = await page.content();
  // Log the extracted HTML content
  console.log(html);
  // Close the browser
  await browser.close();
})();
Note that no browser window appears when the script runs. This is because Puppeteer launches the browser in headless mode (no visible GUI) by default.
Line 1: We import the Puppeteer library using the require function in Node.js. This loads the Puppeteer module, making all of its functionality accessible within the script under the variable name puppeteer.
Line 2: We define an asynchronous function using the async keyword and invoke it immediately (line 17). Inside this function:
On lines 4–6, we launch the browser with Puppeteer.
On line 8, we create a new page.
On line 10, we open the desired URL.
On line 12, we extract the HTML of the opened page.
On line 14, we log the HTML of the page.
On line 16, we close the browser.
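The steps above can also be wrapped in a reusable function. The following is a minimal sketch, not a definitive implementation: getPageHtml and its launcher parameter are hypothetical names, and launcher is assumed to be the puppeteer module (or any object exposing a compatible launch() method). The try...finally block is an addition to the original script so that the browser is closed even if navigation fails.

```javascript
// Hypothetical helper: returns the HTML of `url` as a string.
// `launcher` is assumed to be the puppeteer module.
async function getPageHtml(launcher, url) {
  // Launch a headless browser
  const browser = await launcher.launch({ args: ['--no-sandbox'] });
  try {
    // Open a new page and navigate to the desired URL
    const page = await browser.newPage();
    await page.goto(url);
    // Resolve with the HTML content of the page
    return await page.content();
  } finally {
    // Runs whether the steps above succeeded or threw
    await browser.close();
  }
}

// Usage (requires Puppeteer to be installed):
// const puppeteer = require('puppeteer');
// getPageHtml(puppeteer, 'https://www.scrapethissite.com/login/')
//   .then((html) => console.log(html));
```

Passing the module in as an argument keeps the function free of a hard dependency, which makes it easy to try with a stand-in object before pointing it at a real browser.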
Note: We pass the --no-sandbox argument to the puppeteer.launch() function to disable the browser's sandbox so that the browser can open on the Educative platform. If you're running the script on your local machine, this argument might be unnecessary.
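One way to keep the same script usable both locally and in a restricted environment is to build the launch options from a switch. This is a minimal sketch under stated assumptions: buildLaunchOptions, its showBrowser and disableSandbox parameters, and the PUPPETEER_NO_SANDBOX environment variable are hypothetical names chosen for illustration and are not part of Puppeteer. Only the headless and args launch options come from Puppeteer itself.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
function buildLaunchOptions({ showBrowser = false, disableSandbox = false } = {}) {
  const options = {
    // headless: false opens a visible browser window; true keeps it hidden.
    headless: !showBrowser,
  };
  if (disableSandbox) {
    // Add the flag only where the environment requires it.
    options.args = ['--no-sandbox'];
  }
  return options;
}

// Assumed convention: an environment variable opts in to --no-sandbox.
const launchOptions = buildLaunchOptions({
  disableSandbox: process.env.PUPPETEER_NO_SANDBOX === '1',
});
console.log(launchOptions);

// Usage (requires Puppeteer to be installed):
// const browser = await require('puppeteer').launch(launchOptions);
```

Setting showBrowser to true is a convenient way to watch the page load while debugging a script locally.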