How to get web page HTML with Puppeteer

Puppeteer, created by Google, is a Node.js library that provides a high-level API for controlling headless and headful browsers over the DevTools Protocol.

Retrieving a page’s HTML is useful for web scraping, data extraction, and other tasks that involve manipulating or analyzing the page’s structure. In Puppeteer, the page.content() method returns the full HTML of the page, including the DOCTYPE.

const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the desired URL
  await page.goto('https://www.scrapethissite.com/login/');
  // Get the HTML content of the page
  const html = await page.content();
  // Log the extracted HTML content
  console.log(html);
  // Close the browser
  await browser.close();
})();

Running code example for getting HTML of a web page with Puppeteer

When you run this code, no browser window appears. This is because Puppeteer launches the browser in headless mode (no visible GUI) by default.

Code explanation

Line 1: We import the Puppeteer library using the require function in Node.js. This action loads the Puppeteer module, making all of its functionality accessible within the script under the variable name puppeteer.
Line 2: We define an asynchronous function using the async keyword and invoke it immediately (line 17), which lets us use await inside it. Inside this function:
- On lines 4–6, we launch the browser with Puppeteer.
- On line 8, we create a new page.
- On line 10, we open the desired URL.
- On line 12, we extract the HTML of the opened page.
- On line 14, we log the HTML of the page.
- On line 16, we close the browser.
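The steps above can also be wrapped in a reusable helper. The following is a minimal sketch, not part of the original example: getPageHtml and its parameters are hypothetical names, and the Puppeteer module is passed in as an argument (an assumption made here so the helper can be exercised without launching a real browser). The try/finally block closes the browser even if navigation fails.

```javascript
// Sketch only: getPageHtml, launcher, and launchOptions are illustrative names,
// not part of the Puppeteer API. `launcher` is expected to be the puppeteer
// module (or any object that exposes a compatible launch() method).
async function getPageHtml(launcher, url, launchOptions = {}) {
  // Launch the browser (headless by default)
  const browser = await launcher.launch(launchOptions);
  try {
    // Open a new page and navigate to the URL
    const page = await browser.newPage();
    await page.goto(url);
    // Resolve with the full HTML of the page
    return await page.content();
  } finally {
    // Runs on success and on error, so the browser process is not left behind
    await browser.close();
  }
}

// Usage (assumes puppeteer is installed):
// const puppeteer = require('puppeteer');
// getPageHtml(puppeteer, 'https://www.scrapethissite.com/login/', { args: ['--no-sandbox'] })
//   .then((html) => console.log(html));
```

Returning the HTML instead of logging it lets the caller decide what to do with it, such as parsing it or writing it to a file.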

Note: We pass the --no-sandbox argument to puppeteer.launch() to disable the browser's sandbox so that it can start on the Educative platform. If you're running the script on your local machine, this argument is usually unnecessary and can be omitted.
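As a rough illustration of how the launch options might vary between environments, the sketch below builds the object passed to puppeteer.launch(). The helper name buildLaunchOptions and its headful and disableSandbox flags are hypothetical and not part of Puppeteer; only the headless and args options come from Puppeteer itself.

```javascript
// Hypothetical helper: builds the options object for puppeteer.launch().
// `headful` and `disableSandbox` are illustrative flag names, not Puppeteer options.
function buildLaunchOptions({ headful = false, disableSandbox = false } = {}) {
  return {
    // Puppeteer launches headless by default; false opens a visible window
    headless: !headful,
    // Add --no-sandbox only where the environment requires it
    args: disableSandbox ? ['--no-sandbox'] : [],
  };
}

// Usage (assumes puppeteer is installed):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(buildLaunchOptions({ disableSandbox: true }));
```

Keeping the flag behind an explicit setting makes it harder to disable the sandbox by accident when the script runs locally.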

