This guide shows how to use Hyperbrowser to scrape a single page, crawl multiple pages, and extract structured data. It also documents the most important parameters.
You can also see dedicated pages for Scrape, Crawl, and Extract, and try them in the Playground. For session configuration details, see Configuration Parameters. For full schemas, see the API Reference.

Scraping a web page

With just a URL, you can extract page contents in your chosen formats using the /scrape endpoint.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Session Options

All Scraping APIs (scrape, crawl, extract) support session parameters. See Session Parameters for all options.
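
For instance, here is a minimal sketch of passing session parameters on a scrape request. It only uses session options that appear elsewhere in this guide (useStealth, solveCaptchas, acceptCookies), and the same sessionOptions block can be passed to crawl and extract requests.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    // sessionOptions configure the browser session itself, not the scrape output
    sessionOptions: {
      useStealth: true, // harder-to-detect browser fingerprint
      solveCaptchas: true, // attempt to solve CAPTCHAs encountered during the session
      acceptCookies: true, // automatically dismiss cookie banners
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();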

Scrape Options

formats
string[]
default:"[\"markdown\"]"
Output formats to include in the response. One or more of: "html", "links", "markdown", "screenshot".
includeTags
string[]
CSS selectors (tags, classes, IDs) to explicitly include. Only matching elements are returned.
excludeTags
string[]
CSS selectors (tags, classes, IDs) to exclude from the scraped content.
onlyMainContent
boolean
default:"true"
When true, attempts to extract only main content (omits headers/nav/footers).
waitFor
number
default:"0"
Milliseconds to wait after initial load before scraping (useful for dynamic content and CAPTCHA detection when sessionOptions.solveCaptchas is enabled).
timeout
number
default:"30000"
Maximum time (ms) to wait for navigation to complete. Equivalent to page.goto(url, { waitUntil: "load", timeout }).
waitUntil
string
default:"load"
Load condition: "load", "domcontentloaded", or "networkidle".
screenshotOptions
object
Screenshot settings (effective only when formats includes "screenshot"). Properties:
  • fullPage (boolean, default false) — capture full page beyond viewport
  • format ("webp" | "jpeg" | "png", default "webp")
storageState
object
Set the storage state of the page before scraping. Properties:
  • localStorage (object, optional) — Local storage data (key-value pairs where both keys and values must be strings)
  • sessionStorage (object, optional) — Session storage data (key-value pairs where both keys and values must be strings)
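
For instance, here is a minimal sketch combining screenshotOptions and storageState with the formats option described above (the localStorage key below is purely illustrative):

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      // Request markdown plus a full-page WebP screenshot
      formats: ["markdown", "screenshot"],
      screenshotOptions: {
        fullPage: true,
        format: "webp",
      },
      // Seed the page's storage before scraping; keys and values must be strings
      storageState: {
        localStorage: { theme: "dark" }, // illustrative key-value pair
      },
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();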

Example with options

By configuring these options in a scrape request, you control the format and content of the scraped data as well as the scraper's behavior. For example, the request below scrapes a page:
  • In stealth mode
  • Accepting cookies automatically
  • Returning only the main content as HTML
  • Excluding any <span> elements
  • Waiting 2 seconds after the page loads before scraping
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useStealth: true,
      acceptCookies: true,
    },
    scrapeOptions: {
      formats: ["html"],
      onlyMainContent: true,
      excludeTags: ["span"],
      waitFor: 2000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Crawl a site

Instead of scraping a single page, you can collect content across multiple pages using the /crawl endpoint. You can use the same sessionOptions and scrapeOptions as in /scrape, along with additional crawl-specific options below.

Crawl Options

url
string
required
The starting URL for the crawl.
maxPages
number
Maximum number of pages to crawl before stopping (minimum: 1).
followLinks
boolean
default: true
When true, follow links discovered on pages to expand the crawl.
ignoreSitemap
boolean
default:"false"
When true, skip pre-generating URLs from sitemaps at the target origin.
excludePatterns
string[]
Regex or wildcard patterns for URL paths to exclude from the crawl.
includePatterns
string[]
Regex or wildcard patterns for URL paths to include (only matching pages will be crawled).
sessionOptions
object
Session configuration used during the crawl. See Session Parameters.
scrapeOptions
object
Scrape options used during the crawl. See Scrape Options.

Example with options

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://hyperbrowser.ai",
    maxPages: 5,
    includePatterns: ["/blog/*"],
    scrapeOptions: {
      formats: ["markdown"],
      onlyMainContent: true,
      excludeTags: ["span"],
    },
  });
  console.log("Crawl result:", crawlResult);
};

main();
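
The full response schema is in the API Reference. As a rough sketch, assuming the response exposes a data array of per-page results (each carrying the crawled url plus the formats requested in scrapeOptions, such as markdown), the pages could be post-processed inside main right after the crawl call:

  // Hypothetical post-processing for the crawl example above; the field names
  // (data, url, markdown) are assumptions, so check the API Reference for the exact shape.
  for (const page of crawlResult.data ?? []) {
    console.log(page.url, "-", (page.markdown ?? "").length, "characters of markdown");
  }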

Structured extraction

The Extract API fetches data in a well-defined structure from any set of pages. Provide a list of URLs, and Hyperbrowser will collect relevant content (including optional crawling) and return data that fits your schema or prompt.

Extract Options

urls
string[]
required
List of page URLs. To crawl an origin for a URL, append /* (e.g., https://example.com/*) to follow relevant links up to maxLinks.
schema
object
JSON Schema for the desired output.
prompt
string
Instructional prompt describing how to structure the extracted data. If no schema is provided, we will try to generate a schema based on the prompt.
systemPrompt
string
Additional instructions to guide extraction behavior.
maxLinks
number
When crawling any given /* URL, the maximum number of links to follow.
waitFor
number
default:"0"
Milliseconds to wait after page load before extraction (useful for dynamic content and CAPTCHA detection when sessionOptions.solveCaptchas is enabled).
sessionOptions
object
Session configuration used during extraction. See Session Parameters.
You can provide a schema, a prompt, or both; for best results, provide both. The schema should define exactly how you want the extracted data formatted, and the prompt should include any information that helps guide the extraction. If no schema is provided, Hyperbrowser will try to generate one automatically from the prompt.
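
Example with options

For example, here is a sketch that extracts a title and summary from a single page. The schema fields are illustrative, and the extract.startAndWait call mirrors the scrape and crawl examples above:

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com"],
    prompt: "Extract the page title and a one-sentence summary of the page.",
    // JSON Schema describing the desired output shape (fields are illustrative)
    schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        summary: { type: "string" },
      },
      required: ["title", "summary"],
    },
  });
  console.log("Extract result:", extractResult);
};

main();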