This guide shows how to use Hyperbrowser to scrape a single page, crawl multiple pages, and extract structured data. It also documents the most important parameters.
You can also see dedicated pages for Scrape, Crawl, and Extract, and try them in the Playground. For session configuration details, see Configuration Parameters. For full schemas, see the API Reference.

Scraping a web page

With just a URL, you can extract page contents in your chosen formats using the /scrape endpoint.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Session Options

All Scraping APIs (scrape, crawl, extract) support session parameters. See Session Parameters for all options.
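
For instance, here is a minimal sketch of passing session parameters on a scrape request. It only uses session options that appear elsewhere in this guide (useStealth, solveCaptchas, acceptCookies), and the same sessionOptions block can be passed to crawl and extract requests.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    // sessionOptions configure the browser session itself, not the scrape output
    sessionOptions: {
      useStealth: true, // harder-to-detect browser fingerprint
      solveCaptchas: true, // attempt to solve CAPTCHAs encountered during the session
      acceptCookies: true, // automatically dismiss cookie banners
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();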

Scrape Options

formats
string[]
default:"[\"markdown\"]"
Output formats to include in the response. One or more of: "html", "links", "markdown", "screenshot".
includeTags
string[]
CSS selectors (tags, classes, IDs) to explicitly include. Only matching elements are returned.
excludeTags
string[]
CSS selectors (tags, classes, IDs) to exclude from the scraped content.
onlyMainContent
boolean
default:"true"
When true, attempts to extract only main content (omits headers/nav/footers).
waitFor
number
default:"0"
Milliseconds to wait after initial load before scraping (useful for dynamic content and CAPTCHA detection when sessionOptions.solveCaptchas is enabled).
timeout
number
default:"30000"
Maximum time (ms) to wait for navigation to complete. Equivalent to page.goto(url, { waitUntil: "load", timeout }).
waitUntil
string
default:"load"
Load condition: "load", "domcontentloaded", or "networkidle".
screenshotOptions
object
Screenshot settings (effective only when formats includes "screenshot"). Properties:
  • fullPage (boolean, default false) — capture full page beyond viewport
  • format ("webp" | "jpeg" | "png", default "webp")
storageState
object
Set the storage state of the page before scraping. Properties:
  • localStorage (object, optional) — Local storage data (key-value pairs where both keys and values must be strings)
  • sessionStorage (object, optional) — Session storage data (key-value pairs where both keys and values must be strings)
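
For instance, here is a minimal sketch combining screenshotOptions and storageState with the formats option described above (the localStorage key below is purely illustrative):

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      // Request markdown plus a full-page WebP screenshot
      formats: ["markdown", "screenshot"],
      screenshotOptions: {
        fullPage: true,
        format: "webp",
      },
      // Seed the page's storage before scraping; keys and values must be strings
      storageState: {
        localStorage: { theme: "dark" }, // illustrative key-value pair
      },
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();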

Example with options

By configuring these options in a scrape request, you control the format and content of the scraped data as well as the scraper's behavior. For example, the request below scrapes a page:
  • In stealth mode
  • Accepting cookies automatically
  • Returning only the main content as HTML
  • Excluding any <span> elements
  • Waiting 2 seconds after the page loads before scraping
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useStealth: true,
      acceptCookies: true,
    },
    scrapeOptions: {
      formats: ["html"],
      onlyMainContent: true,
      excludeTags: ["span"],
      waitFor: 2000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Crawl a site

Instead of scraping a single page, you can collect content across multiple pages using the /crawl endpoint. You can use the same sessionOptions and scrapeOptions as in /scrape, along with additional crawl-specific options below.

Crawl Options

url
string
required
The starting URL for the crawl.
maxPages
number
Maximum number of pages to crawl before stopping (minimum: 1).
followLinks
boolean
default: true
When true, follow links discovered on pages to expand the crawl.
ignoreSitemap
boolean
default:"false"
When true, skip pre-generating URLs from sitemaps at the target origin.
excludePatterns
string[]
Regex or wildcard patterns for URL paths to exclude from the crawl.
includePatterns
string[]
Regex or wildcard patterns for URL paths to include (only matching pages will be crawled).
sessionOptions
object
Session configuration used during the crawl. See Session Parameters.
scrapeOptions
object
Scrape options used during the crawl. See Scrape Options.

Example with options

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://hyperbrowser.ai",
    maxPages: 5,
    includePatterns: ["/blog/*"],
    scrapeOptions: {
      formats: ["markdown"],
      onlyMainContent: true,
      excludeTags: ["span"],
    },
  });
  console.log("Crawl result:", crawlResult);
};

main();
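
The full response schema is in the API Reference. As a rough sketch, assuming the response exposes a data array of per-page results (each carrying the crawled url plus the formats requested in scrapeOptions, such as markdown), the pages could be post-processed inside main right after the crawl call:

  // Hypothetical post-processing for the crawl example above; the field names
  // (data, url, markdown) are assumptions, so check the API Reference for the exact shape.
  for (const page of crawlResult.data ?? []) {
    console.log(page.url, "-", (page.markdown ?? "").length, "characters of markdown");
  }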

Structured extraction

The Extract API fetches data in a well-defined structure from any set of pages. Provide a list of URLs, and Hyperbrowser will collect relevant content (including optional crawling) and return data that fits your schema or prompt.

Extract Options

urls
string[]
required
List of page URLs. To crawl an origin for a URL, append /* (e.g., https://example.com/*) to follow relevant links up to maxLinks.
schema
object
JSON Schema for the desired output.
prompt
string
Instructional prompt describing how to structure the extracted data. If no schema is provided, we will try to generate a schema based on the prompt.
systemPrompt
string
Additional instructions to guide extraction behavior.
maxLinks
number
When crawling any given /* URL, the maximum number of links to follow.
waitFor
number
default:"0"
Milliseconds to wait after page load before extraction (useful for dynamic content and CAPTCHA detection when sessionOptions.solveCaptchas is enabled).
sessionOptions
object
Session configuration used during extraction. See Session Parameters.
You can provide a schema, a prompt, or both; for best results, provide both. The schema should define exactly how you want the extracted data formatted, and the prompt should include any information that helps guide the extraction. If no schema is provided, Hyperbrowser will try to generate one automatically from the prompt.
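
Example with options

For example, here is a sketch that extracts a title and summary from a single page. The schema fields are illustrative, and the extract.startAndWait call mirrors the scrape and crawl examples above:

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com"],
    prompt: "Extract the page title and a one-sentence summary of the page.",
    // JSON Schema describing the desired output shape (fields are illustrative)
    schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        summary: { type: "string" },
      },
      required: ["title", "summary"],
    },
  });
  console.log("Extract result:", extractResult);
};

main();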