The Crawl API allows you to crawl websites and get data from multiple pages in a single request. Starting from a URL, it can navigate through the site and extract content from linked pages.
Hyperbrowser exposes endpoints for starting a crawl request and for getting its status and results. By default, crawling is asynchronous: you start the job, then check its status until it completes. Our SDKs also provide a single function that handles the whole flow and returns the data once the job is completed.
Installation
npm install @hyperbrowser/sdk dotenv
Usage
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  // Handles both starting and waiting for crawl job response
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    maxPages: 10,
    followLinks: true,
  });
  console.log("Crawl result:", crawlResult);
};
main();
Response
The Start Crawl Job endpoint (POST /crawl) returns a jobId in the response, which you can use to get information about the job in subsequent requests.
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
GET /crawl/{jobId}/status will return the following data:
{
  "status": "completed"
}
GET /crawl/{jobId} will return the following data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "totalCrawledPages": 10,
  "data": [
    {
      "metadata": {
        "title": "Example Page",
        "description": "A sample webpage",
        "url": "https://example.com"
      },
      "markdown": "# Example Page\nThis is content..."
    }
  ]
}
The status of a crawl job can be one of pending, running, completed, or failed. The results are returned as an array of scraped pages in the data field.
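The asynchronous flow (start the job, then check its status until it finishes) can be sketched as a small polling loop. This is a minimal sketch, not the SDK's implementation: `crawlAndWait`, `startJob`, `getStatus`, and `getResult` are hypothetical names, and the three callbacks are assumed to wrap POST /crawl, GET /crawl/{jobId}/status, and GET /crawl/{jobId} respectively. The polling interval and attempt limit are illustrative values.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: starts a crawl job and polls until it reaches a terminal status.
// `api` supplies three callbacks (assumed wrappers around the documented endpoints):
//   startJob(params) -> { jobId }            (POST /crawl)
//   getStatus(jobId) -> { status }           (GET /crawl/{jobId}/status)
//   getResult(jobId) -> full job response    (GET /crawl/{jobId})
async function crawlAndWait(api, params, options = {}) {
  const { intervalMs = 2000, maxAttempts = 150 } = options; // illustrative defaults
  const { jobId } = await api.startJob(params);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status } = await api.getStatus(jobId);
    // "completed" and "failed" are terminal; "pending" and "running" mean keep waiting.
    if (status === "completed" || status === "failed") {
      return api.getResult(jobId);
    }
    await sleep(intervalMs);
  }

  throw new Error(`Crawl job ${jobId} did not finish after ${maxAttempts} status checks`);
}
```

Passing the endpoint wrappers in as callbacks keeps the loop independent of any particular HTTP client, base URL, or authentication scheme, none of which are specified here.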
Each crawled page has its own status of completed or failed and can have its own error field, so check every page rather than relying on the job status alone.
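Because a completed job can still contain failed pages, it can help to separate the two before using the data. The following is a minimal sketch: `splitCrawledPages` is a hypothetical helper, and it assumes each entry in data carries `status` and `error` fields as described above. Reading the URL from `metadata.url` follows the sample response; the fallback to a top-level `url` field is an assumption.

```javascript
// Hypothetical helper: separates successfully crawled pages from failed ones.
function splitCrawledPages(crawlResult) {
  const pages = crawlResult.data ?? [];

  // Treat anything not explicitly marked "failed" as usable.
  const succeeded = pages.filter((page) => page.status !== "failed");

  // Keep only what is needed to report or retry a failure.
  const failed = pages
    .filter((page) => page.status === "failed")
    .map((page) => ({
      // Assumption: the URL lives in metadata.url, or in a top-level url field.
      url: page.metadata?.url ?? page.url ?? null,
      error: page.error ?? "unknown error",
    }));

  return { succeeded, failed };
}
```

A caller might log `failed` and continue with `succeeded`, or retry only the failed URLs.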
Crawl Options
You can configure various options for the crawl job:
- maxPages: Maximum number of pages to crawl (default: 10, max: 100)
- followLinks: Whether to follow links on the crawled pages (default: true)
- ignoreSitemap: Whether to ignore the sitemap (default: false)
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    maxPages: 50,
    followLinks: true,
    ignoreSitemap: false,
  });
  console.log("Crawl result:", crawlResult);
};
main();
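The documented defaults and the maxPages limit can also be checked on the client before a job is started. This is a rough sketch only: `buildCrawlParams` is a hypothetical helper, not part of the SDK, and the lower bound of 1 for maxPages is an assumption (only the default of 10 and the maximum of 100 are stated above).

```javascript
// Defaults and limit taken from the option list above.
const CRAWL_DEFAULTS = { maxPages: 10, followLinks: true, ignoreSitemap: false };
const MAX_PAGES_LIMIT = 100;

// Hypothetical helper: merges overrides onto the documented defaults
// and rejects a maxPages value outside the allowed range.
function buildCrawlParams(url, overrides = {}) {
  const params = { url, ...CRAWL_DEFAULTS, ...overrides };

  // Assumption: maxPages must be a whole number of at least 1.
  const { maxPages } = params;
  if (!Number.isInteger(maxPages) || maxPages < 1 || maxPages > MAX_PAGES_LIMIT) {
    throw new RangeError(
      `maxPages must be an integer between 1 and ${MAX_PAGES_LIMIT}, got ${maxPages}`
    );
  }

  return params;
}
```

The returned object has the same shape as the argument passed to `client.crawl.startAndWait` in the example above, so it could be passed in directly.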
Session Configurations
You can also provide configurations for the session that will be used to execute the crawl job, such as using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the API Reference or Session Parameters.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    maxPages: 10,
    followLinks: true,
    sessionOptions: {
      useProxy: true,
      solveCaptchas: true,
      proxyCountry: "US",
    },
  });
  console.log("Crawl result:", crawlResult);
};
main();
Using a proxy and solving CAPTCHAs will slow down the crawl, so enable them only if necessary.
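One way to pay the proxy and CAPTCHA cost only when needed is to crawl without session options first and retry with them if the first attempt looks blocked. The sketch below is illustrative: `crawlWithFallback` is a hypothetical helper, `crawl` is assumed to be a function with the same shape as `client.crawl.startAndWait`, and the rule for deciding that a crawl was blocked (job failed, or no page succeeded) is an assumption you may want to tune.

```javascript
// Hypothetical helper: tries a plain crawl first, then retries with session options.
// `crawl` is any async function that takes crawl params and returns a job response.
async function crawlWithFallback(crawl, params, sessionOptions) {
  const first = await crawl(params);
  const pages = first.data ?? [];

  // Assumption: the crawl was blocked if the job failed or no page succeeded.
  // Note that every() is also true when no pages were returned at all.
  const looksBlocked =
    first.status === "failed" || pages.every((page) => page.status === "failed");

  if (!looksBlocked) {
    return first;
  }

  // Second attempt with the slower session features enabled.
  return crawl({ ...params, sessionOptions });
}
```

With the SDK client from the example above, this might be called as `crawlWithFallback((p) => client.crawl.startAndWait(p), { url: "https://example.com" }, { useProxy: true, solveCaptchas: true })`.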
Scrape Configurations
You can also provide optional scrape options for the crawl job such as the formats to return, only returning the main content of the page, setting the maximum timeout for navigating to a page, etc.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      formats: ["markdown", "html", "links"],
      onlyMainContent: false,
      timeout: 10000,
    },
  });
  console.log("Crawl result:", crawlResult);
};
main();
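When the links format is requested, the per-page links can be merged into one de-duplicated list, for example to decide what to crawl next. This is a minimal sketch: `collectLinks` is a hypothetical helper, and it assumes each page exposes the requested format under a field of the same name (`links`, holding an array of URL strings), mirroring how `markdown` appears in the sample response.

```javascript
// Hypothetical helper: gathers unique links from every crawled page.
function collectLinks(crawlResult) {
  const unique = new Set();

  for (const page of crawlResult.data ?? []) {
    // Assumption: `links` is an array of URL strings; pages without it are skipped.
    for (const link of page.links ?? []) {
      unique.add(link);
    }
  }

  // Sort for stable, readable output.
  return [...unique].sort();
}
```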
Hyperbrowser’s CAPTCHA solving and proxy usage features require a paid plan.