The Extract API allows you to extract structured data from web pages using AI. You can define a schema and a prompt, and Hyperbrowser will extract the data matching your requirements.
Installation
npm install @hyperbrowser/sdk dotenv
Usage
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com"],
    prompt: "Extract the main heading and description from the page",
    schema: {
      type: "object",
      properties: {
        heading: { type: "string" },
        description: { type: "string" },
      },
      required: ["heading", "description"],
    },
  });
  console.log("Extract result:", extractResult);
};
main();
Response
The Start Extract Job POST /extract endpoint will return a jobId in the response which can be used to get information about the job in subsequent requests.
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
GET /extract/{jobId}/status will return the following data:
{
  "status": "completed"
}
GET /extract/{jobId} will return the following data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "heading": "Example Domain",
    "description": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
  }
}
An extract job's status can be one of: pending, running, completed, or failed.
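If you use the raw REST endpoints instead of the SDK's startAndWait, you need to poll GET /extract/{jobId}/status yourself until the job reaches a terminal state. The sketch below shows the polling loop in isolation; fetchStatus is a stand-in you would implement with an authenticated request to the status endpoint:

```typescript
// Statuses after which polling should stop.
const TERMINAL = new Set(["completed", "failed"]);

type StatusFetcher = (jobId: string) => Promise<string>;

// Poll until the job reaches a terminal state, waiting `intervalMs`
// between checks and giving up after `maxAttempts` checks.
async function pollUntilDone(
  jobId: string,
  fetchStatus: StatusFetcher,
  intervalMs = 2000,
  maxAttempts = 30
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus(jobId);
    if (TERMINAL.has(status)) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} attempts`);
}
```

Once the returned status is completed, a final GET /extract/{jobId} fetches the extracted data.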
To see the full schema, check out the API Reference.
Schema Definition
You can define a JSON schema to specify the structure of the data you want to extract. The schema should follow the JSON Schema specification.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://news.ycombinator.com"],
    prompt: "Extract all article titles and their URLs from the front page",
    schema: {
      type: "object",
      properties: {
        articles: {
          type: "array",
          items: {
            type: "object",
            properties: {
              title: { type: "string" },
              url: { type: "string" },
              score: { type: "number" },
            },
            required: ["title", "url"],
          },
        },
      },
      required: ["articles"],
    },
  });
  console.log("Extract result:", extractResult);
};
main();
For best results, provide both a schema and a prompt. The schema should define exactly how you want the extracted data formatted, and the prompt should include any information that can help guide the extraction. If no schema is provided, we will try to automatically generate one based on the prompt.
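Because the extracted data comes back matching your schema, it can help to narrow it to a concrete type before using it. The guard below is a hand-rolled helper (not part of the SDK) mirroring the articles schema from the example above:

```typescript
interface Article {
  title: string;
  url: string;
  score?: number; // optional, matching the schema's non-required field
}

// Runtime check mirroring the JSON schema: an object with an `articles`
// array whose items each have string `title` and `url` fields.
function isArticles(v: unknown): v is { articles: Article[] } {
  if (typeof v !== "object" || v === null) return false;
  const arr = (v as { articles?: unknown }).articles;
  return (
    Array.isArray(arr) &&
    arr.every(
      (a) =>
        typeof a === "object" &&
        a !== null &&
        typeof (a as Article).title === "string" &&
        typeof (a as Article).url === "string"
    )
  );
}
```

For stricter validation you could instead run the same JSON schema through a validator library, but a guard like this is enough to safely access the fields in TypeScript.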
Session Configurations
You can also provide configurations for the session that will be used to execute the extract job, such as using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the API Reference or Session Parameters.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com"],
    prompt: "Extract the main heading and description",
    schema: {
      type: "object",
      properties: {
        heading: { type: "string" },
        description: { type: "string" },
      },
    },
    sessionOptions: {
      useProxy: true,
      solveCaptchas: true,
      proxyCountry: "US",
    },
  });
  console.log("Extract result:", extractResult);
};
main();
Hyperbrowser’s CAPTCHA solving and proxy usage features require being on a PAID plan.
Using a proxy and solving CAPTCHAs will slow down page scraping in the extract job, so use them only when necessary.