The Extract API allows you to extract structured data from web pages using AI. You can define a schema and a prompt, and Hyperbrowser will extract the data matching your requirements.
Installation
npm install @hyperbrowser/sdk dotenv
Usage
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com"],
    prompt: "Extract the main heading and description from the page",
    schema: {
      type: "object",
      properties: {
        heading: { type: "string" },
        description: { type: "string" },
      },
      required: ["heading", "description"],
    },
  });
  console.log("Extract result:", extractResult);
};
main();
Response
The Start Extract Job POST /extract endpoint will return a jobId in the response which can be used to get information about the job in subsequent requests.
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
GET /extract/{jobId}/status will return the following data:
{
  "status": "completed"
}
GET /extract/{jobId} will return the following data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "heading": "Example Domain",
    "description": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
  }
}
An extract job's status can be one of: pending, running, completed, or failed.
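If you use the raw REST endpoints instead of the SDK's startAndWait, you need to poll GET /extract/{jobId}/status yourself until the job reaches a terminal state. The sketch below shows the polling loop in isolation; fetchStatus is a stand-in you would implement with an authenticated request to the status endpoint:

```typescript
// Statuses after which polling should stop.
const TERMINAL = new Set(["completed", "failed"]);

type StatusFetcher = (jobId: string) => Promise<string>;

// Poll until the job reaches a terminal state, waiting `intervalMs`
// between checks and giving up after `maxAttempts` checks.
async function pollUntilDone(
  jobId: string,
  fetchStatus: StatusFetcher,
  intervalMs = 2000,
  maxAttempts = 30
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus(jobId);
    if (TERMINAL.has(status)) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} attempts`);
}
```

Once the returned status is completed, a final GET /extract/{jobId} fetches the extracted data.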
To see the full schema, check out the API Reference.
Schema Definition
You can define a JSON schema to specify the structure of the data you want to extract. The schema should follow the JSON Schema specification.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://news.ycombinator.com"],
    prompt: "Extract all article titles and their URLs from the front page",
    schema: {
      type: "object",
      properties: {
        articles: {
          type: "array",
          items: {
            type: "object",
            properties: {
              title: { type: "string" },
              url: { type: "string" },
              score: { type: "number" },
            },
            required: ["title", "url"],
          },
        },
      },
      required: ["articles"],
    },
  });
  console.log("Extract result:", extractResult);
};
main();
For best results, provide both a schema and a prompt. The schema should define exactly how you want the extracted data formatted, and the prompt should include any information that can help guide the extraction. If no schema is provided, we will try to automatically generate one based on the prompt.
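Because the extracted data comes back matching your schema, it can help to narrow it to a concrete type before using it. The guard below is a hand-rolled helper (not part of the SDK) mirroring the articles schema from the example above:

```typescript
interface Article {
  title: string;
  url: string;
  score?: number; // optional, matching the schema's non-required field
}

// Runtime check mirroring the JSON schema: an object with an `articles`
// array whose items each have string `title` and `url` fields.
function isArticles(v: unknown): v is { articles: Article[] } {
  if (typeof v !== "object" || v === null) return false;
  const arr = (v as { articles?: unknown }).articles;
  return (
    Array.isArray(arr) &&
    arr.every(
      (a) =>
        typeof a === "object" &&
        a !== null &&
        typeof (a as Article).title === "string" &&
        typeof (a as Article).url === "string"
    )
  );
}
```

For stricter validation you could instead run the same JSON schema through a validator library, but a guard like this is enough to safely access the fields in TypeScript.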
Session Configurations
You can also provide configurations for the session that will be used to execute the extract job, such as using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the API Reference or Session Parameters.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com"],
    prompt: "Extract the main heading and description",
    schema: {
      type: "object",
      properties: {
        heading: { type: "string" },
        description: { type: "string" },
      },
    },
    sessionOptions: {
      useProxy: true,
      solveCaptchas: true,
      proxyCountry: "US",
    },
  });
  console.log("Extract result:", extractResult);
};
main();
Hyperbrowser’s CAPTCHA solving and proxy usage features require being on a PAID plan.
Using a proxy and solving CAPTCHAs will slow down page scraping in the extract job, so use them only when necessary.