Web Scraping with Hyperbrowser
Hyperbrowser provides two endpoints for web scraping: scrape for single pages and crawl for multi-page scraping. Both handle browser automation and JavaScript rendering, and convert the extracted data to markdown automatically.
This guide shows how to use these endpoints to extract structured data from any website. The data is returned in markdown format, which is ideal for LLMs since it preserves semantic structure while being more readable than HTML.
When to Use What
- Use the Scrape Endpoint for:
  - Quick extraction of data from a single URL.
  - Testing and prototyping your data extraction logic.
  - Gleaning metadata from a handful of pages.
- Use the Crawl Endpoint for:
  - Larger-scale data gathering from multiple pages.
  - Automating site-wide audits, content indexing, or SEO analysis.
  - Building datasets by crawling entire sections of a website.
Key Concepts and Parameters
For Scrape:
- StartScrapeJobParams:
```typescript
export interface StartScrapeJobParams {
  url: string; // The URL to scrape
}
```
- ScrapeJobResponse:
  Once your scrape job completes, you'll receive a response that includes:
  - status: The current job status (e.g., pending, running, completed, failed).
  - data: The extracted content and metadata if successful.
  - error: An error message if the scraping failed.
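To make that shape concrete, here is a rough TypeScript sketch of a scrape job response, based only on the fields described above and the way the result is used later in this guide; the actual types shipped with @hyperbrowser/sdk may differ.

```typescript
// Illustrative shape only, based on the fields described above;
// the actual types in @hyperbrowser/sdk may differ.
interface ScrapeJobResponseSketch {
  status: "pending" | "running" | "completed" | "failed";
  data?: {
    metadata: Record<string, string>; // page metadata, e.g. title and description
    markdown: string;                 // page content converted to markdown
  };
  error?: string; // present if the scrape failed
}
```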
For Crawl:
- StartCrawlJobParams:
```typescript
export interface StartCrawlJobParams {
  url: string;               // The starting URL
  maxPages: number;          // Limit how deep or how wide the crawl goes
  followLinks: boolean;      // Whether to follow internal links discovered on the pages
  excludePatterns: string[]; // URL patterns to exclude from the crawl
  includePatterns: string[]; // URL patterns to restrict the crawl to
}
```

These parameters let you fine-tune the crawl to target precisely what you want. For example, you might set followLinks to true but exclude certain URL patterns to skip irrelevant sections of a site; the crawl example later in this guide does exactly that.
- CrawlJobResponse:
  As the crawl progresses, you can query the endpoint to get:
  - status: Current crawl status (e.g., pending, running, completed, failed).
  - data: A batch of crawled pages with their content and metadata.
  - totalPageBatches and currentPageBatch: Let you iterate through the result set batch by batch.
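And the crawl counterpart, again as an illustrative sketch rather than the SDK's actual type definitions:

```typescript
// Illustrative shape only; field names mirror the scrape sketch above and
// the fields described in this list. The actual SDK types may differ.
interface CrawledPageSketch {
  metadata: Record<string, string>; // page metadata, e.g. title and description
  markdown: string;                 // page content converted to markdown
}

interface CrawlJobResponseSketch {
  status: "pending" | "running" | "completed" | "failed";
  data?: CrawledPageSketch[]; // the current batch of crawled pages
  totalPageBatches: number;   // how many batches the result set contains
  currentPageBatch: number;   // which batch this response holds
  error?: string;             // present if the crawl failed
}
```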
How to Use the Endpoints
Next, let's see how to use these endpoints in your own project. Hyperbrowser provides SDKs for Node.js and Python, so you can get started in minutes. Let's set up the Node.js SDK in our project.
If you haven't already, you can sign up for a free account at app.hyperbrowser.ai and get your API key.
1. Install the SDK
```bash
npm install @hyperbrowser/sdk
```

or

```bash
yarn add @hyperbrowser/sdk
```
2. Starting a Scrape Job
```typescript
import { HyperbrowserClient } from "@hyperbrowser/sdk";

// Initialize the client
const client = new HyperbrowserClient({ apiKey: "YOUR_API_KEY" });

// Start a scrape job for a single page
const scrapeResponse = await client.startScrapeJob({
  url: "https://example.com",
  // useProxy: true, // uncomment to use a proxy, only available for paid plans
  // solveCaptchas: true, // uncomment to solve captchas, only available for paid plans
});
console.log("Scrape Job Started:", scrapeResponse.jobId);

// Poll for the scrape job result
let scrapeJobResult;
while (true) {
  scrapeJobResult = await client.getScrapeJob(scrapeResponse.jobId);
  if (scrapeJobResult.status === "completed") {
    break;
  } else if (scrapeJobResult.error) {
    console.error("Scrape Job Error:", scrapeJobResult.error);
    break;
  } else {
    // Job still in progress, wait 5 seconds before checking again
    console.log("Scrape still in progress. Checking again in 5 seconds...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}

console.log("Scrape Metadata:", scrapeJobResult.data.metadata);
console.log("Page Content (Markdown):", scrapeJobResult.data.markdown);
```
3. Starting a Crawl Job
```typescript
// Start a crawl job with parameters
const crawlParams: StartCrawlJobParams = {
  url: "https://example.com",
  maxPages: 10,
  followLinks: true,
  excludePatterns: [".*login.*"], // Exclude pages with 'login' in the URL
  includePatterns: ["https://example.com/blog.*"], // Only include blog pages
  // useProxy: true, // uncomment to use a proxy, only available for paid plans
  // solveCaptchas: true, // uncomment to solve captchas, only available for paid plans
};
const crawlResponse = await client.startCrawlJob(crawlParams);
console.log("Crawl Job Started:", crawlResponse.jobId);

// Retrieve results page by page
let pageIndex = 1;
while (true) {
  const crawlJobResult = await client.getCrawlJob(crawlResponse.jobId, {
    page: pageIndex,
    // batchSize: 5, // default is 5
  });

  if (crawlJobResult.status === "completed" && crawlJobResult.data) {
    // Process each batch of crawled pages
    console.log(`Crawled Page Batch #${pageIndex}`, crawlJobResult.data);
    if (pageIndex >= crawlJobResult.totalPageBatches) {
      // No more pages
      break;
    }
    pageIndex++;
  } else if (crawlJobResult.error) {
    console.error("Crawl Job Error:", crawlJobResult.error);
    break;
  } else {
    // If not complete yet, wait before checking back
    console.log("Crawl still in progress. Checking again in a moment...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
```
Error Handling and Status Checks
The scrape and crawl jobs are asynchronous. They may take time to complete depending on the website's speed, size, and other factors. Because of this, you need to:
- Check the Status: Always check the status field before assuming the job is done.
- Handle Errors Gracefully: If error is present, log it or take the appropriate action.
- Retry Strategies: Use a polling mechanism to check the status of the job until it's completed; a reusable sketch of this pattern follows below.
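To keep this logic in one place, you could wrap the polling pattern from the examples above in a small helper. This is a minimal sketch under stated assumptions: the pollJob name, the attempt limit, and the PollableJob shape are invented for illustration and are not part of the Hyperbrowser SDK.

```typescript
// Minimal polling helper (illustrative sketch, not part of the SDK).
// `fetchJob` is any function that returns the latest job state,
// e.g. () => client.getScrapeJob(jobId).
interface PollableJob {
  status: string;
  error?: string;
}

async function pollJob<T extends PollableJob>(
  fetchJob: () => Promise<T>,
  intervalMs = 5000,
  maxAttempts = 60
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.status === "completed") return job; // done
    if (job.error || job.status === "failed") {
      throw new Error(job.error ?? "Job failed"); // surface failures to the caller
    }
    // Still pending/running: wait before the next status check
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Job did not complete within the polling window");
}

// Example usage, assuming a client and a started job as in the snippets above:
// const result = await pollJob(() => client.getScrapeJob(scrapeResponse.jobId));
```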
You can find the full documentation for any of the SDKs and all our endpoints in our docs.
Hyperbrowser makes web scraping simple. The scrape and crawl endpoints handle all the complexity of extracting data from websites, whether you need content from one page or many.
Start using these endpoints today to focus on what matters: working with your data.