Web Scraping with Hyperbrowser

Hyperbrowser provides two endpoints for web scraping: scrape for single pages and crawl for multi-page scraping. Both handle browser automation and JavaScript rendering, and automatically convert the extracted data to markdown.

This guide shows how to use these endpoints to extract structured data from any website. The data is returned in markdown format, which is ideal for LLMs since it preserves semantic structure while being more readable than HTML.

When to Use What

Use scrape when you need the content of a single, known URL. Use crawl when you need content from many pages on the same site, such as a documentation portal or a blog, and want Hyperbrowser to follow links for you.

Key Concepts and Parameters

For Scrape:

export interface StartScrapeJobParams {
  url: string; // The URL to scrape
}
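
For example, a minimal params object for a scrape job only needs the target URL (the URL below is a placeholder):

const scrapeParams: StartScrapeJobParams = {
  url: "https://example.com", // the single page to scrape
};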

For Crawl:

export interface StartCrawlJobParams {
  url: string;               // The starting URL
  maxPages: number;          // Limit how deep or how wide the crawl goes
  followLinks: boolean;      // Whether to follow internal links discovered on the pages
  excludePatterns: string[]; // URL patterns to exclude from the crawl
  includePatterns: string[]; // URL patterns to restrict the crawl to
}

These parameters let you fine-tune the crawl to target precisely what you want. For example, you might set followLinks to true but exclude certain URL patterns to skip irrelevant sections of a site.
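
As a sketch of that idea (the domain and patterns below are placeholders, and the assumption that an empty includePatterns list applies no include filter is mine, not from the docs), a crawl that follows links but skips login and careers pages might look like this:

const focusedCrawlParams: StartCrawlJobParams = {
  url: "https://example.com",
  maxPages: 25,                                  // stop after 25 pages
  followLinks: true,                             // follow internal links...
  excludePatterns: [".*login.*", ".*careers.*"], // ...but skip irrelevant sections
  includePatterns: [],                           // assumed: empty list means no include filter
};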

How to Use the Endpoints

Next, let's see how to use these endpoints in your own project. Hyperbrowser provides SDKs for Node.js and Python, so you can get started in minutes. Let's set up the Node.js SDK in our project.

If you haven't already, you can sign up for a free account at app.hyperbrowser.ai and get your API key.
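
Once you have a key, one option is to expose it to the examples below through an environment variable instead of hardcoding it. A minimal sketch, assuming you set a variable named HYPERBROWSER_API_KEY yourself (the name is just a convention, not an SDK requirement):

// Read the API key from the environment rather than committing it to source control.
// HYPERBROWSER_API_KEY is an example variable name you would set yourself.
const apiKey = process.env.HYPERBROWSER_API_KEY;
if (!apiKey) {
  throw new Error("Set HYPERBROWSER_API_KEY before running the examples below");
}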

1. Install the SDK

npm install @hyperbrowser/sdk

or

yarn add @hyperbrowser/sdk

2. Start a Scrape Job

// Assumes HyperbrowserClient is exported from the @hyperbrowser/sdk package
import { HyperbrowserClient } from "@hyperbrowser/sdk";

// Initialize the client
const client = new HyperbrowserClient({ apiKey: "YOUR_API_KEY" });

// Start a scrape job for a single page
const scrapeResponse = await client.startScrapeJob({
  url: "https://example.com",
  // useProxy: true,      // uncomment to use a proxy, only available for paid plans
  // solveCaptchas: true, // uncomment to solve captchas, only available for paid plans
});
console.log("Scrape Job Started:", scrapeResponse.jobId);

// Poll for the scrape job result
let scrapeJobResult;
while (true) {
  scrapeJobResult = await client.getScrapeJob(scrapeResponse.jobId);
  if (scrapeJobResult.status === "completed") {
    break;
  } else if (scrapeJobResult.error) {
    console.error("Scrape Job Error:", scrapeJobResult.error);
    break;
  } else {
    // Job still in progress, wait 5 seconds before checking again
    console.log("Scrape still in progress. Checking again in 5 seconds...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}

// Only read the result if the job actually produced data
if (scrapeJobResult.data) {
  console.log("Scrape Metadata:", scrapeJobResult.data.metadata);
  console.log("Page Content (Markdown):", scrapeJobResult.data.markdown);
}
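
Once the job completes, the markdown is a plain string, so you can persist it or pass it straight to an LLM. A small sketch using Node's built-in fs module (the output filename is arbitrary):

import { writeFile } from "node:fs/promises";

// Save the scraped markdown to disk for later use
if (scrapeJobResult?.data?.markdown) {
  await writeFile("example-com.md", scrapeJobResult.data.markdown, "utf8");
  console.log("Saved markdown to example-com.md");
}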

3. Start a Crawl Job

// Start a crawl job with parameters
const crawlParams: StartCrawlJobParams = {
  url: "https://example.com",
  maxPages: 10,
  followLinks: true,
  excludePatterns: [".*login.*"],                  // Exclude pages with 'login' in the URL
  includePatterns: ["https://example.com/blog.*"], // Only include blog pages
  // useProxy: true,      // uncomment to use a proxy, only available for paid plans
  // solveCaptchas: true, // uncomment to solve captchas, only available for paid plans
};

const crawlResponse = await client.startCrawlJob(crawlParams);
console.log("Crawl Job Started:", crawlResponse.jobId);

// Retrieve results page by page
let pageIndex = 1;
while (true) {
  const crawlJobResult = await client.getCrawlJob(crawlResponse.jobId, {
    page: pageIndex,
    // batchSize: 5, // default is 5
  });

  if (crawlJobResult.status === "completed" && crawlJobResult.data) {
    // Process each batch of crawled pages
    console.log(`Crawled Page Batch #${pageIndex}`, crawlJobResult.data);
    if (pageIndex >= crawlJobResult.totalPageBatches) {
      // No more pages
      break;
    }
    pageIndex++;
  } else if (crawlJobResult.error) {
    console.error("Crawl Job Error:", crawlJobResult.error);
    break;
  } else {
    // If not complete yet, you might wait or check back later
    console.log("Crawl still in progress. Checking again in a moment...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
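
If you would rather post-process everything in one place, you can collect the batches as you page through them instead of only logging each one. A minimal variation on the loop above, using the same SDK calls (no assumptions are made here about what each batch contains beyond the raw data field):

// Collect every result batch into a single array for post-processing
const batches: unknown[] = [];
let batchIndex = 1;
while (true) {
  const batch = await client.getCrawlJob(crawlResponse.jobId, { page: batchIndex });
  if (batch.status === "completed" && batch.data) {
    batches.push(batch.data);
    if (batchIndex >= batch.totalPageBatches) break;
    batchIndex++;
  } else if (batch.error) {
    console.error("Crawl Job Error:", batch.error);
    break;
  } else {
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait before polling again
  }
}
console.log(`Collected ${batches.length} batches of crawled pages`);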

Error Handling and Status Checks

The scrape and crawl jobs are asynchronous, and they may take time to complete depending on the website's speed, size, and other factors. Because of this, you need to poll the job status until it reports completed, check the error field on every poll, and wait between checks so you don't overwhelm the API.
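
As a sketch of that pattern (the helper below is illustrative, not part of the SDK), you could wrap the polling logic in a small utility with a timeout and a gentle backoff:

// Illustrative helper, not part of the Hyperbrowser SDK: repeatedly call a
// job-status function until it completes, fails, or times out.
async function pollUntilComplete<T extends { status: string; error?: unknown }>(
  check: () => Promise<T>,
  { timeoutMs = 120_000, initialDelayMs = 2_000 } = {}
): Promise<T> {
  const start = Date.now();
  let delay = initialDelayMs;
  while (Date.now() - start < timeoutMs) {
    const result = await check();
    if (result.status === "completed") return result;
    if (result.error) throw new Error(`Job failed: ${String(result.error)}`);
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay + 1_000, 10_000); // back off gradually, capped at 10 seconds
  }
  throw new Error("Timed out waiting for the job to complete");
}

// Usage with the scrape job from step 2:
// const result = await pollUntilComplete(() => client.getScrapeJob(scrapeResponse.jobId));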

You can find the full documentation for any of the SDKs and all our endpoints in our docs.

Hyperbrowser makes web scraping simple. The scrape and crawl endpoints handle all the complexity of extracting data from websites, whether you need content from one page or many.

Start using these endpoints today to focus on what matters: working with your data.
