Web Scraping with Hyperbrowser
Hyperbrowser provides two endpoints for web scraping: scrape for single pages and crawl for multi-page scraping. Both handle browser automation and JavaScript rendering, and convert the extracted data to markdown automatically.
This guide shows how to use these endpoints to extract structured data from any website. The data is returned in markdown format, which is ideal for LLMs since it preserves semantic structure while being more readable than HTML.
When to Use What
- Use the Scrape Endpoint for:
  - Quick extraction of data from a single URL.
  - Testing and prototyping your data extraction logic.
  - Gleaning metadata from a handful of pages.
- Use the Crawl Endpoint for:
  - Larger-scale data gathering from multiple pages.
  - Automating site-wide audits, content indexing, or SEO analysis.
  - Building datasets by crawling entire sections of a website.
Key Concepts and Parameters
For Scrape:
- StartScrapeJobParams:
```typescript
export interface StartScrapeJobParams {
  url: string; // The URL to scrape
}
```
- ScrapeJobResponse:
  Once your scrape job completes, you'll receive a response that includes:
  - status: The current job status (e.g., pending, running, completed, failed).
  - data: The extracted content and metadata if successful.
  - error: An error message if the scraping failed.
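To make that shape concrete, here is a rough TypeScript sketch of a scrape job response, based only on the fields described above and the way the result is used later in this guide; the actual types shipped with @hyperbrowser/sdk may differ.

```typescript
// Illustrative shape only, based on the fields described above;
// the actual types in @hyperbrowser/sdk may differ.
interface ScrapeJobResponseSketch {
  status: "pending" | "running" | "completed" | "failed";
  data?: {
    metadata: Record<string, string>; // page metadata, e.g. title and description
    markdown: string;                 // page content converted to markdown
  };
  error?: string; // present if the scrape failed
}
```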
For Crawl:
- StartCrawlJobParams:
```typescript
export interface StartCrawlJobParams {
  url: string;               // The starting URL
  maxPages: number;          // Limit how deep or how wide the crawl goes
  followLinks: boolean;      // Whether to follow internal links discovered on the pages
  excludePatterns: string[]; // URL patterns to exclude from the crawl
  includePatterns: string[]; // URL patterns to restrict the crawl to
}
```

These parameters let you fine-tune the crawl to target precisely what you want. For example, you might set followLinks to true but exclude certain URL patterns to skip irrelevant sections of a site; the crawl example later in this guide does exactly that.
- CrawlJobResponse:
  As the crawl progresses, you can query the endpoint to get:
  - status: Current crawl status (e.g., pending, running, completed, failed).
  - data: A batch of crawled pages with their content and metadata.
  - totalPageBatches and currentPageBatch: Let you iterate through the result set batch by batch.
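And the crawl counterpart, again as an illustrative sketch rather than the SDK's actual type definitions:

```typescript
// Illustrative shape only; field names mirror the scrape sketch above and
// the fields described in this list. The actual SDK types may differ.
interface CrawledPageSketch {
  metadata: Record<string, string>; // page metadata, e.g. title and description
  markdown: string;                 // page content converted to markdown
}

interface CrawlJobResponseSketch {
  status: "pending" | "running" | "completed" | "failed";
  data?: CrawledPageSketch[]; // the current batch of crawled pages
  totalPageBatches: number;   // how many batches the result set contains
  currentPageBatch: number;   // which batch this response holds
  error?: string;             // present if the crawl failed
}
```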
How to Use the Endpoints
Next, let's see how to use these endpoints in your own project. Hyperbrowser provides SDKs for Node.js and Python, so you can get started in minutes. Let's set up the Node.js SDK in our project.
If you haven't already, you can sign up for a free account at app.hyperbrowser.ai and get your API key.
1. Install the SDK
```bash
npm install @hyperbrowser/sdk
```

or

```bash
yarn add @hyperbrowser/sdk
```
2. Starting a Scrape Job
```typescript
import { HyperbrowserClient } from "@hyperbrowser/sdk";

// Initialize the client
const client = new HyperbrowserClient({ apiKey: "YOUR_API_KEY" });

// Start a scrape job for a single page
const scrapeResponse = await client.startScrapeJob({
  url: "https://example.com",
  // useProxy: true, // uncomment to use a proxy, only available for paid plans
  // solveCaptchas: true, // uncomment to solve captchas, only available for paid plans
});
console.log("Scrape Job Started:", scrapeResponse.jobId);

// Poll for the scrape job result
let scrapeJobResult;
while (true) {
  scrapeJobResult = await client.getScrapeJob(scrapeResponse.jobId);
  if (scrapeJobResult.status === "completed") {
    break;
  } else if (scrapeJobResult.error) {
    console.error("Scrape Job Error:", scrapeJobResult.error);
    break;
  } else {
    // Job still in progress, wait 5 seconds before checking again
    console.log("Scrape still in progress. Checking again in 5 seconds...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}

console.log("Scrape Metadata:", scrapeJobResult.data.metadata);
console.log("Page Content (Markdown):", scrapeJobResult.data.markdown);
```
3. Starting a Crawl Job
```typescript
// Start a crawl job with parameters
const crawlParams: StartCrawlJobParams = {
  url: "https://example.com",
  maxPages: 10,
  followLinks: true,
  excludePatterns: [".*login.*"], // Exclude pages with 'login' in the URL
  includePatterns: ["https://example.com/blog.*"], // Only include blog pages
  // useProxy: true, // uncomment to use a proxy, only available for paid plans
  // solveCaptchas: true, // uncomment to solve captchas, only available for paid plans
};
const crawlResponse = await client.startCrawlJob(crawlParams);
console.log("Crawl Job Started:", crawlResponse.jobId);

// Retrieve results page by page
let pageIndex = 1;
while (true) {
  const crawlJobResult = await client.getCrawlJob(crawlResponse.jobId, {
    page: pageIndex,
    // batchSize: 5, // default is 5
  });

  if (crawlJobResult.status === "completed" && crawlJobResult.data) {
    // Process each batch of crawled pages
    console.log(`Crawled Page Batch #${pageIndex}`, crawlJobResult.data);
    if (pageIndex >= crawlJobResult.totalPageBatches) {
      // No more pages
      break;
    }
    pageIndex++;
  } else if (crawlJobResult.error) {
    console.error("Crawl Job Error:", crawlJobResult.error);
    break;
  } else {
    // If not complete yet, wait before checking back
    console.log("Crawl still in progress. Checking again in a moment...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
```
Error Handling and Status Checks
The scrape and crawl jobs are asynchronous. They may take time to complete depending on the website's speed, size, and other factors. Because of this, you need to:
- Check the Status: Always check the status field before assuming the job is done.
- Handle Errors Gracefully: If error is present, log it or take the appropriate action.
- Retry Strategies: Use a polling mechanism to check the status of the job until it's completed; a reusable sketch of this pattern follows below.
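To keep this logic in one place, you could wrap the polling pattern from the examples above in a small helper. This is a minimal sketch under stated assumptions: the pollJob name, the attempt limit, and the PollableJob shape are invented for illustration and are not part of the Hyperbrowser SDK.

```typescript
// Minimal polling helper (illustrative sketch, not part of the SDK).
// `fetchJob` is any function that returns the latest job state,
// e.g. () => client.getScrapeJob(jobId).
interface PollableJob {
  status: string;
  error?: string;
}

async function pollJob<T extends PollableJob>(
  fetchJob: () => Promise<T>,
  intervalMs = 5000,
  maxAttempts = 60
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.status === "completed") return job; // done
    if (job.error || job.status === "failed") {
      throw new Error(job.error ?? "Job failed"); // surface failures to the caller
    }
    // Still pending/running: wait before the next status check
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Job did not complete within the polling window");
}

// Example usage, assuming a client and a started job as in the snippets above:
// const result = await pollJob(() => client.getScrapeJob(scrapeResponse.jobId));
```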
You can find the full documentation for any of the SDKs and all our endpoints in our docs.
Hyperbrowser makes web scraping simple. The scrape and crawl endpoints handle all the complexity of extracting data from websites, whether you need content from one page or many.
Start using these endpoints today to focus on what matters: working with your data.