How to Scrape a Shopify Store Using Hyperbrowser and OpenAI

This guide shows you how to automatically collect product details from Shopify stores using two tools: Hyperbrowser, which crawls the site and returns LLM-ready Markdown, and OpenAI, which extracts the structured product data from that Markdown.
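At a glance, the workflow boils down to two commands, which the rest of this guide sets up (the store URL below is a placeholder):

    yarn start scrape https://your-target-shopify-store.com
    yarn start extract <jobId>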

Why Hyperbrowser?

Most web scrapers are hard to set up and break when websites change. Hyperbrowser makes it simple: it handles the scraping and crawling for you and gives you clean data back. You can start scraping right away, get the site's content in Markdown format, and easily connect it to AI tools like GPT for processing.

Prerequisites

Before getting started, you’ll need:

  1. Node.js v16 or higher: Our example is a Node.js application.
  2. Hyperbrowser API Key: Sign up and obtain your API key from Hyperbrowser.
  3. OpenAI API Key: Get this from your OpenAI account settings.
  4. A Target Shopify URL: The public Shopify store you want to scrape.

Make sure you have these environment variables stored securely in a .env file. For example:

HYPERBROWSER_API_KEY=your_hyperbrowser_api_key
OPENAI_API_KEY=your_openai_api_key

Step-by-Step Guide

  1. Project Setup
    Clone the example repository:

    git clone https://github.com/hyperbrowserai/shopify-scraper.git
    cd shopify-scraper
    yarn
  2. The Code
    Below is the code you can use. This script does two things:

    • scrape <url>: Starts a crawl job on the given Shopify URL.
    • extract <jobId>: Extracts product information from the completed crawl job.
import { config } from "dotenv";
import Hyperbrowser from "@hyperbrowser/sdk";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

config();

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const HYPERBROWSER_API_KEY = process.env.HYPERBROWSER_API_KEY || "";
const client = new Hyperbrowser({
  apiKey: HYPERBROWSER_API_KEY,
});

// Schema the extracted data must conform to.
const Product = z.object({
  name: z.string(),
  price: z.number(),
  description: z.string(),
  image: z.string(),
});
const ProductSchema = z.object({ products: z.array(Product) });

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Start a crawl job for the given URL and poll until it finishes.
const scrapeShopifySite = async (url: string) => {
  const crawlJob = await client.startCrawlJob({
    url: url,
    maxPages: 1000,
    followLinks: false,
    excludePatterns: [],
    includePatterns: [],
    useProxy: false,
    solveCaptchas: false,
  });
  console.log("Crawl job started: ", crawlJob.jobId);

  let completed = false;
  while (!completed) {
    const job = await client.getCrawlJob(crawlJob.jobId);
    if (job.status === "completed") {
      completed = true;
      console.log("Crawl job completed: ", crawlJob.jobId);
    } else if (job.status === "failed") {
      console.error("Crawl job failed: ", job.error);
      completed = true;
    } else {
      console.log("Crawl job is still running: ", job.status);
      await sleep(5_000); // poll every 5 seconds
    }
  }
};

// Fetch the crawled pages in batches, concatenate their Markdown,
// and have OpenAI extract structured product data from it.
const extractProductData = async (jobId: string) => {
  let scrapedMarkdown = "";
  let pageIndex = 1;
  while (true) {
    const crawlJobResult = await client.getCrawlJob(jobId, {
      page: pageIndex,
      batchSize: 10,
    });
    const pages = crawlJobResult.data;
    if (pages) {
      for (const page of pages) {
        const pageData = page.markdown;
        if (pageData) {
          scrapedMarkdown += pageData;
        }
      }
    }
    // Advance (or stop) regardless of whether this batch had data,
    // so an empty batch can't loop forever.
    if (pageIndex >= crawlJobResult.totalPageBatches) {
      break;
    }
    pageIndex++;
  }

  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are an expert data extractor. Your task is to extract
product information from the provided scraped content. For each product,
extract:
- name: the product name
- price: the price as a number
- description: a brief description of the product
- image: the product image URL
Provide the extracted data as a JSON object. Parse the Markdown content
carefully to identify and categorize the product details accurately.`,
      },
      { role: "user", content: scrapedMarkdown },
    ],
    response_format: zodResponseFormat(ProductSchema, "product"),
  });

  const productInfo = completion.choices[0].message.parsed;
  console.log("Products:", productInfo);
};

// Minimal CLI dispatch so `yarn start scrape <url>` and
// `yarn start extract <jobId>` map to the functions above.
const main = async () => {
  const [command, arg] = process.argv.slice(2);
  if (command === "scrape" && arg) {
    await scrapeShopifySite(arg);
  } else if (command === "extract" && arg) {
    await extractProductData(arg);
  } else {
    console.error("Usage: yarn start scrape <url> | yarn start extract <jobId>");
    process.exit(1);
  }
};

main();

Let’s break down what’s happening inside the code:

  • scrapeShopifySite starts a crawl job with client.startCrawlJob, then polls getCrawlJob every 5 seconds until the job reports completed or failed, logging the jobId along the way.
  • extractProductData pages through the crawl results in batches of 10 and concatenates each page’s Markdown into one string.
  • That combined Markdown is sent to gpt-4o-mini with zodResponseFormat(ProductSchema, "product"), which constrains the response to the Product schema (name, price, description, image), so the parsed result comes back as typed JSON instead of free text.
  • A small CLI dispatcher at the bottom maps the scrape and extract commands to these two functions.

Running the Script

  1. Set Environment Variables: First, create a .env file at the root of your project:
cp .env.example .env

Update .env with your keys:

HYPERBROWSER_API_KEY=your_hyperbrowser_api_key
OPENAI_API_KEY=your_openai_api_key
  2. Build and Run:

    yarn build
    yarn start scrape https://your-target-shopify-store.com

    This will initiate a crawl job. Note the jobId printed when the job completes (see the sample console output after this list). Then, to extract and process the data:

    yarn start extract <jobId>
  3. Review the Output:
    The script will print the extracted product data as a JSON object. You can then pipe this JSON into another script, save it to a database, or integrate it into your workflow (a minimal persistence sketch follows this list).
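For reference, the scrape command’s own console.log statements produce output along these lines while it polls (the jobId and status values are placeholders):

    Crawl job started:  <jobId>
    Crawl job is still running:  <status>
    Crawl job completed:  <jobId>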

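If you’d rather persist the results than read them off the console, one option (a hypothetical tweak, not part of the repo) is to write the parsed object to disk at the end of extractProductData:

    import { writeFileSync } from "fs";

    // Hypothetical addition inside extractProductData: write the parsed
    // products to a file instead of only logging them.
    writeFileSync("products.json", JSON.stringify(productInfo, null, 2));
    console.log("Wrote products.json");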

Example Output

A successful run might output something like:

{ "products": [ { "name": "Classic White T-Shirt", "price": 19.99, "description": "A soft cotton t-shirt with a classic fit.", "image": "https://cdn.shopify.com/s/files/…" }, { "name": "Black Hoodie", "price": 49.99, "description": "A warm fleece hoodie perfect for cooler days.", "image": "https://cdn.shopify.com/s/files/…" } ] }

You can find the full code in the Hyperbrowser GitHub repository.
