How to Scrape a Shopify Store Using Hyperbrowser and OpenAI
This guide shows you how to automatically collect product details from Shopify stores using two tools: Hyperbrowser, which crawls the site and returns LLM-ready markdown, and OpenAI, which extracts the desired data from that markdown.
Why Hyperbrowser?
Most web scrapers are hard to set up and break when websites change. Hyperbrowser makes this simple: it handles the scraping and crawling for you and gives you clean data back. You can start scraping right away, get the site's content in markdown format, and easily connect it to AI tools like GPT for processing.
Prerequisites
Before getting started, you’ll need:
- Node.js v16 or higher: Our example is a Node.js application.
- Hyperbrowser API Key: Sign up and obtain your API key from Hyperbrowser.
- OpenAI API Key: Get this from your OpenAI account settings.
- A Target Shopify URL: The public Shopify store you want to scrape.
Make sure you have these environment variables stored securely in a `.env` file. For example:

```
HYPERBROWSER_API_KEY=your_hyperbrowser_api_key
OPENAI_API_KEY=your_openai_api_key
```
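As an optional sanity check (not part of the example repo), you can fail fast if either key is missing before starting a job:

```typescript
import { config } from "dotenv";

config();

// Optional sanity check (not in the example repo): throw immediately if an
// API key was not loaded from .env or the environment.
for (const key of ["HYPERBROWSER_API_KEY", "OPENAI_API_KEY"]) {
  if (!process.env[key]) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
}
```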
Step-by-Step Guide
- **Project Setup**

Clone the example repository:

```bash
git clone https://github.com/hyperbrowserai/shopify-scraper.git
cd shopify-scraper
yarn
```
- **Code**

Below is the code you can use. This script does two things:

- `scrape <url>`: starts a crawl job on the given Shopify URL.
- `extract <jobId>`: extracts product information from the completed crawl job.

```typescript
import { config } from "dotenv";
import Hyperbrowser from "@hyperbrowser/sdk";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

// Load API keys from .env
config();

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const HYPERBROWSER_API_KEY = process.env.HYPERBROWSER_API_KEY || "";
const client = new Hyperbrowser({
  apiKey: HYPERBROWSER_API_KEY,
});

// Structured-output schema: the model's response must match this shape.
const Product = z.object({
  name: z.string(),
  price: z.number(),
  description: z.string(),
  image: z.string(),
});
const ProductSchema = z.object({ products: z.array(Product) });

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Start a crawl job for the given URL and poll until it finishes.
const scrapeShopifySite = async (url: string) => {
  const crawlJob = await client.startCrawlJob({
    url: url,
    maxPages: 1000,
    followLinks: false,
    excludePatterns: [],
    includePatterns: [],
    useProxy: false,
    solveCaptchas: false,
  });
  console.log("Crawl job started: ", crawlJob.jobId);

  let completed = false;
  while (!completed) {
    const job = await client.getCrawlJob(crawlJob.jobId);
    if (job.status === "completed") {
      completed = true;
      console.log("Crawl job completed: ", crawlJob.jobId);
    } else if (job.status === "failed") {
      console.error("Crawl job failed: ", job.error);
      completed = true;
    } else {
      console.log("Crawl job is still running: ", job.status);
      await sleep(5_000);
    }
  }
};

// Fetch the crawled pages in batches, concatenate their markdown, and ask
// OpenAI to extract structured product data from it.
const extractProductData = async (jobId: string) => {
  let scrapedMarkdown = "";
  let pageIndex = 1;
  while (true) {
    const crawlJobResult = await client.getCrawlJob(jobId, {
      page: pageIndex,
      batchSize: 10,
    });
    const pages = crawlJobResult.data;
    if (!pages) {
      // No data returned for this batch; stop instead of looping forever.
      break;
    }
    for (const page of pages) {
      const pageData = page.markdown;
      if (pageData) {
        scrapedMarkdown += pageData;
      }
    }
    if (pageIndex >= crawlJobResult.totalPageBatches) {
      break;
    }
    pageIndex++;
  }

  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are an expert data extractor. Your task is to extract
product information from the provided scraped content. For each product,
extract:
- name: the product name
- price: the price as a number
- description: a brief description of the product
- image: the product image URL

Provide the extracted data as a JSON object. Parse the markdown content
carefully to identify and categorize the product details accurately.`,
      },
      { role: "user", content: scrapedMarkdown },
    ],
    response_format: zodResponseFormat(ProductSchema, "product"),
  });

  const productInfo = completion.choices[0].message.parsed;
  console.log("Products:", productInfo);
};

// CLI entry point: `yarn start scrape <url>` or `yarn start extract <jobId>`.
const main = async () => {
  const [command, arg] = process.argv.slice(2);
  if (command === "scrape" && arg) {
    await scrapeShopifySite(arg);
  } else if (command === "extract" && arg) {
    await extractProductData(arg);
  } else {
    console.error("Usage: yarn start scrape <url> | yarn start extract <jobId>");
    process.exit(1);
  }
};

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
Let’s break down what’s happening inside the code:
- **Environment Variables**: We load API keys from `.env` using `dotenv`.
- **Hyperbrowser Client Initialization**: Using our `HYPERBROWSER_API_KEY`, we create a new Hyperbrowser client. This client starts crawl jobs and retrieves results.
- **Scraping Function (`scrapeShopifySite`)**: When we run `yarn start scrape <url>`, the script calls `scrapeShopifySite(url)`. This function:
  - Initiates a crawl job against the target Shopify URL.
  - Waits for the job to complete, periodically checking its status.
- **Extracting Product Data (`extractProductData`)**: Once the crawl job is complete, `extractProductData(jobId)` fetches the scraped pages in batches. It accumulates all markdown content into a single string.
- **Using OpenAI for Parsing**: After gathering the raw markdown content, we send it to OpenAI's model with instructions to parse product details. The model responds with structured JSON adhering to the defined `Product` schema, which includes `name`, `price`, `description`, and `image`. Note that a very large store can produce more markdown than fits in a single request; see the chunking sketch after this list.
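If a store yields more markdown than one model request can hold, you may want to split the input before extraction. Below is a minimal sketch (not part of the example repo) of simple character-based chunking; the `chunkMarkdown` helper and the 100,000-character default are assumptions you'd tune for your model:

```typescript
// A minimal sketch (not in the repo): split very large scraped markdown into
// fixed-size chunks so each extraction request stays within the model's
// context window. The default size is an assumption; tune it for your model.
const chunkMarkdown = (markdown: string, maxChars = 100_000): string[] => {
  const chunks: string[] = [];
  for (let i = 0; i < markdown.length; i += maxChars) {
    chunks.push(markdown.slice(i, i + maxChars));
  }
  return chunks;
};
```

You could then run the extraction request once per chunk and merge the resulting `products` arrays.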
Running the Script
- **Set Environment Variables**

First, create a `.env` file at the root of your project:

```bash
cp .env.example .env
```

Update `.env` with your keys:

```
HYPERBROWSER_API_KEY=your_hyperbrowser_api_key
OPENAI_API_KEY=your_openai_api_key
```
- **Build and Run**

```bash
yarn build
yarn start scrape https://your-target-shopify-store.com
```

This will initiate a crawl job. Note the `jobId` printed to the console. Then, to extract and process the data:

```bash
yarn start extract <jobId>
```
- **Review the Output**

The script will print the extracted product data as a JSON object. You can then pipe this JSON into another script, save it to a database, or integrate it into your workflow.
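For example, here is a hypothetical follow-up step (not part of the repo); the `saveProducts` helper and the `products.json` filename are illustrative:

```typescript
import { writeFileSync } from "node:fs";

// Hypothetical helper (not in the repo): persist the extracted products to a
// JSON file so another script or pipeline step can consume them.
const saveProducts = (products: unknown, path = "products.json") => {
  writeFileSync(path, JSON.stringify(products, null, 2));
  console.log(`Saved extracted products to ${path}`);
};
```

Calling `saveProducts(productInfo)` at the end of `extractProductData` would turn the console output into a durable artifact.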
Example Output
A successful run might output something like:
{ "products": [ { "name": "Classic White T-Shirt", "price": 19.99, "description": "A soft cotton t-shirt with a classic fit.", "image": "https://cdn.shopify.com/s/files/…" }, { "name": "Black Hoodie", "price": 49.99, "description": "A warm fleece hoodie perfect for cooler days.", "image": "https://cdn.shopify.com/s/files/…" } ] }
You can find the full code in the Hyperbrowser GitHub repository.