How to Scrape a Shopify Store Using Hyperbrowser and OpenAI

This guide shows you how to automatically collect product details from Shopify stores using two tools: Hyperbrowser, which crawls the site and returns LLM-ready Markdown, and OpenAI, which extracts the structured product data from that Markdown.
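At a glance, the workflow boils down to two commands, which the rest of this guide sets up (the store URL below is a placeholder):

    yarn start scrape https://your-target-shopify-store.com
    yarn start extract <jobId>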

Why Hyperbrowser?

Most web scrapers are hard to set up and break when websites change. Hyperbrowser makes it simple: it handles the scraping and crawling for you and gives you clean data back. You can start scraping right away, get the site's content in Markdown format, and easily connect it to AI tools like GPT for processing.

Prerequisites

Before getting started, you’ll need:

  1. Node.js v16 or higher: Our example is a Node.js application.
  2. Hyperbrowser API Key: Sign up and obtain your API key from Hyperbrowser.
  3. OpenAI API Key: Get this from your OpenAI account settings.
  4. A Target Shopify URL: The public Shopify store you want to scrape.

Make sure you have these environment variables stored securely in a .env file. For example:

HYPERBROWSER_API_KEY=your_hyperbrowser_api_key
OPENAI_API_KEY=your_openai_api_key

Step-by-Step Guide

  1. Project Setup
    Clone the example repository:

    git clone https://github.com/hyperbrowserai/shopify-scraper.git
    cd shopify-scraper
    yarn
  2. The Code
    Below is the code you can use. This script does two things:

    • scrape <url>: Starts a crawl job on the given Shopify URL.
    • extract <jobId>: Extracts product information from the completed crawl job.
import { config } from "dotenv";
import Hyperbrowser from "@hyperbrowser/sdk";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

config();

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const HYPERBROWSER_API_KEY = process.env.HYPERBROWSER_API_KEY || "";
const client = new Hyperbrowser({
  apiKey: HYPERBROWSER_API_KEY,
});

// Schema the extracted data must conform to.
const Product = z.object({
  name: z.string(),
  price: z.number(),
  description: z.string(),
  image: z.string(),
});
const ProductSchema = z.object({ products: z.array(Product) });

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Start a crawl job for the given URL and poll until it finishes.
const scrapeShopifySite = async (url: string) => {
  const crawlJob = await client.startCrawlJob({
    url: url,
    maxPages: 1000,
    followLinks: false,
    excludePatterns: [],
    includePatterns: [],
    useProxy: false,
    solveCaptchas: false,
  });
  console.log("Crawl job started: ", crawlJob.jobId);

  let completed = false;
  while (!completed) {
    const job = await client.getCrawlJob(crawlJob.jobId);
    if (job.status === "completed") {
      completed = true;
      console.log("Crawl job completed: ", crawlJob.jobId);
    } else if (job.status === "failed") {
      console.error("Crawl job failed: ", job.error);
      completed = true;
    } else {
      console.log("Crawl job is still running: ", job.status);
      await sleep(5_000); // poll every 5 seconds
    }
  }
};

// Fetch the crawled pages in batches, concatenate their Markdown,
// and have OpenAI extract structured product data from it.
const extractProductData = async (jobId: string) => {
  let scrapedMarkdown = "";
  let pageIndex = 1;
  while (true) {
    const crawlJobResult = await client.getCrawlJob(jobId, {
      page: pageIndex,
      batchSize: 10,
    });
    const pages = crawlJobResult.data;
    if (pages) {
      for (const page of pages) {
        const pageData = page.markdown;
        if (pageData) {
          scrapedMarkdown += pageData;
        }
      }
    }
    // Advance (or stop) regardless of whether this batch had data,
    // so an empty batch can't loop forever.
    if (pageIndex >= crawlJobResult.totalPageBatches) {
      break;
    }
    pageIndex++;
  }

  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `You are an expert data extractor. Your task is to extract
product information from the provided scraped content. For each product,
extract:
- name: the product name
- price: the price as a number
- description: a brief description of the product
- image: the product image URL
Provide the extracted data as a JSON object. Parse the Markdown content
carefully to identify and categorize the product details accurately.`,
      },
      { role: "user", content: scrapedMarkdown },
    ],
    response_format: zodResponseFormat(ProductSchema, "product"),
  });

  const productInfo = completion.choices[0].message.parsed;
  console.log("Products:", productInfo);
};

// Minimal CLI dispatch so `yarn start scrape <url>` and
// `yarn start extract <jobId>` map to the functions above.
const main = async () => {
  const [command, arg] = process.argv.slice(2);
  if (command === "scrape" && arg) {
    await scrapeShopifySite(arg);
  } else if (command === "extract" && arg) {
    await extractProductData(arg);
  } else {
    console.error("Usage: yarn start scrape <url> | yarn start extract <jobId>");
    process.exit(1);
  }
};

main();

Let’s break down what’s happening inside the code:

  • scrapeShopifySite starts a crawl job with client.startCrawlJob, then polls getCrawlJob every 5 seconds until the job reports completed or failed, logging the jobId along the way.
  • extractProductData pages through the crawl results in batches of 10 and concatenates each page’s Markdown into one string.
  • That combined Markdown is sent to gpt-4o-mini with zodResponseFormat(ProductSchema, "product"), which constrains the response to the Product schema (name, price, description, image), so the parsed result comes back as typed JSON instead of free text.
  • A small CLI dispatcher at the bottom maps the scrape and extract commands to these two functions.

Running the Script

  1. Set Environment Variables: First, create a .env file at the root of your project:
cp .env.example .env

Update .env with your keys:

HYPERBROWSER_API_KEY=your_hyperbrowser_api_key
OPENAI_API_KEY=your_openai_api_key
  2. Build and Run:

    yarn build
    yarn start scrape https://your-target-shopify-store.com

    This will initiate a crawl job. Note the jobId printed when the job completes (see the sample console output after this list). Then, to extract and process the data:

    yarn start extract <jobId>
  3. Review the Output:
    The script will print the extracted product data as a JSON object. You can then pipe this JSON into another script, save it to a database, or integrate it into your workflow (a minimal persistence sketch follows this list).
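For reference, the scrape command’s own console.log statements produce output along these lines while it polls (the jobId and status values are placeholders):

    Crawl job started:  <jobId>
    Crawl job is still running:  <status>
    Crawl job completed:  <jobId>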

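If you’d rather persist the results than read them off the console, one option (a hypothetical tweak, not part of the repo) is to write the parsed object to disk at the end of extractProductData:

    import { writeFileSync } from "fs";

    // Hypothetical addition inside extractProductData: write the parsed
    // products to a file instead of only logging them.
    writeFileSync("products.json", JSON.stringify(productInfo, null, 2));
    console.log("Wrote products.json");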

Example Output

A successful run might output something like:

{ "products": [ { "name": "Classic White T-Shirt", "price": 19.99, "description": "A soft cotton t-shirt with a classic fit.", "image": "https://cdn.shopify.com/s/files/…" }, { "name": "Black Hoodie", "price": 49.99, "description": "A warm fleece hoodie perfect for cooler days.", "image": "https://cdn.shopify.com/s/files/…" } ] }

You can find the full code in the Hyperbrowser GitHub repository.
