Nodriver: A Game-Changer in Web Automation
In the cat-and-mouse game between web scraping tools and anti-scraping tools, the scraping frameworks long ago gave up on out-of-the-box bypass ability. You can pretty much only scrape sites with little to no detection, say those relying on user agent strings or simple browser presence checks. But now there's a new player in the web scraper's corner. Nodriver has emerged as an alternative to the Puppeteers, Seleniums, and Playwrights of the world. Designed to bypass even the most sophisticated anti-bot measures, Nodriver is a high-performance, asynchronous web automation framework tailored for developers who need a robust, reliable tool for scraping, testing, and automating web interactions.
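As a toy illustration of how naive that first line of defense is, here's a sketch of the kind of user-agent check a lightly protected site might run. The marker strings below are hypothetical examples for illustration, not any real WAF's rules:

```python
# A toy user-agent filter of the sort a lightly protected site might use.
# Real anti-bot systems fingerprint far more than this one header.
BOT_MARKERS = ("headlesschrome", "python-requests", "curl", "scrapy")


def looks_like_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)
```

A bot that simply rewrites its user-agent string sails straight past a check like this, which is why modern WAFs moved on to fingerprinting deeper browser behavior.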
The Problem with Existing Automation Frameworks
The main issue with all the existing automation frameworks is that their default configurations are very well known, which means their detection signals are very well known too. Despite the existence of multiple plugins like puppeteer-stealth, rebrowser, real-browser, and many more, they remain quite detectable by WAFs like Cloudflare, Imperva, and Datadome. Puppeteer especially, being the most popular framework, has been the primary target. Instead of patching these issues out of an existing framework with plugins and extensions, Nodriver takes a different approach and works at the framework level itself. By minimizing its footprint and communicating directly over the Chrome DevTools Protocol, Nodriver leaves very few marks of its presence, if any at all. A side effect of this is that Nodriver is also one of the fastest scraping frameworks available.
A toy example
Let's try to get a simple example working by scraping from books.toscrape.com.
Installing Nodriver is simple and can be done via pip:

```shell
pip install nodriver
```
And here's a simple test script:

```python
from typing import List

import nodriver as nd
from pydantic import BaseModel


class BookInfo(BaseModel):
    star_rating: int
    title: str
    price: str

    def __str__(self) -> str:
        return f"""
title: {self.title}
star_rating: {self.star_rating}
price: {self.price}
"""


async def main():
    books: List[BookInfo] = []
    browser = await nd.start()
    page = await browser.get("https://books.toscrape.com/")
    elems = await page.select_all("article")
    for elem in elems:
        starsElement: nd.Element = await elem.query_selector("p.star-rating")  # type: ignore
        titleElement: nd.Element = await elem.query_selector("h3 > a")  # type: ignore
        priceElement: nd.Element = await elem.query_selector("div.product_price > p")  # type: ignore
        stars_classes: List[str] = starsElement.attrs["class_"]
        star_count = 0
        if "One" in stars_classes:
            star_count = 1
        elif "Two" in stars_classes:
            star_count = 2
        elif "Three" in stars_classes:
            star_count = 3
        elif "Four" in stars_classes:
            star_count = 4
        elif "Five" in stars_classes:
            star_count = 5
        books.append(
            BookInfo(
                star_rating=star_count,
                title=titleElement.text,
                price=priceElement.text,
            )
        )
    for book in books:
        print(book)


if __name__ == "__main__":
    nd.loop().run_until_complete(main())
```
So, what do we do here?

- First, we kick off the main function with `nd.loop().run_until_complete(main())`. You can use asyncio functions to run the main function as well, but it's a bit cleaner to just use the provided helper.
- We start the browser using `nd.start()` and go to the page using `browser.get("https://books.toscrape.com/")`. Nodriver has a nice utility where the `get` function on the browser just works on the primary tab, instead of having to select a page first and then work on it.
- We get all the elements matching the given tag using `page.select_all("article")`.
- We iterate over all matching elements to find the title, price, and star rating for each book.
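The chain of if/elif branches that maps star classes to a number can be collapsed into a lookup table. This is a plain-Python refactor sketch of that step, independent of Nodriver:

```python
from typing import List

# Books to Scrape encodes the rating as a CSS class,
# e.g. ["star-rating", "Three"] means a three-star book.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def star_count(classes: List[str]) -> int:
    # Return the value of the first recognized word, or 0 if none match.
    return next((STAR_WORDS[c] for c in classes if c in STAR_WORDS), 0)
```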
So the steps here don't deviate much from other frameworks, and Nodriver also offers some syntactic sugar for simple and common use cases. Other common operations such as clicks (`await element.click()`) and JS injection (`await page.evaluate('console.log("Hello World!")')`) are supported as well.
The main dish, stealth
Syntactic sugar aside, the main reason you'd want to use Nodriver over other frameworks is the stealthiness it offers out of the box. Let's test this by trying to open a page on G2, first with Puppeteer:
```typescript
import puppeteer from "puppeteer-core";

const sleep = async (timeout: number) => {
  return new Promise((resolve) => setTimeout(resolve, timeout));
};

async function testFunction() {
  const browser = await puppeteer.launch({
    defaultViewport: null,
    headless: false, // Set headless to false to see browser actions
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  });
  const page = (await browser.pages())[0];
  await page.goto("https://www.g2.com/products/g2/reviews");
  await sleep(10000);
  await browser.close();
}

testFunction().catch(console.error);
```
And here's the result:
As you can see, even though it gives us a captcha screen, that's just a way to waste the bot's time. It ends up blocking us anyway. Now the same flow with Nodriver:
```python
import asyncio

import nodriver as uc


async def main():
    browser = await uc.start()
    page = await browser.get("https://www.g2.com/products/g2/reviews")
    await asyncio.sleep(10)


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
```
And here's the result:
And here it gives us the same captcha screen, but once solved, it succeeds.
I'm sold. But what do I give up?
Well, there are some downsides. For one, migrating from other, more established frameworks means there's not much in the way of community support or packages for extending functionality. You also don't get the nice type checking Puppeteer gives you by using the same language throughout, including in calls to `page.evaluate`. The reliance on the raw Chrome DevTools Protocol can be a double-edged sword too, making things a bit more verbose, but also making the whole application easier to know inside-out. Lastly, it's not a cure-all for CAPTCHAs or WAFs. At some level, there will always be a need for CAPTCHA-solving techniques, since the primary intent of these defenses has always been to discourage and slow down scrapers.
HyperBrowser and Nodriver
Here at HyperBrowser, we're trying out Nodriver in an alpha version of our crawl and scrape products. If you're interested in trying it out or learning more, contact us at info@hyperbrowser.ai