Nodriver: A Game-Changer in Web Automation

In the cat-and-mouse game between web scraping tools and anti-scraping tools, the scraping tools long ago lost the ability to bypass detection out of the box. With default settings, you're pretty much limited to sites with little or no detection, say checks based on user agent strings or simple browser presence detection. But now there's a new player in the web scraper's corner. Nodriver has emerged as an alternative to the Puppeteers, Seleniums, and Playwrights of the world. Designed to bypass even sophisticated anti-bot measures, Nodriver is a high-performance, asynchronous web automation framework for developers who need a robust and reliable tool for scraping, testing, and automating web interactions.

The Problem with Existing Automation Frameworks

The main issue with the existing automation frameworks is that their default configurations are very well known, which means the signals they give off are well known too. Despite plugins and patched builds like puppeteer-stealth, rebrowser, real-browser, and many more, these frameworks remain quite detectable by WAFs and bot-protection services like Cloudflare, Imperva, and DataDome. Puppeteer especially, being the most popular framework, has been the primary target. Instead of patching these issues after the fact with plugins and extensions, Nodriver takes a different approach and addresses them at the framework level. By minimizing its footprint and communicating directly over the Chrome DevTools Protocol (CDP), Nodriver leaves very few traces of its presence, if any at all. A side effect is that Nodriver is also one of the fastest scraping frameworks available.
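
To make "communicating directly over the Chrome DevTools Protocol" a little more concrete: CDP commands are JSON messages sent over a WebSocket, each carrying an id, a method name, and params. The sketch below only builds such a message for the Page.navigate method. The helper name build_cdp_command is made up for illustration, and this is not Nodriver's internal code.

```python
import itertools
import json

# Message ids: CDP replies echo the id, so responses can be matched to requests.
_next_id = itertools.count(1)


def build_cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as the JSON text that would go over the WebSocket.

    Hypothetical helper for illustration only; not part of Nodriver.
    """
    return json.dumps({"id": next(_next_id), "method": method, "params": params})


# Build (but do not send) a navigation command.
message = build_cdp_command("Page.navigate", url="https://books.toscrape.com/")
print(message)
```

A framework that speaks CDP directly sends messages of this shape itself rather than going through a separate driver binary such as WebDriver.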

A toy example

Let's try to get a simple example working by scraping from books.toscrape.com.

Installing Nodriver is simple and can be done via pip:

pip install nodriver

And here's a simple test script:

from typing import List

import nodriver as nd
from pydantic import BaseModel


class BookInfo(BaseModel):
    star_rating: int
    title: str
    price: str

    def __str__(self) -> str:
        return f"""
        title: {self.title}
        star_rating: {self.star_rating}
        price: {self.price}
        """


async def main():
    books: List[BookInfo] = []

    # Start the browser and open the catalogue page
    browser = await nd.start()
    page = await browser.get("https://books.toscrape.com/")

    # Each book on the page is wrapped in an <article> element
    elems = await page.select_all("article")
    for elem in elems:
        starsElement: nd.Element = await elem.query_selector("p.star-rating")  # type: ignore
        titleElement: nd.Element = await elem.query_selector("h3 > a")  # type: ignore
        priceElement: nd.Element = await elem.query_selector("div.product_price > p")  # type: ignore

        # The rating is encoded as a word in the element's CSS classes
        stars_classes: List[str] = starsElement.attrs["class_"]
        star_count = 0
        if "One" in stars_classes:
            star_count = 1
        elif "Two" in stars_classes:
            star_count = 2
        elif "Three" in stars_classes:
            star_count = 3
        elif "Four" in stars_classes:
            star_count = 4
        elif "Five" in stars_classes:
            star_count = 5

        books.append(
            BookInfo(
                star_rating=star_count, title=titleElement.text, price=priceElement.text
            )
        )

    for book in books:
        print(book)


if __name__ == "__main__":
    nd.loop().run_until_complete(main())

So, what do we do here?

The steps here don't deviate much from other frameworks: start the browser, open the page, select every article element, and query each one for its star rating, title, and price. Nodriver also has some syntactic sugar for simple and common use cases. Other common operations such as clicks (await element.click()) and JS injection (await page.evaluate('console.log("Hello World!")')) are supported as well.
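
The if/elif chain in the script can also be written as a lookup table, which keeps the parsing logic separate from the browser calls and lets it be checked without launching Chrome. This is a minimal sketch, assuming the class attribute arrives either as a list of class names or as a single space-separated string. The names STAR_WORDS and parse_star_rating are hypothetical.

```python
from typing import Iterable, Union

# Rating words used in the CSS classes on books.toscrape.com
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_star_rating(classes: Union[str, Iterable[str]]) -> int:
    """Return the star count encoded in a book's CSS classes, or 0 if none is found."""
    if isinstance(classes, str):
        # Accept the raw attribute value, e.g. "star-rating Three"
        classes = classes.split()
    for name in classes:
        if name in STAR_WORDS:
            return STAR_WORDS[name]
    return 0


print(parse_star_rating("star-rating Three"))
print(parse_star_rating(["star-rating", "Five"]))
```

Inside the scraping loop, the call would take the place of the chain, for example star_count = parse_star_rating(starsElement.attrs["class_"]).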

The main dish, stealth

Syntactic sugar aside, the main reason to use Nodriver over other frameworks is the stealth it offers out of the box. Let's test this by opening a page on G2, first with Puppeteer:

import puppeteer from "puppeteer-core";

const sleep = async (timeout: number) => {
  return new Promise((resolve) => setTimeout(resolve, timeout));
};

async function testFunction() {
  const browser = await puppeteer.launch({
    defaultViewport: null,
    headless: false, // Set headless to false to see browser actions
    executablePath:
      "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  });
  const page = (await browser.pages())[0];
  await page.goto("https://www.g2.com/products/g2/reviews");
  await sleep(10000);
  await browser.close();
}

testFunction().catch(console.error);

And here's the result:

[Screenshot: puppeteer-fail]

As you can see, it does show us a CAPTCHA screen, but that's just a way to waste the bot's time. It ends up blocking us anyway.


Now the same page with Nodriver:

import asyncio

import nodriver as uc


async def main():
    browser = await uc.start()
    page = await browser.get("https://www.g2.com/products/g2/reviews")
    # Keep the browser open long enough to see the result
    await asyncio.sleep(10)


if __name__ == "__main__":
    uc.loop().run_until_complete(main())

And here's the result:

[Screenshot: nodriver-success]

Here we get the same CAPTCHA screen, but once it is solved, the page loads successfully.

I'm sold. But what do I give up?

Well, there are some downsides. For one, moving away from more established frameworks means there isn't much community support, and there are few packages for extending functionality. You also lose the nice type checking you get with Puppeteer from having the same language throughout, including in the calls to page.evaluate. The reliance on the raw Chrome DevTools Protocol can be a double-edged sword too: it makes things a bit more verbose, but it also makes the whole application easier to know inside out. Lastly, it's not a cure-all for CAPTCHAs or WAFs. At some level there will always be a need for CAPTCHA-solving techniques as well, since the primary intent of these systems has always been to discourage and slow down scrapers.
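
One way to soften the loss of type checking around page.evaluate is to validate whatever comes back from the page before using it. The sketch below is only an illustration under stated assumptions: it assumes the injected script returns a JSON string (for instance via JSON.stringify) and that the string reaches Python unchanged. PageSummary, parse_page_summary, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class PageSummary:
    title: str
    link_count: int


def parse_page_summary(raw: str) -> PageSummary:
    """Check the JSON text returned from the page before trusting its shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    title = data.get("title")
    link_count = data.get("link_count")
    if not isinstance(title, str):
        raise ValueError(f"expected a string title, got {title!r}")
    # bool is a subclass of int in Python, so rule it out explicitly
    if not isinstance(link_count, int) or isinstance(link_count, bool):
        raise ValueError(f"expected an integer link_count, got {link_count!r}")
    return PageSummary(title=title, link_count=link_count)


# Made-up stand-in for the string a script such as
# JSON.stringify({title: document.title, link_count: document.links.length})
# might hand back.
raw_result = '{"title": "All products", "link_count": 94}'
print(parse_page_summary(raw_result))
```

The same idea works with a pydantic model like the BookInfo class used earlier; the dataclass version just avoids the extra dependency.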

HyperBrowser and Nodriver

Here at HyperBrowser, we're trying out Nodriver in an alpha version of our crawl and scrape products. If you're interested in trying it out or learning more, contact us at info@hyperbrowser.ai.
