Web Scraping with Playwright Aria Snapshots in crawl4ai
I recently came across Playwright’s Aria Snapshots while testing the Playwright MCP.
As someone who regularly needs to extract data from websites, I’m always looking for more reliable ways to understand webpage structure.
When I tried out Microsoft’s Playwright MCP (Model Context Protocol) during the MCP boom a few months ago, I was impressed by how it used Aria Snapshots to navigate websites. It navigated far better than any browser agent I had tested up to that point.
Here’s the full script in a GitHub gist.
What Are Aria Snapshots?
Aria Snapshots provide a semantic representation of a webpage’s accessibility tree – essentially showing the structure and relationships between elements in a way that assistive technologies (and now our scrapers) can understand.
It’s like getting a clean, hierarchical outline of a webpage that’s much easier to parse than raw HTML.
Why It’s Interesting
I’ve been using crawl4ai for my scraping projects, which already has good methods for converting content to Markdown. But after seeing how well Playwright MCP worked with GPT 4.1, I wanted to integrate Aria Snapshots into my workflow.
The JavaScript version of Playwright has had this feature for a while, but I was excited when Python support finally arrived in version 1.52. Now it’s available in crawl4ai.
How It Works
Here’s a simple script that shows how to extract an Aria Snapshot using crawl4ai:
import asyncio
from typing import Any

from crawl4ai import AsyncWebCrawler
from playwright.async_api import Page
import nest_asyncio

nest_asyncio.apply()

# Define context globally to be accessible by the hook
crawl_context = {}

URL = "https://testy.cool"

async def snapshot_hook(page: Page, context: Any, **kwargs) -> bool:
    """Get accessibility snapshot and store it in the global crawl_context."""
    try:
        snapshot = await page.locator("html").aria_snapshot()
        crawl_context['aria_snapshot'] = snapshot  # Store snapshot in the global context dict
        print("snapshot_hook: Stored snapshot in global crawl_context. Print it with print(crawl_context['aria_snapshot'])")
        return True  # Indicate hook success
    except Exception as e:
        print(f"Error capturing snapshot in hook: {e}")
        return False  # Indicate hook failure

async def main():
    # Clear context at the start of main in case it's run multiple times
    global crawl_context
    crawl_context = {}

    crawler = AsyncWebCrawler(verbose=True, headless=True)
    async with crawler:
        crawler.crawler_strategy.set_hook("after_goto", snapshot_hook)
        # No need to pass context to arun anymore
        result = await crawler.arun(url=URL)

    # Check success status first
    if result.success:
        print("Crawl completed successfully.")
        # Access the snapshot from the global context
        if 'aria_snapshot' in crawl_context:
            print("Aria Snapshot captured:")
            # print(crawl_context['aria_snapshot'])
            print("Snapshot is available in the global crawl_context dictionary.")
        else:
            print("Aria snapshot not found in the global crawl_context.")
    else:
        print(f"Crawl failed for URL: {getattr(result, 'url', 'N/A')}")
        first_result = result._results[0] if result._results else None
        if first_result and hasattr(first_result, 'error_message'):
            print(f"Error details: {first_result.error_message}")
        elif first_result:
            print(f"Result details: {first_result}")
    return result

if __name__ == "__main__":
    crawl_result = asyncio.run(main())
    print(crawl_context['aria_snapshot'])
Example Output
When run against a simple website, the snapshot looks like this (truncated for brevity):
- document:
  - link "Skip to content":
    - /url: "#main"
  - banner:
    - navigation "Main Menu":
      - list:
        - listitem:
          - link "Home":
            - /url: https://testy.cool/
        - listitem:
          - link "About":
            - /url: https://testy.cool/about/
      - link "Stylized walnut wearing pink sunglasses logo":
        - /url: https://testy.cool/
        - img "Stylized walnut wearing pink sunglasses logo"
      - button "Search"
      - link "CONTACT":
        - /url: https://testy.cool/contact
  - main:
    - article:
      - list:
        - listitem:
          - link "Ramblings":
            - /url: https://testy.cool/category/ramblings/
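Because the snapshot is plain indented text, light post-processing is enough to pull structured data out of it. Here’s a hypothetical sketch (my own parsing logic, not part of crawl4ai or Playwright) that pairs each link’s accessible name with the `/url:` line that follows it:

```python
import re

def extract_links(snapshot: str) -> list[tuple[str, str]]:
    """Pair each `link "Name":` line with the `/url:` line that follows it."""
    links = []
    pending_name = None
    for line in snapshot.splitlines():
        link_match = re.search(r'- link "([^"]+)"', line)
        if link_match:
            pending_name = link_match.group(1)
            continue
        url_match = re.search(r'- /url: "?([^"\s]+)"?', line)
        if url_match and pending_name is not None:
            links.append((pending_name, url_match.group(1)))
            pending_name = None
    return links

sample = '''- document:
  - link "Home":
    - /url: https://testy.cool/
  - link "About":
    - /url: https://testy.cool/about/'''

print(extract_links(sample))
# → [('Home', 'https://testy.cool/'), ('About', 'https://testy.cool/about/')]
```

This is deliberately simplistic; a real pipeline would parse the indentation to keep the tree structure intact.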
Not Always Better Than Markdown
Important: Aria Snapshots aren’t always better than traditional Markdown extraction.
Not all websites follow accessibility standards, which can lead to incomplete snapshots; in those cases, converting to Markdown is the better option.
I haven’t tested thoroughly on many sites, but I like having all these options around:
- Aria Snapshots for well-structured sites
- Regular Markdown extraction for simple content
- Screenshots, which I use alongside Aria Snapshots or Markdown, depending on the case
- In some cases, recording the website and feeding the video to Gemini 2.5 Flash, though I haven’t done anything interesting with this yet
Are Aria Snapshots better than markdown +/- screenshots?
I don’t know. I haven’t tested them thoroughly yet. I think it depends on the project.
In my case I need to deal with thousands of sites, and I haven’t tested enough of them yet to see where each approach might fail.