Web Scraping with Playwright Aria Snapshots in crawl4ai
I recently came across Playwright’s Aria Snapshots while testing the Playwright MCP.
As someone who regularly needs to extract data from websites, I’m always looking for more reliable ways to understand webpage structure.
When I tried out Microsoft’s Playwright MCP (Model Context Protocol) during the MCP boom a few months ago, I was impressed by how it used Aria Snapshots to navigate websites. It navigated far better than any browser agent I had tested up to that point.
Here’s the full script in a GitHub gist.
What Are Aria Snapshots?
Aria Snapshots provide a semantic representation of a webpage’s accessibility tree – essentially showing the structure and relationships between elements in a way that assistive technologies (and now our scrapers) can understand.
It’s like getting a clean, hierarchical outline of a webpage that’s much easier to parse than raw HTML.
Why It’s Interesting
I’ve been using crawl4ai for my scraping projects, which already has good methods for converting content to Markdown. But after seeing how well Playwright MCP worked with GPT 4.1, I wanted to integrate Aria Snapshots into my workflow.
The JavaScript version of Playwright has had this feature for a while, but I was excited when Python support finally arrived in version 1.52. Now it’s available in crawl4ai.
How It Works
Here’s a simple script that shows how to extract an Aria Snapshot using crawl4ai:
import asyncio
from typing import Any

from crawl4ai import AsyncWebCrawler
from playwright.async_api import Page
import nest_asyncio

nest_asyncio.apply()

# Define context globally to be accessible by the hook
crawl_context = {}

URL = "https://testy.cool"

async def snapshot_hook(page: Page, context: Any, **kwargs) -> bool:
    """Get accessibility snapshot and store it in the global crawl_context."""
    try:
        snapshot = await page.locator("html").aria_snapshot()
        crawl_context['aria_snapshot'] = snapshot  # Store snapshot in the global context dict
        print("snapshot_hook: Stored snapshot in global crawl_context. Print it with print(crawl_context['aria_snapshot'])")
        return True  # Indicate hook success
    except Exception as e:
        print(f"Error capturing snapshot in hook: {e}")
        return False  # Indicate hook failure

async def main():
    # Clear context at the start of main in case it's run multiple times
    global crawl_context
    crawl_context = {}

    crawler = AsyncWebCrawler(verbose=True, headless=True)
    async with crawler:
        crawler.crawler_strategy.set_hook("after_goto", snapshot_hook)
        # No need to pass context to arun anymore
        result = await crawler.arun(url=URL)

    # Check success status first
    if result.success:
        print("Crawl completed successfully.")
        # Access the snapshot from the global context
        if 'aria_snapshot' in crawl_context:
            print("Aria Snapshot captured:")
            # print(crawl_context['aria_snapshot'])
            print("Snapshot is available in the global crawl_context dictionary.")
        else:
            print("Aria snapshot not found in the global crawl_context.")
    else:
        print(f"Crawl failed for URL: {getattr(result, 'url', 'N/A')}")
        first_result = result._results[0] if result._results else None
        if first_result and hasattr(first_result, 'error_message'):
            print(f"Error details: {first_result.error_message}")
        elif first_result:
            print(f"Result details: {first_result}")
    return result

if __name__ == "__main__":
    crawl_result = asyncio.run(main())
    print(crawl_context['aria_snapshot'])
Example Output
When run against a simple website, the snapshot looks like this (truncated for brevity):
- document:
  - link "Skip to content":
    - /url: "#main"
  - banner:
    - navigation "Main Menu":
      - list:
        - listitem:
          - link "Home":
            - /url: https://testy.cool/
        - listitem:
          - link "About":
            - /url: https://testy.cool/about/
      - link "Stylized walnut wearing pink sunglasses logo":
        - /url: https://testy.cool/
        - img "Stylized walnut wearing pink sunglasses logo"
      - button "Search"
      - link "CONTACT":
        - /url: https://testy.cool/contact
  - main:
    - article:
      - list:
        - listitem:
          - link "Ramblings":
            - /url: https://testy.cool/category/ramblings/
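Because the snapshot is plain indented text, light post-processing is enough to pull structured data out of it. Here’s a hypothetical sketch (my own parsing logic, not part of crawl4ai or Playwright) that pairs each link’s accessible name with the `/url:` line that follows it:

```python
import re

def extract_links(snapshot: str) -> list[tuple[str, str]]:
    """Pair each `link "Name":` line with the `/url:` line that follows it."""
    links = []
    pending_name = None
    for line in snapshot.splitlines():
        link_match = re.search(r'- link "([^"]+)"', line)
        if link_match:
            pending_name = link_match.group(1)
            continue
        url_match = re.search(r'- /url: "?([^"\s]+)"?', line)
        if url_match and pending_name is not None:
            links.append((pending_name, url_match.group(1)))
            pending_name = None
    return links

sample = '''- document:
  - link "Home":
    - /url: https://testy.cool/
  - link "About":
    - /url: https://testy.cool/about/'''

print(extract_links(sample))
# → [('Home', 'https://testy.cool/'), ('About', 'https://testy.cool/about/')]
```

This is deliberately simplistic; a real pipeline would parse the indentation to keep the tree structure intact.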
Not Always Better Than Markdown
Important: Aria Snapshots aren’t always better than traditional Markdown extraction.
Not all websites follow accessibility standards, which can lead to incomplete snapshots; in those cases, converting to Markdown is the better option.
I haven’t tested thoroughly on many sites, but I like having all these options around:
- Aria Snapshots for well-structured sites
- Regular Markdown extraction for simple content
- Screenshots, which I use alongside Aria Snapshots or Markdown, depending on the case
- In some cases, recording the website and feeding the video to Gemini 2.5 Flash, though I haven’t done anything interesting with this yet
Are Aria Snapshots better than markdown +/- screenshots?
I don’t know. I haven’t tested them thoroughly yet. I think it depends on the project.
In my case I need to deal with thousands of sites, and I haven’t tested enough of them yet to see where each approach might fail.