Hazumi News | Show HN: wxpath – Declarative web crawling in XPath

wxpath is a declarative web crawler where web crawling and scraping are expressed directly in XPath.

Instead of writing imperative crawl loops, you describe what to follow and what to extract in a single expression:

    import wxpath

    # Crawl, extract fields, build a Wikipedia knowledge graph
    path_expr = """
    url('https://en.wikipedia.org/wiki/Expression_language')
         ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
             /map{
                'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
                'url': string(base-uri(.)),
                'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
                'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
             }
    """

    for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
        print(item)

The key addition is a `url(...)` operator that fetches and returns HTML for further XPath processing, and `///url(...)` for deep (or paginated) traversal. Everything else is standard XPath 3.1 (maps/arrays/functions).

Features:

- Async/concurrent crawling with streaming results

- Scrapy-inspired auto-throttle and polite crawling

- Hook system for custom processing

- CLI for quick experiments

Another example, paginating through HN comments (via "follow=" argument) pages and extracting data:

    url('https://news.ycombinator.com',
        follow=//a[text()='comments']/@href | //a[@class='morelink']/@href)
        //tr[@class='athing']
          /map {
            'text': .//div[@class='comment']//text(),
            'user': .//a[@class='hnuser']/@href,
            'parent_post': .//span[@class='onstory']/a/@href
          }

Limitations: HTTP-only (no JS rendering yet), no crawl persistence. Both are on the roadmap if there's interest.

GitHub: https://github.com/rodricios/wxpath

PyPI: pip install wxpath

I'd love feedback on the expression syntax and any use cases this might unlock.

Thanks!

Summary

wxpath is a Python library that provides a simple and efficient way to query and manipulate XML and HTML documents using XPath expressions. It offers a concise and intuitive API, making it a powerful tool for web scraping and data extraction tasks.

Story

Show HN: wxpath – Declarative web crawling in XPath