The files are actually on another computer. Web Crawling 101

Web crawling but very very simple.

The first bit of inspiration for this learning journey was this /r/dailyprogrammer Web Crawler challenge.

Admittedly, I also saw some posts about Anthropic’s webcrawler interview challenge and wondered if I’d be able to implement something for it.

TL;DR: See the full code here.

The challenge

You can read the full reddit post for the complete description but in a nutshell the challenge was to build a webcrawler that supports:

HTTP/1.1 client behaviors
GET requests
Parsing all links presented in HTML - anchors, images, scripts, etc
Accepts a starting (seed) URL
Accepts a maximum depth to recurse to (e.g. “1” would be fetch the HTML page and all resources like images and script associated with it but don’t visit any outgoing anchor links; a depth of “2” would visit the anchor links found on that first page only, etc …)
Does not visit the same link more than once per session

First pass

This took me longer than I care to admit…

After getting way too distracted by choosing the best HTML parser in the Elixir ecosystem I realized I was letting minutia distract me from attempting to solve the prompt (I ended up using the standard Floki…)

On the first pass I hit the following:

HTTP/1.1 client behaviours
GET requests are the only method you must support
Do not visit the same link more than once

Then I hit a little bonus of doing the requests concurrently. Which was an enjoyable sidequest in the Elixir ecosystem.

I also did a little bit of benchmarking between the sync and concurrent version. Concurrent was faster…

At this point I got distracted reading up about TCP, HTTP, and TLS until I wrote this article and put a little more into it.

Some code

I ended up going with a recursive solution, I thought it read much nicer and leaned into Elixir’s strengths.

A simple API entry point was provided to hideaway the internal mechanics:

defmodule WebCrawler do
  @moduledoc """
  Entry point for crawlin'

  https://crawler-test.com/
  """

  def crawl(url) do
    uri = URI.new!(url)
    
    [uri]
    |> crawl([])
    |> Enum.map(&URI.to_string/1)
    |> Enum.frequencies()
  end
end

Then the recursion was relatively simple:

defmodule WebCrawler do
  # Base: no more URIs to visit
  defp crawl([] = _visit, visited) do
    visited
  end
  
  # Process a batch of URIs
  defp crawl(visit, visited, context) do
    %{current_depth: current_depth} = context

    requests =
      for uri <- visit do
        Task.async(fn ->
          %Req.Response{body: body} = Req.get!(Req.new(url: uri))

          body
          |> Floki.parse_document!()
          |> Floki.find("a[href]")
          |> Enum.map(&Floki.attribute(&1, "href"))
          |> Enum.map(&hd/1)
          |> Enum.map(&String.replace_suffix(&1, "/", ""))
          |> Enum.map(&URI.merge(uri, &1))
          |> Enum.reject(&(&1.host != uri.host))
        end)
      end

    visited = visit ++ visited

    uris_to_visit =
      requests
      |> Task.await_many()
      |> List.flatten()
      |> Enum.reject(&(&1 in visited))

    crawl(uris_to_visit, visited)
  end
end

A note: An astute reader may notice that the above concurrency speed is bound by the slowest request… There’s improvements to be made.

Adding depth limits

Off the top of my head, adding a depth that supports a breadth first search should not be terribly difficult and would require:

The specification of the max depth
Tracking the current depth
Bailing out when the max depth is reached

Something like this:

defmodule WebCrawler do
  def crawl(url, opts \\ []) do
    uri = URI.new!(url)
  
    max_depth = Keyword.get(opts, :max_depth, :infinity)
  
    context = %{
      current_depth: 0,
      max_depth: max_depth
    }
  
    [uri]
    |> crawl([], context)
    |> Enum.map(&URI.to_string/1)
    |> Enum.frequencies()
  end
  

  defp crawl([], visited, _opts) do
    visited
  end

  # At max depth
  defp crawl(visit, visited, %{current_depth: max_depth, max_depth: max_depth}) do
    # Here the visited URIs AND the URIs to visit are returned to satisfy the
    # following spec:
    # > Accepts a maximum depth to recurse to (e.g. "1" would be fetch
    # > the HTML page and all resources like images and script associated with it
    # > but don't visit any outgoing anchor links;
    visited ++ visit
  end
  
  defp crawl(visit, visited, context) do
     %{current_depth: current_depth} = context
     
     #...
     
    new_context = %{context | current_depth: current_depth + 1}
    crawl(new_anchor_uris, uris_to_visit ++ visited, new_context)
  end
end

The above is a breadth-first search. Breadth-first search specifies that all nodes at the current depth are processed before moving onto the next depth level.

Since the recursive case processes all the URIs found by the previous recursion, the above is true.

To illustrate, say that $URI_A has links to $URI_B and $URI_C. $URI_B links to $URI_D and $URI_C links to $URI_E

With a max depth of 0, then vistied = [], visit = [$URI_A], and all URIs at depth 0 have been visited, ie: none.

With a max depth of 1, then vistied = [$URI_A], visit = [$URI_B, $URI_C], meaning all URIs at depth 1 have been visited and we have collected the URIs at depth 2.

With a max depth of 2, then vistied = [$URI_A, $URI_B, $URI_C], visit = [$URI_D, $URI_E], meaning all URIs at depth 2 have been visited and we have collected the URIs at depth 3.

And so on.

What I learnt

I hadn’t really appreciated how fundamental webcrawling is to a massive chunk of the internet. Which seems silly in hindsight…

Webcrawling as a challenge is also quite intersting and this provided a very small peek into that world. Skimming this crawler-test.com site shows a mass of edge cases to handle.

There’s also some really cool optimization side quests to go on!

Overall, this was an interesting challenge. I did not hit all the bench marks but I can see myself returning here to do things like:

Handle things that aren’t anchor tags
Add some more juice to the concurrency
Add some better web crawling manners, such as respecting robots.txt files

What’s next

Well, in doing the above I realized that it’s been a very long time since I’ve really thought about HTTP, TLS, and how communication happens on the internet!

So a likely diversion into refreshing my TCP knowledge with a small project will be next.

Thanks for tuning in!

2026-03-31

← The files are in the computer