======================
== fabulous.systems ==
======================
Welcome to the world of fabulous.systems

BTS#1: Submitting Entire Websites to web.archive.org Through Sitemap Parsing


#bts #sysops #webdev

This is the first episode of a new series: BTS, short for “Behind the Scenes”. In this series, I’m going through some scripts and techniques I use to build and maintain fabulous.systems.

Over the weekend, I wrote a script that parses my entire website and submits all URLs to the Internet Archive’s Wayback Machine, including all outgoing links.

Update 2025-01-21: I found out that the Wayback Machine has a limit of 200 submissions per day and IP address. If your site is larger than that, you might want to limit your submissions to pages that have changed recently.
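A minimal sketch of such a filter, based on the sitemap’s <lastmod> entries (the sitemap format is introduced below). It assumes GNU date, a seven-day cutoff, and that every <url> entry carries both a <loc> and a <lastmod>, as Hugo generates them:

# Hypothetical filter: print only URLs whose <lastmod> lies within the last 7 days.
# Assumes GNU date and one <loc>/<lastmod> pair per <url> entry.
cutoff=$(date -d '7 days ago' +%Y-%m-%d)

grep -E '<(loc|lastmod)>' public/sitemap.xml \
	| sed -e 's/.*<loc>//' -e 's/<\/loc>.*//' -e 's/.*<lastmod>//' -e 's/<\/lastmod>.*//' \
	| paste - - \
	| while IFS=$'\t' read -r url lastmod; do
		# ISO 8601 dates compare correctly as plain strings
		if [[ "${lastmod:0:10}" > "${cutoff}" ]]; then
			echo "${url}"
		fi
	done

Feeding this output into the main loop instead of the full sitemap keeps the number of daily submissions down.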

Why do I care about the Archive?

I have to admit that I was not always a massive fan of the Archive’s approach of keeping websites around for all eternity. After all, I want to be able to remove content I no longer identify with, right?

Well, I can. A few years ago, I spotted a website in the Archive hosted on a domain I had previously owned and let expire. Naturally, my content was there, followed by a gap of a couple of years, and then the new owner suddenly started publishing political material on the domain, content I cannot identify with.

I contacted their support, and after I provided evidence and the exact period during which I owned the domain, they removed all references to the previously published content. The situation was resolved within a couple of days, and working with their support staff was a really great experience.

Then fabulous.systems happened. Given the more obscure nature of the content presented here, finding reliable sources is increasingly challenging. There are multiple articles on this website that wouldn’t exist without the help of the Wayback Machine. One day, people will think about your website the same way.

A brief introduction to sitemaps

In short, sitemaps represent your website’s table of contents. They follow a standardized format, and their primary purpose is to help web crawlers such as search engines find your content. In their most basic form, sitemaps simply list all URLs on your website. Additionally, they can include information about a URL’s “weight” (or importance) and the date of the URL’s last modification.

For example, this is the beginning of this website’s sitemap:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://fabulous.systems/categories/</loc>
    <lastmod>2024-06-22T22:00:00+02:00</lastmod>
  </url>
  <url>
    <loc>https://fabulous.systems/posts/2024/06/new-installation-media-for-ms-dos-4/</loc>
    <lastmod>2024-06-22T22:00:00+02:00</lastmod>
  </url>
  <url>
    <loc>https://fabulous.systems/</loc>
    <lastmod>2024-06-22T22:00:00+02:00</lastmod>
  </url>
...

The simple structure of the sitemap XML makes parsing relatively easy.

In fact, it is so simple that I managed to use a single Bash script with curl as the only dependency.

The script

#!/bin/bash

# Auto-announce all pages and related links to archive.org by
# crawling the sitemap.xml rendered by Hugo, fetched locally
SUBMISSION_URL="https://web.archive.org/save/"
SOURCE_URL="https://fabulous.systems"

try_count=0

announce_url() {
	curl -s -o /dev/null -w "%{http_code}" "${SUBMISSION_URL}""${url_to_announce}"
}

archive_url() {
	# Increment trial counter on each run since we don't know about the result here
	((try_count++))

	if [[ $(announce_url "${url_to_announce}") == '302' ]]; then
		echo "Successfully archived \"${url_to_announce}\""
		try_count=0
	else
		echo "Retrying to archive \"${url_to_announce}\""

		# Give the archive.org API some time to cool down...
		sleep 30

		# Retry announcing the same URL again until we reach the failure threshold
		if [ "$try_count" -lt 5 ] && [ "$try_count" -ne 0 ]; then
			archive_url "${url_to_announce}"
		else
			echo "Giving up on url \"${url_to_announce}\""
			try_count=0
			return 1
		fi
	fi
}

extract_external_urls() {
	curl -s -f -L "${url_to_announce}" | grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"' | grep -v "${SOURCE_URL}" | sed 's/\"//g' | sort -u
}

# Starting the main loop, iterating through the
# entire content of sitemap.xml.
grep loc public/sitemap.xml | grep -v legal | grep -v privacy | sed 's/    <loc>//g' | sed 's/<\/loc>//g' | while IFS= read -r url_to_announce; do

	# Start with building a list of all external URLs present on the current page
	external_url_list=$(extract_external_urls "${url_to_announce}")

	# Send archive request for current (internal) URL
	archive_url "${url_to_announce}"

	# Iterate through any external URLs we found and send archive request
	for url_to_announce in ${external_url_list}; do
		archive_url "${url_to_announce}"
	done
done

The script — demystified

Let’s break it down, shall we?

First, we need to parse the sitemap.xml file (located at public/sitemap.xml) and extract all URLs it contains in our main loop. For each URL, we then call extract_external_urls and archive_url, covering both the internal page and its outgoing/external links.

grep loc public/sitemap.xml | grep -v legal | grep -v privacy | sed 's/    <loc>//g' | sed 's/<\/loc>//g' | while IFS= read -r url_to_announce; do
	# Start with building a list of all external URLs present on the current page
	external_url_list=$(extract_external_urls "${url_to_announce}")

	# Send archive request for current (internal) URL
	archive_url "${url_to_announce}"

	# Iterate through any external URLs we found and send archive request
	for url_to_announce in ${external_url_list}; do
		archive_url "${url_to_announce}"
	done
done

In extract_external_urls, we fetch the HTML of each URL from the sitemap with curl again. grep -v "${SOURCE_URL}" ensures that we exclude all internal links. This grep command can be further extended to exclude external links that appear on every page or in a navigation area, as sketched after the function below.

extract_external_urls() {
	curl -s -f -L "${url_to_announce}" | grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"' | grep -v "${SOURCE_URL}" | sed 's/\"//g' | sort -u
}
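As a hypothetical example of such an extension, additional grep -v patterns can simply be chained into the pipeline. The domains below are placeholders for links that might sit in a footer or webring on every page; they are not part of the original script:

extract_external_urls() {
	curl -s -f -L "${url_to_announce}" \
		| grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"' \
		| grep -v "${SOURCE_URL}" \
		| grep -v -e 'webring.example' -e 'social.example' \
		| sed 's/\"//g' \
		| sort -u
}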

We can now track down all pages from the sitemap and also have a list of all external URLs included on each page. To submit a URL to the Wayback Machine, we only have to send a single request to their save endpoint.

announce_url() {
	curl -s -o /dev/null -w "%{http_code}" "${SUBMISSION_URL}""${url_to_announce}"
}

For the front page, this expands to the following expression:

announce_url() {
	curl -s -o /dev/null -w "%{http_code}" https://web.archive.org/save/https://fabulous.systems/
}
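If you want to see the behavior for yourself before wiring it into a script, a one-off check from the shell looks like this; example.org stands in for a page you actually want to save:

# Prints the HTTP status code returned by the save endpoint; 302 is the redirect
# web.archive.org performs after a successful save.
curl -s -o /dev/null -w "%{http_code}\n" "https://web.archive.org/save/https://example.org/"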

Finally, we need some (now annotated) logic to glue everything together.

archive_url() {
	# Initially, we start with a try_count of 0.
	# Since web.archive.org is often overloaded, we have to expect
	# multiple retries until we get a successful submission.
	# We have to avoid deadlocks, so we need to limit the number
	# of retries somehow.
	((try_count++))

	if [[ $(announce_url "${url_to_announce}") == '302' ]]; then
		# For checking the status of our submission, we rely on the
		# 302 redirection web.archive.org performs after it successfully
		# saved a website.
		#
		# In case we receive any other HTTP status code ...
		echo "Successfully archived \"${url_to_announce}\""
		try_count=0
	else
		# ... we need to do some work.
		echo "Retrying to archive \"${url_to_announce}\""

		# Give the archive.org API some time to cool down.
		# Sometimes, multiple subsequent calls lead to rate limiting
		# that we can mitigate by slowing down our requests.
		sleep 30

		# Retry announcing the same URL again until we reach the failure threshold.
		# We want to try until we reach our retry threshold of 5 attempts.
		# At the same time, we have to ensure that our current counter is
		# _NOT_ 0. In theory, this should never happen, but _if_ it does,
		# we are stuck in an endless loop.
		if [ "$try_count" -lt 5 ] && [ "$try_count" -ne 0 ]; then
			archive_url "${url_to_announce}"
		else
			# When reaching our threshold, we simply give up and continue
			# with the next URL.
			# We reset our counter and exit with an error code (which we don't check yet)
			echo "Giving up on url \"${url_to_announce}\""
			try_count=0
			return 1
		fi
	fi
}
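Since that return code is currently ignored, one possible (untested) extension is to check it at the call site in the main loop and collect failed URLs for a later retry. The file name failed-urls.txt is an arbitrary choice, not part of the original script:

# Hypothetical addition to the main loop: remember URLs that exhausted all retries.
if ! archive_url "${url_to_announce}"; then
	echo "${url_to_announce}" >> failed-urls.txt
fi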

With this script, I was able to update the copy of my website that is currently stored on archive.org. And thanks to some extensive testing, I discovered that archive.org has a limit of 5 submissions per URL and day — oops.
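A related tweak I have not added to the script above: because the same external link can show up on many pages, it currently gets submitted once per page, which eats into that per-URL limit. A rough sketch, assuming the functions from the script are already defined, would be to collect everything first and deduplicate before submitting:

# Collect internal and external URLs into a temporary file first ...
all_urls=$(mktemp)
grep loc public/sitemap.xml | grep -v legal | grep -v privacy | sed 's/    <loc>//g' | sed 's/<\/loc>//g' | while IFS= read -r url_to_announce; do
	echo "${url_to_announce}" >> "${all_urls}"
	extract_external_urls "${url_to_announce}" >> "${all_urls}"
done

# ... then submit every URL exactly once.
sort -u "${all_urls}" | while IFS= read -r url_to_announce; do
	archive_url "${url_to_announce}"
done
rm -f "${all_urls}"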

Do you have any comments or suggestions regarding this article? Feel free to join the discussion at our fabulous.community!