
You've probably heard of the Internet Archive and its "Wayback Machine".

 Search the history of over 946 billion web pages on the Internet.

The Wayback Machine crawls and archives over 946 billion web pages, resulting in 100 petabytes of data (1 petabyte is 1,000 terabytes). The site serves as a public history of the internet. It's free for anyone to access and backed by a non-profit organization.

Internet Archive is a non-profit library of millions of free texts, movies, software, music, websites, and more.

I'm a fan, and even my site made it into the archive. Here's stefanjudis.com from December 16, 2017. Those were the good old days when I started blogging.

Screenshot of stefanjudis.com showing a colorful site with a bold "Hey, I'm Stefan. I develop digital services and products." heading.

I love the Wayback Machine, and the best thing about it is that I didn't need to do anything to get a free online archive of my site.

Side note: I also run a weekly GitHub action that screenshots my site, so I own my data and don't have to rely on external services for my own archive.

How does the Internet Archive decide what to archive?

The Internet Archive crawls sites based on its own heuristics and the commonly used Alexa Internet data set.

Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.

My site seems to have enough traffic, backlinks or whatever to make it into the public archive. I'm happy about this, but not everybody celebrates being part of the public internet archive.

David discussed Deno and shared that deno.com isn't accessible on the Internet Archive.

[...] it's worth noting that deno.com is suspiciously absent from the Wayback Machine.

Wayback Machine statement: Sorry. This URL has been excluded from the Wayback Machine.

Hold on! You can get excluded from the Internet Archive?

How to exclude sites from the Wayback Machine

If you want to access a site's history, there are multiple reasons why it might not show up on the Internet Archive.

Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.

Technically inaccessible sites aren't stored (surprise surprise! 😅) and the Wayback Machine bot seems to respect robots.txt (that's great!). However, if you really want to exclude your site from the public archive, you need to contact the Internet Archive team and ask for your domain's exclusion. Real humans will then evaluate your request.
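If you're wondering what "respecting robots.txt" means in practice, here's a minimal sketch using Python's standard library. It only evaluates rules locally and doesn't talk to the Internet Archive. The `ia_archiver` user agent token is an assumption on my side (it's the one historically associated with these crawls), and the function name and both rule sets are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" is the user agent token historically associated
# with Internet Archive crawls. Check the current docs before relying on it.
ARCHIVE_USER_AGENT = "ia_archiver"


def allows_archiving(robots_txt: str, url: str,
                     user_agent: str = ARCHIVE_USER_AGENT) -> bool:
    """Return True if the given robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule set that blocks only the archive crawler.
BLOCKING_RULES = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical rule set that blocks one directory for every crawler.
OPEN_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(allows_archiving(BLOCKING_RULES, "https://example.com/"))       # False
    print(allows_archiving(OPEN_RULES, "https://example.com/blog/"))      # True
    print(allows_archiving(OPEN_RULES, "https://example.com/private/x"))  # False
```

Keep in mind that robots.txt is a polite request to crawlers, not an exclusion from the archive. That's why the email to the Internet Archive team is still needed.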

There's a public list of sites that are excluded from being crawled by the Internet Archive if you're curious who's making the effort to talk to the Internet Archive folks.

3800 site owners explicitly opted out of being part of the Internet Archive. I'm unsure about this list's ownership, but it seems to be maintained and some test crawls for the excluded sites confirmed that they're indeed unavailable on the Wayback Machine.
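If you want to run a similar check yourself, here's a rough sketch built on the Wayback Machine's availability endpoint (`https://archive.org/wayback/available`). The helper names are mine, and so is the assumption that a missing snapshot is a useful signal: an empty result can't tell you whether a site was excluded or simply never crawled, so treat it as a hint only.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(site: str) -> str:
    """Build the availability lookup URL for a domain or full URL."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': site})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the snapshot URL from an availability response, or None.

    The response is expected to look like
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}
    and to carry an empty "archived_snapshots" object when nothing is found.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(site: str) -> Optional[str]:
    """Ask the Wayback Machine for the closest snapshot (needs network access)."""
    with urlopen(build_query(site), timeout=10) as response:
        return closest_snapshot(json.load(response))


if __name__ == "__main__":
    for site in ("stefanjudis.com", "deno.com"):
        print(site, "->", check(site))
```

The parsing lives in its own function so you can try it with a saved response without hitting the API.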

The idea of removing my own sites from the Internet Archive puzzles me because the things I put online are meant to be public. It's no hot news that the internet doesn't forget; if you publish online you must expect that your public and freely accessible content stays on.

Opting out of the Internet Archive might make it harder to access your site's history, but if you really mess up, people will screenshot and save things. Really deleting things from the internet seems unlikely.

I scanned the list and, thanks to the :visited CSS pseudo-class (which, by the way, doesn't work the way it used to in Chrome), I could discover sites I've previously visited that opted out of the Internet Archive.

  • https://www.app.com/ (news)
  • https://www.brita.de/ (product)
  • https://danluu.com/ (personal blog)
  • https://deno.com/ (product)
  • https://gmail.com/ (product)
  • https://incogni.com/ (product)
  • https://miro.com/ (product)
  • https://www.robinsloan.com/ (personal blog)
  • https://www.thescore.com/ (news)

So, there are more sites opting out!
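The :visited trick only works in a browser. As a different route to the same result, here's a small sketch that intersects a list of visited URLs with the exclusion list. It assumes both are available as plain strings (one URL or domain per entry), which may not match how the public list or your browser history is actually stored. The function names are made up.

```python
from urllib.parse import urlsplit


def normalize_host(value: str) -> str:
    """Reduce a URL or bare domain to a lowercase host without a leading 'www.'."""
    value = value.strip().lower()
    # urlsplit only fills in the host when a '//' separator is present.
    host = urlsplit(value if "//" in value else f"//{value}").hostname or ""
    return host[4:] if host.startswith("www.") else host


def visited_and_excluded(visited, excluded):
    """Return the sorted hosts that appear in both lists."""
    excluded_hosts = {normalize_host(entry) for entry in excluded}
    visited_hosts = {normalize_host(entry) for entry in visited}
    # Drop the empty string that blank or unparsable entries normalize to.
    return sorted((visited_hosts & excluded_hosts) - {""})


if __name__ == "__main__":
    visited = ["https://www.brita.de/", "https://example.com/", "https://deno.com/blog"]
    excluded = ["brita.de", "deno.com", "miro.com"]
    print(visited_and_excluded(visited, excluded))  # ['brita.de', 'deno.com']
```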

Should the Internet Archive be opt-in instead?

When I discussed this topic with my partner, she turned things around and replied, "Isn't it rude that there's a service archiving your stuff without asking you first?" That's an interesting take. Is the Internet Archive guilty of rude behavior?

Whenever you publish something online, you must be aware that people can access whatever you put out. And when people can access your content, they can screenshot or save it. Is it rude when people do these things? I don't think so. It's a given fact of the great invention we call the internet.

Is it rude when someone monitors and archives my site over time? Idk, maybe? If it's a single person doing it, it would be very strange, but it's also just the obvious consequence of publishing online. I can't prevent people from doing random things with my public and freely accessible stuff.

The Internet Archive isn't your average internet stranger, though. I want to think of the folks as the good ones. They're transparent, non-profit and give credit. They're archiving the entire internet and that's a good thing.

Turning things around and making the archive opt-in would defeat the idea behind the project. It just wouldn't be "the Internet Archive" but "the archive of people who decided to opt in". I see way more benefits of archiving public information than harm as long as everyone has a way to opt out.

That said, though, I noticed that it bothers me when companies opt out of the Internet Archive.

Companies without public history make me slightly uncomfortable

I've been working at startups and young companies for the last ten years. It's challenging because you have to build a good product. Then, you have to evaluate if and how much people would pay for it. There are a thousand other things to do just to survive and the journey is incredibly tough. That's why every company will mess up eventually.

Important rules to follow as a company are to build a great product, establish trust with your customers and be transparent. Opting out of the Internet Archive breaks two of these rules. It doesn't feel very trustworthy and isn't transparent. What's there to hide?

  • Is it this one pricing change that made people mad but needed to be done to survive?
  • Is it this one promised product feature that wasn't delivered because of other priorities?
  • Is it this one opinion blog post that doesn't hold anymore?

Mistakes will happen regardless of whether the Internet Archive has a copy of a company site or not. People will have screenshots, videos and copies. Customers will be angry and curse the company on social media. And all this is expected because building a company means building a plane while flying it. Nobody said it'd be easy.

Covering one's tracks by opting out of the Internet Archive won't help to avoid critical situations, because they can't be avoided. But hiding internet history will change how people perceive a company because it feels like there's something to hide.

The only way to build trust is to own the fuck-ups and move forward.

Am I missing something here? If so, let me know, I'd love to extend this post with more ideas.


About Stefan Judis

Frontend nerd with over ten years of experience, freelance dev, "Today I Learned" blogger, conference speaker, and Open Source maintainer.
