The Internet Archive opt out itch
You've probably heard of the Internet Archive and its "Wayback Machine".
The Wayback Machine has crawled and archived over 946 billion web pages, resulting in 100 petabytes of data (1 petabyte is 1,000 terabytes). The site serves as a public history of the internet. It's free for anyone to access and backed by a non-profit organization.
Internet Archive is a non-profit library of millions of free texts, movies, software, music, websites, and more.
I'm a fan, and even my site made it into the archive. Here's stefanjudis from December 16, 2017. Those were the good old days when I started blogging.
I love the Wayback Machine and the best thing about it is that I didn't need to do anything for my site's free online archive.
Side note: I also run a weekly GitHub action that screenshots my site, so I own my data and don't have to rely on external services for my own archive.
How does the Internet Archive decide what to archive?
The Internet Archive crawls sites based on its own heuristics and the commonly used Alexa Internet data set.
Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.
My site seems to have enough traffic, backlinks or whatever to make it into the public archive. I'm happy about this, but not everybody celebrates being part of the public internet archive.
David discussed Deno and shared that deno.com isn't accessible on the Internet Archive.
[...] it's worth noting that deno.com is suspiciously absent from the Wayback Machine.
Hold on! You can get excluded from the Internet Archive?
If you want to access a site's history, there are multiple reasons why it might not show up on the Internet Archive.
Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.
Technically inaccessible sites aren't stored (surprise surprise!) and the Wayback Machine bot seems to respect a robots.txt (that's great!). However, if you really want to exclude your site from the public archive, you need to contact the Internet Archive team and ask for your domain's exclusion. Real humans will then evaluate your request.
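To make the robots.txt part more concrete, here's a minimal sketch of how a crawler could decide whether a path is off limits. It's deliberately simplified (prefix matching only, no wildcards) and not how the Wayback Machine actually works. The `ia_archiver` user agent in the example is an assumption on my side, and `rulesFor` and `isBlocked` are hypothetical helper names.

```typescript
type Rule = { allow: boolean; path: string };

// Collect the Allow/Disallow rules of the group that applies to `userAgent`,
// falling back to the wildcard group ("*") when there is no specific group.
function rulesFor(robotsTxt: string, userAgent: string): Rule[] {
  const groups = new Map<string, Rule[]>();
  let agents: string[] = [];
  let readingAgents = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (!readingAgents) agents = []; // a new group starts here
      readingAgents = true;
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!groups.has(agent)) groups.set(agent, []);
    } else if (field === "allow" || field === "disallow") {
      readingAgents = false;
      for (const agent of agents) {
        groups.get(agent)!.push({ allow: field === "allow", path: value });
      }
    }
  }
  return groups.get(userAgent.toLowerCase()) ?? groups.get("*") ?? [];
}

// The longest matching rule wins; on a tie, Allow wins.
// An empty Disallow value matches nothing, so it blocks nothing.
function isBlocked(robotsTxt: string, userAgent: string, path: string): boolean {
  let best: Rule | null = null;
  for (const rule of rulesFor(robotsTxt, userAgent)) {
    if (rule.path === "" || !path.startsWith(rule.path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best !== null && !best.allow;
}

// Hypothetical robots.txt: block one bot entirely, everyone else only from /private/.
const exampleRobotsTxt = [
  "User-agent: ia_archiver # assumed bot name",
  "Disallow: /",
  "",
  "User-agent: *",
  "Disallow: /private/",
].join("\n");
```

With this example file, `isBlocked(exampleRobotsTxt, "ia_archiver", "/blog/")` is `true`, while any other bot is only kept out of `/private/`. Keep in mind that a robots.txt is a polite request; the formal exclusion still goes through the Internet Archive team.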
There's a public list of sites that are excluded from being crawled by the Internet Archive if you're curious who's making the effort to talk to the Internet Archive folks.
3800 site owners explicitly opted out of being part of the Internet Archive. I'm unsure about this list's ownership, but it seems to be maintained and some test crawls for the excluded sites confirmed that they're indeed unavailable on the Wayback Machine.
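If you want to run such a check yourself, here's a rough sketch built on the Wayback Machine's availability endpoint (`https://archive.org/wayback/available?url=...`). The helper names and the reduced response type are my own, and a missing snapshot doesn't prove an opt-out: the site might simply never have been crawled.

```typescript
// Only the response fields this sketch relies on.
interface AvailabilityResponse {
  archived_snapshots: {
    closest?: { available: boolean; url: string; timestamp: string; status: string };
  };
}

// Build the lookup URL for a site such as "deno.com".
function availabilityUrl(site: string): string {
  return `https://archive.org/wayback/available?url=${encodeURIComponent(site)}`;
}

// Return the URL of the closest archived snapshot, or null if there is none.
function closestSnapshot(response: AvailabilityResponse): string | null {
  const closest = response.archived_snapshots?.closest;
  return closest && closest.available ? closest.url : null;
}

// `fetchJson` is injected so the sketch doesn't depend on a specific runtime.
async function checkSite(
  site: string,
  fetchJson: (url: string) => Promise<unknown>,
): Promise<string | null> {
  const body = (await fetchJson(availabilityUrl(site))) as AvailabilityResponse;
  return closestSnapshot(body);
}
```

In a runtime that ships `fetch`, you could pass `(url) => fetch(url).then((res) => res.json())` as `fetchJson` and loop over the domains you're curious about.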
The idea of removing my own sites from the Internet Archive puzzles me because the things I put online are meant to be public. It's no hot news that the internet doesn't forget; if you publish online you must expect that your public and freely accessible content stays on.
Opting out of the Internet Archive might make it harder to access your site's history, but if you really mess up, people will screenshot and save things. Really deleting things from the internet seems unlikely.
I scanned the list and, thanks to the :visited CSS pseudo-class (which, by the way, doesn't work how it used to in Chrome), I could discover sites I've previously visited that opted out of the Internet Archive.
- https://www.app.com/ (news)
- https://www.brita.de/ (product)
- https://danluu.com/ (personal blog)
- https://deno.com/ (product)
- https://gmail.com/ (product)
- https://incogni.com/ (product)
- https://miro.com/ (product)
- https://www.robinsloan.com/ (personal blog)
- https://www.thescore.com/ (news)
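The :visited scan can be sketched roughly like this: render the excluded domains as a plain list of links and let the browser color the ones in your history. `renderLinkPage` and `escapeHtml` are hypothetical helper names, and whether a link shows up as visited depends on your browser's :visited rules.

```typescript
// Escape text so it's safe inside HTML content and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a standalone HTML page that lists every URL as a link.
// Browsers only allow a few color-related properties on :visited,
// so the visited state is marked with `color` alone.
function renderLinkPage(urls: string[]): string {
  const items = urls
    .map((url) => `    <li><a href="${escapeHtml(url)}">${escapeHtml(url)}</a></li>`)
    .join("\n");

  return [
    "<!doctype html>",
    "<title>Excluded sites</title>",
    "<style>",
    "  a:link { color: gray; }",
    "  a:visited { color: crimson; }",
    "</style>",
    "<ul>",
    items,
    "</ul>",
  ].join("\n");
}
```

Save the returned string as an HTML file, open it in your browser and skim for the highlighted links. Scripts can't reliably read the visited state, so this stays a manual, visual check.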
So, there are more sites opting out!
When I discussed this topic with my partner she turned things around and replied "Isn't it rude that there's a service archiving your stuff without asking you first?". That's an interesting take. Is the Internet Archive guilty of rude behavior?
Whenever you publish something online, you must be aware that people can access whatever you put out. And when people can access your content, they can screenshot or save it. Is it rude when people do these things? I don't think so. It's a given fact of the great invention we call the internet.
Is it rude when someone monitors and archives my site over time? Idk, maybe? If it's a single person doing it, it would be very strange, but it's also just the obvious consequence of publishing online. I can't prevent people from doing random things with my public and freely accessible stuff.
The Internet Archive isn't your average internet stranger, though. I want to think of the folks as the good ones. They're transparent, non-profit and give credit. They're archiving the entire internet and that's a good thing.
Turning things around and making the archive opt-in would defeat the idea behind the project. It just wouldn't be "the Internet Archive" but "the archive of people who decided to opt in". I see way more benefits of archiving public information than harm as long as everyone has a way to opt out.
That said, I noticed that it bothers me when companies opt out of the Internet Archive.
I've been working at startups and young companies for the last ten years. It's challenging because you have to build a good product. Then, you have to evaluate if and how much people would pay for it. There are a thousand other things to do just to survive and the journey is incredibly tough. That's why every company will mess up eventually.
Important rules to follow as a company are to build a great product, establish trust with your customers and be transparent. Opting out of the Internet Archive breaks two of these rules. It doesn't feel very trustworthy and isn't transparent. What's there to hide?
- Is it this one pricing change that made people mad but needed to be done to survive?
- Is it this one promised product feature that wasn't delivered because of other priorities?
- Is it this one opinion blog post that doesn't hold anymore?
Mistakes will happen regardless of whether the Internet Archive has a copy of a company site or not. People will have screenshots, videos and copies. Customers will be angry and curse the company on social media. And all this is expected because building a company means building a plane while flying it. Nobody said it'd be easy.
Covering one's tracks by opting out of the Internet Archive won't help to avoid critical situations, because they can't be avoided. But hiding internet history will change how people perceive a company because it feels like there's something to hide.
The only way to build trust is to own the fuck-ups and move forward.
Am I missing something here? If so, let me know, I'd love to extend this post with more ideas.