I recently began submitting archive material to archive.org for the Brian Lehrer Show on WNYC. I'm a huge fan of the show and what it provides to its listeners. Additionally, we live in an age of destruction of truth and public records. As such, I thought it would be great to participate in the archival of this public service.
For some time I've been recording the Brian Lehrer Show and running software to handle transcription. I've got about one year's worth of transcripts for the show on my server. It was only recently that I decided to start using archive.org to save more permanent versions of this data.
I had gone back and forth for some time about the storage mechanism I would use for this project. My first thought was AWS S3, which is a cheap and reliable service. However, I don't believe that AWS will always have its users' best interests in mind, and mostly out of protest I am refusing to store my data there. Also, if I were ever to stop paying my AWS bill, my data would disappear relatively soon thereafter.
I had also considered a more homegrown approach: using my own network-attached storage (NAS) device. But there's an onboarding cost with this solution that I wasn't ready to commit to. A colleague of mine suggested that I could get away with a Raspberry Pi and some hard drives connected to it as a sort of roll-your-own NAS. I may yet pursue one of these approaches for redundancy.
Another colleague of mine then suggested that I could use archive.org for this project. While I had heard of the project, I had never really considered it an option. In fact, it aligns almost perfectly with my needs: an archival project for things of interest to the general public.
archive.org provides an API for programmatically uploading files. It also supports a range of metadata, which gives color to the things that you are archiving and adds bits of functionality to your uploads within the archive.org site. Python is the tool of choice for this API, as the archive.org project has an officially supported library and, optionally, a command-line tool. More or less you can follow the guides, and after creating an account and obtaining an access token, you're free to start uploading your content.
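To give a rough idea of what an upload looks like, here is a minimal sketch using the `internetarchive` Python library's `upload` function. The identifier, file names, and metadata values are made up for illustration and are not the ones I actually use, and the sketch assumes credentials have already been set up (for example with `ia configure`).

```python
from datetime import date


def build_metadata(air_date: date) -> dict:
    """Build a metadata dict for one episode. Field values are illustrative."""
    return {
        "title": f"The Brian Lehrer Show, {air_date.isoformat()}",
        "mediatype": "audio",
        "date": air_date.isoformat(),
        "creator": "WNYC",
    }


def upload_episode(identifier: str, paths: list[str], air_date: date) -> None:
    """Upload the given files to a single archive.org item."""
    # Imported here so build_metadata works without the library installed.
    from internetarchive import upload

    responses = upload(identifier, files=paths, metadata=build_metadata(air_date))
    for response in responses:
        # Fail loudly if archive.org rejected any of the files.
        response.raise_for_status()


if __name__ == "__main__":
    # Hypothetical identifier and file names.
    upload_episode(
        "example-brian-lehrer-2024-01-02",
        ["episode.mp3", "episode.txt"],
        date(2024, 1, 2),
    )
```

Keeping the metadata in its own small function makes it easy to check what will be sent before anything touches the network.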
In my experience, uploads through the API take a while. I don't think that's a problem, and I assume the time is well justified: there is presumably an array of security scanning, validation, and processing that has to occur.
Initially I had saved the recordings as MPEG files when I uploaded them, but I later realized that MP3 files are better supported. archive.org provides a media player for audio and video files, which can also display synchronized lyrics (think karaoke). This was a particularly nice feature for my project: because I have transcripts for each episode of the show, I could upload those and get a nice player experience.
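For the player to sync text with audio, the transcript needs timestamps. As a sketch of what that preparation might look like, the snippet below turns timestamped transcript segments into a WebVTT caption file. WebVTT is an assumption on my part here, as is the `(start, end, text)` segment shape; check which caption formats the archive.org player accepts for your media type before relying on this.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # A blank line ends each cue.
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical segments, as a transcription tool might produce them.
    example = [
        (0.0, 2.5, "It's the Brian Lehrer Show on WNYC."),
        (2.5, 5.0, "Good morning, everyone."),
    ]
    print(to_webvtt(example))
```

The resulting text can be written to a file and uploaded alongside the audio in the same item.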
archive.org provides some other nifty features, like the ability to download archived content via torrent. This is probably a way for them to reduce the load on their own servers. I wonder how broad and well provisioned their torrenting network is; perhaps that's an avenue for more research and contribution in the future.