The Caboteria / Tech Web / FightEntropy (revision 25)
Information that's stored in computers "rusts" at least as fast as information stored in real-world media. I'm reminded of this every time I walk through the cemetery down the street from my house: many headstones manufactured in the 17th century still convey the information that they did when they were new. On the other hand, when my Mom "upgraded" to Windows 95 a few years back, she found that all of the documents she had written in her low-end Microsoft word processor were completely unreadable in the new version of the high-end office suite, even though only 5 years had passed. She was lucky to be computer-naive: she lost nothing, since she had printed all of the documents on paper. If she had used the computer the way that the manufacturers wanted her to, either she would have lost the documents or I would have spent a lot of time scraping the data out of them.

What can you do? To start with, be aware of information entropy, and decide whether or not you care about it. If you don't care then you're wasting your time reading this document; if you do then I hope that you'll learn something from it.

In my experience with computers, data becomes inaccessible for two reasons: the physical media that it's stored on becomes unreadable (e.g. http://www.informationweek.com/story/IWK20010719S0003), or the format that the data is stored in becomes indecipherable. While both reasons produce the same result, they call for different preventive actions.

Media I've used:

Media I haven't used, but know about:

Media can become unreadable if the media itself fails (magnets, scratches, click of death) or if the reader breaks and can't be fixed. The media and the machines that read it are physical devices, and they age and degrade over time. A failure in either the media or the reader is enough to lose the data, but the scale differs: if one floppy fails you lose only the data on that floppy, but if your floppy reader fails then you lose access to all of the data on all of your media of that type. You can potentially get it back by finding someone else who has that type of reader and borrowing it from her, but you don't want to count on that.

Data can become unreadable, even if the media is readable, if there is no software that can decipher the format that the data was stored in. Think of storing data as encrypting it with an encryption algorithm. Some encryption algorithms are stronger (i.e. harder to decipher) than others, and some algorithms are more widely understood than others. The software that you use to read and write the data is the encryption key. If you understand the "encryption algorithm" that you're using to write the data then you will not have to worry about deciphering it, but if you don't understand the algorithm then you are dependent on that software to read it for you. My Mom was dependent on software that could read data encrypted in a certain Microsoft algorithm that neither she nor I understood, and over time Microsoft themselves "lost the key" to that algorithm, so her data was permanently locked up and Mom didn't have the key.
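To make the analogy concrete, here is a minimal Python sketch. The record layout (`RECORD_FORMAT`), the function names, and the sample values are all invented for illustration; they don't correspond to any real word processor's format. The point is only that the bytes on disk are opaque unless you (or some surviving program) know the layout.

```python
import struct

# Hypothetical record layout, invented for this example: a little-endian
# unsigned 32-bit id, a 16-byte NUL-padded title, and a 16-bit year.
RECORD_FORMAT = "<I16sH"

def write_record(doc_id, title, year):
    """Pack one record into bytes, as an application might when saving."""
    return struct.pack(RECORD_FORMAT, doc_id, title.encode("ascii"), year)

def read_record(blob):
    """Unpack a record. This only works if you know RECORD_FORMAT."""
    doc_id, raw_title, year = struct.unpack(RECORD_FORMAT, blob)
    return doc_id, raw_title.rstrip(b"\x00").decode("ascii"), year

blob = write_record(7, "Letter to Mom", 1995)
print(blob)               # raw bytes: hard to interpret without the layout
print(read_record(blob))  # readable again once the layout is known
```

If `RECORD_FORMAT` were never written down anywhere, recovering the fields would mean guessing field widths and byte order from sample files, which is exactly the reverse-engineering chore that an undocumented format imposes.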

This is the most important reason (among many) why you should never store any data in a format that's not well documented. If the only person who understands the format is the person or company that produced it, they can decide at any time that they don't want to support it anymore and you have very little recourse. You could try to figure out how the format works by looking at your documents, but the process (known as "reverse-engineering") is time-consuming and boring and may be illegal in some cases. Note that the key distinction is not whether the format is proprietary or non-proprietary; it's whether the designer of the format has provided enough documentation of it that other people can read and write it. Some proprietary formats, for example Adobe Portable Document Format, are very well documented, so many tools can read and write them.

The most important set of undocumented proprietary data formats is the Microsoft Office formats for documents, spreadsheets, presentations, etc. These formats are important because so much data is encoded in them every day, but they are not documented, and they change frequently. Many people spend many man-years reverse-engineering them, but this effort is frustrated when a new version of the formats appears and the reverse-engineering process must start from scratch. And no, the XML-based Office 2003 formats are not any less proprietary than the previous binary ones. For one thing, they're still undocumented; for another, they're probably patented, so even if you were capable of figuring out how to read them it would be against the law for you to do so.

Notes

In the summer of 2005 the Commonwealth of Massachusetts decided to use the OpenDocument document format instead of Microsoft's proprietary formats. One of their key issues was the ability to read documents for a very long time. Here's an analysis of that decision: http://www.dwheeler.com/essays/why-opendocument-won.html

Dublin Core Metadata Initiative (http://dublincore.org/) offers standards for encoding many different types of data, for example http://dublincore.org/documents/dcmi-terms/.

Things that can be stored in standard formats:

Text wins over proprietary formats (see IETF, Project Gutenberg).

Documented proprietary formats win over undocumented formats (e.g. RTF over DOC).

Lossless wins over lossy (e.g. FLAC over MP3).
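A small Python sketch of why lossless matters for archives. It uses `zlib` as a stand-in for a lossless codec and a crude bit-dropping step as a stand-in for a lossy one; neither is how FLAC or MP3 actually works, and the "samples" are made-up bytes.

```python
import zlib

# Stand-in for uncompressed audio samples (assumption: one byte per sample).
original = bytes(range(256)) * 4

# Lossless: decompressing returns exactly the bytes that went in.
restored = zlib.decompress(zlib.compress(original))

# Lossy: keep only the top 4 bits of each sample. This crude quantization
# is only an illustration; it is not how MP3 works.
quantized = bytes(sample & 0xF0 for sample in original)

print(restored == original)    # True
print(quantized == original)   # False
print(len(set(original)), len(set(quantized)))  # 256 16
```

Once the low bits are gone there is no way to get them back, so anything you later convert from the lossy copy inherits the loss.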

Backup vs. Archive: short-term vs long-term, bulk data vs document-oriented.

Allow users to export from your program. Provide a means to dump data from your internal format to some standard format.
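As a rough sketch of what such an export might look like, assume a program that keeps its records as a list of dictionaries. The `notes` data, the field names, and the two function names are hypothetical; only the `json`, `csv`, and `io` calls are real standard-library APIs.

```python
import csv
import io
import json

# Hypothetical internal representation: a list of dicts.
notes = [
    {"id": 1, "title": "Groceries", "body": "eggs, milk"},
    {"id": 2, "title": "Headstones", "body": "still legible, 300 years on"},
]

def export_json(records):
    """Dump the records as indented JSON text."""
    return json.dumps(records, indent=2, sort_keys=True)

def export_csv(records, fields=("id", "title", "body")):
    """Dump the records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

print(export_json(notes))
print(export_csv(notes))
```

Both outputs are plain text in widely documented formats, so a user can still get at the data with other tools if the program itself stops working.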

http://slashdot.org/article.pl?sid=02/03/03/1821227&tid=126 - BBC digitizes old book, 15 years later the digital version is useless but the 1000-year-old book can still be read.

http://computerworld.co.nz/webhome.nsf/NL/A7D9D35CE6CC6DE3CC256D5F00728810

http://www.ecommercetimes.com/perl/story/31436.html

wrjpgcom is a tool to write data to the comment field of a jpeg image file. Need to find the source.
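wrjpgcom ships with the Independent JPEG Group's libjpeg distribution, alongside its companion rdjpgcom. The Python sketch below is not that tool's implementation; it is a simplified illustration of the same idea, assuming a well-formed file whose header segments all carry a length field. The function names are my own. It adds a COM (comment) segment after any leading APPn segments and reads comments back.

```python
import struct

SOI = b"\xff\xd8"  # start-of-image marker
COM = b"\xff\xfe"  # comment marker

def add_jpeg_comment(jpeg, comment):
    """Return a copy of the JPEG bytes with one COM segment added."""
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG: missing SOI marker")
    payload = comment.encode("utf-8")
    if len(payload) > 65533:  # 16-bit length field counts its own 2 bytes
        raise ValueError("comment too long for one COM segment")
    pos = 2
    # Skip leading APP0..APP15 segments (markers 0xFFE0..0xFFEF), such as
    # JFIF or Exif headers, so that they stay at the front of the file.
    while (pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF
           and 0xE0 <= jpeg[pos + 1] <= 0xEF):
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        pos += 2 + length
    segment = COM + struct.pack(">H", len(payload) + 2) + payload
    return jpeg[:pos] + segment + jpeg[pos:]

def read_jpeg_comments(jpeg):
    """Return the COM payloads found before the start-of-scan marker."""
    comments = []
    pos = 2
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data follows
            break
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker == 0xFE:
            comments.append(jpeg[pos + 4:pos + 2 + length])
        pos += 2 + length
    return comments
```

To use it, read a file in binary mode, pass the bytes to `add_jpeg_comment`, and write the result to a new file rather than overwriting your only copy.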

http://www.itl.nist.gov/div895/carefordisc/index.html - The US feds help keep your data safe.

http://www.ietf.org/internet-drafts/draft-ietf-geopriv-dhcp-civil-04.txt - an IETF draft for "civic location," also has some good references

http://www.theregister.co.uk/2005/02/21/forgetting_digital_memories/ - Digital memories: we can forget them for you wholesale!

http://photoshopnews.com/?p=226 - this issue affects even cameras. Here's a story about an expensive camera that uses a proprietary format that can only be read by that vendor's software.

http://lwn.net/Articles/240528/ - a link to an article on this topic by Jeremy Allison of the Samba team. Some of the comments are interesting, too.

http://arstechnica.com/science/news/2010/11/preserving-science-how-data-gets-lost.ars - an article about this topic in the context of scientific research

http://blog.longnow.org/02014/02/24/iceisee-3-to-return-to-an-earth-no-longer-capable-of-speaking-to-it/ - a sad tale of a perfectly functional satellite having to be mothballed because we can no longer communicate with it.

Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.