This is a series of articles on digital archiving in which I will try to create an archiving system that is up to date in its approach. At the moment I do not know what that will look like, but based on research and, hopefully, feedback, I hope to arrive at a working solution.
At this stage I do not profess to be an expert in archiving, but I do have some industry experience: I worked for a short while at an archiving company alongside a number of people who were real experts in archiving, and hopefully some of their knowledge will have rubbed off on me.
I have no intention of plagiarising any of their approaches to archiving, but it is inevitable that there will be some similarities.
I have always been fascinated by history, and in particular I love visiting museums, where we find history preserved and catalogued in a way that will make it available for our children and their children to enjoy and understand. We need to do the same for our digital history.
I looked for a good definition of the noun archive and the best one I found was on www.dictionary.com:
"Usually, archives. documents or records relating to the activities, business dealings, etc., of a person, family, corporation, association, community, or nation."
"archives, a place where public records or other historical documents are kept."
"any extensive record or collection of data: The encyclopedia is an archive of world history. The experience was sealed in the archive of her memory."
"a long-term storage device, as a disk or magnetic tape, or a computer directory or folder that contains copies of files for backup or future reference"
"a collection of digital data stored in this way."
"a computer file containing one or more compressed files."
"a collection of information permanently stored on the Internet"
Nothing surprising here, but I think it is worth expanding a bit more, as there are other requirements and consequences of the digital archiving process no matter what you are archiving.
The definition above associates the term archive with both a collection of data and a single file (such as a zip or tar file). In this discussion I am primarily interested in collections of data.
We generally want to preserve the items we archive in a way that ensures they will not deteriorate while in the archive.
For example, paper documents and photographs may be kept in a fireproof vault, multiple copies may be kept in separate places, and using today’s technology they may also be digitally imaged. Without these preservation methods we can never be sure that the archived documents will be available in the future.
Digital data also has to be preserved. Apart from keeping it on reliable storage in a safe and secure environment and making copies, we can also do other things, such as taking a digital fingerprint (a checksum or cryptographic hash) of each file so that we can later verify it is unchanged from the original.
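As a rough illustration of the fingerprint idea, here is a minimal Python sketch. It assumes SHA-256 as the fingerprint, which is my choice for the example rather than a requirement, and the function names `fingerprint` and `is_unchanged` are made up for illustration.

```python
import hashlib


def fingerprint(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archive files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_fingerprint):
    """True if the file still matches the fingerprint taken at archive time."""
    return fingerprint(path) == recorded_fingerprint
```

The fingerprint would be recorded when the file goes into the archive and checked again whenever the file is read back or copied to new storage. A mismatch means the file is no longer as original.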
One other thing that affects all archives, paper or digital, is whether we will be able to read them in the future. Think of ancient Egyptian documents: it took many years before historians could read them. In the same way, how can you be sure that the file formats used today for, say, your word-processing documents will still be understood by a word processor in fifty years' time? Remember WordStar? MultiMate? No? Well, you would probably struggle to get a reader for these now.
I googled a bit on MultiMate conversion and found the following, from December 2002:
"There are indeed converters that can rescue data from many ancient word-processing file formats, including Multimate. Most of the converters are commercial products, though Microsoft apparently did release a Multimate converter for Word 6.0, as a downloadable update file. Earlier versions of WordPerfect are also said to have had support for Mulimate files."
I checked some of the products they mentioned and could only find one of them, and that one said it was for Windows 95.
Of course, if you are archiving data for many years then there is always a chance that someone will gain access to it when not intended to, so it is important that personal data, or data that could be misused by another party, is protected. The normal way to do this is to encrypt the data; however, this raises a number of issues. Firstly, if you use encryption keys then you will need to store them separately from the data and ensure that they are still available in the future. If the keys are lost then the data is pretty useless.
If the data does not need to be secure then simply do not encrypt it.
If there are identifiable parts of the data that do need encryption then simply encrypt those parts and leave the rest unencrypted.
For example, if you archive the medical history of Joe Blogs then the only identifiable part of the core medical data should be his Medical ID, so encrypt only that. In later years, when you want to look at all of the records of people who had specific symptoms, you can do so without any decryption and without any identifiable link to the individual. Designing your data archive in this way can minimise the encryption required and so help ensure its accessibility in the future.
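To make the Medical ID example a little more concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the field names `medical_id` and `symptoms`, the function names, and the `encrypt` parameter, which stands for whatever cipher you actually choose. Python's standard library does not include a symmetric cipher, so the sketch takes the cipher as a function; in practice that could be something like the `encrypt` method of a `Fernet` object from the third-party `cryptography` package.

```python
def protect_record(record, encrypt, sensitive_fields=("medical_id",)):
    """Return a copy of `record` with only the sensitive fields encrypted.

    `encrypt` is any function mapping bytes to bytes (hypothetical here;
    supply a real cipher). All other fields are left readable.
    """
    protected = dict(record)
    for field in sensitive_fields:
        if field in protected:
            plaintext = str(protected[field]).encode("utf-8")
            # Store the ciphertext as hex text so it survives JSON, CSV, etc.
            protected[field] = encrypt(plaintext).hex()
    return protected


def find_by_symptom(records, symptom):
    """Search the unencrypted part of protected records; no key is needed."""
    return [r for r in records if symptom in r.get("symptoms", [])]
```

Because only the identifying field is encrypted, a search such as `find_by_symptom(archive, "fever")` works even if the key is unavailable, and losing the key costs you the link to the individual rather than the whole record.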
If you go into a museum and look at an exhibit, you will look to the label on the front of the cabinet, which tells you more about the item: when and where it was discovered, what part of history it relates to, what it was made of. This data is invaluable in understanding the exhibit.
Archived data also often requires additional information, known as metadata. For example, the metadata could record the file type and version, or the location of any encryption keys used. For digital images you may store Pantone information.
The metadata could also include search keywords to make it easier to identify the data you want to select from the archive.
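As a sketch of what such a metadata record might look like, here is one possible shape in Python. The field names (`file_type`, `format_version`, `key_location`, `keywords`) and the choice of JSON for storing it are my assumptions for illustration, not a standard.

```python
import json


def build_metadata(file_type, format_version, keywords, key_location=None):
    """Build a metadata record to store alongside an archived item."""
    metadata = {
        "file_type": file_type,
        "format_version": format_version,
        # Normalise keywords so later searches are case-insensitive.
        "keywords": sorted({k.lower() for k in keywords}),
    }
    if key_location is not None:
        # Where to find the encryption key, never the key itself.
        metadata["key_location"] = key_location
    return metadata


def matches(metadata, keyword):
    """True if the record was tagged with `keyword`."""
    return keyword.lower() in metadata["keywords"]


def to_json(metadata):
    """Serialise the record as plain text, readable without special tools."""
    return json.dumps(metadata, indent=2, sort_keys=True)
```

Keeping the record as plain text means the label on the cabinet should stay readable even if the format of the item inside does not.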
Time is one of the biggest issues to affect archives. Over time companies will be bought and sold, they will cease to trade, buildings will catch fire, wars will happen, aliens will invade, all of which will reduce the chances of ever being able to reuse the archived data. Would it not have been great if we still had archives of the building and use of Stonehenge, the Pyramids and other ancient monuments, many of which we still do not understand?
Some data in an archive has a short lifespan: tax records only need to be kept for so many years, and companies that go into liquidation are probably not interested in persisting their customer transactions. But medical researchers, weather scientists and film archivists will all want their data to be archived and accessible indefinitely.
Different data will therefore require different archiving approaches, so it is not a case of one size fits all.
In the next part I will start to look at building a modern archiving system and in particular look at storage solutions for digital archives.