In part 1 and part 2 I discussed some archiving principles. Now I want to look at my specific requirements.
No matter where the data is stored, it should be easy to retrieve and easy for the relevant applications to use.
Data should be kept securely, with access allowed only to authorized users. It may be encrypted, although encryption should be avoided where possible and used only for super secret files.
Data should be compressed where it makes sense: in transit for performance, and at the final destination to reduce storage cost.
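One way to read "when it makes sense" is to keep the compressed form only when it is noticeably smaller than the original. The sketch below assumes gzip and a 10% minimum saving; both the `maybe_compress` helper and the threshold are my own illustrative choices, not a decision I have made for the final solution.

```python
import gzip


def maybe_compress(data: bytes, min_saving: float = 0.1) -> tuple[bytes, bool]:
    """Return (payload, compressed_flag).

    The gzip output is kept only if it is at least `min_saving` (a fraction)
    smaller than the input. The threshold is an assumption for illustration.
    """
    packed = gzip.compress(data)
    if len(packed) <= len(data) * (1 - min_saving):
        return packed, True
    # Already-compressed data (JPEG, MP3, zip) usually ends up here.
    return data, False


if __name__ == "__main__":
    payload, compressed = maybe_compress(b"holiday in Wales " * 1000)
    print(compressed, len(payload))
```

Formats that are already compressed, such as most music and photo files, would normally be stored as they are, which is why the flag is returned alongside the payload.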
If I move from one storage solution to another, I should be able to move the data simply. For example, all of my music is currently on Amazon storage, but I may want to move it to my Google Drive.
Data formats need to be kept up to date with the most current version. For example, PDF files should be updated to the latest version, possibly keeping the original version as well.
Keywords should be stored for files by adding a set of keywords to the metadata. For example, a photograph of the family on holiday in Wales would have the keywords 'family', 'holiday', 'Wales' and 'Photograph' added to its metadata.
Data must be stored reliably.
Data must be replicated reliably.
The data must be available indefinitely, and it must be possible to pass data on to later generations.
It must be possible to verify that data is intact. The metadata should contain checksums that can be used to verify each file, and files should be checked on a regular basis.
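As a minimal sketch of the checksum idea, the following computes a digest for a file and compares it with the value recorded in the metadata. I have assumed SHA-256 here; the requirement only says "checksums", and the function names `sha256_of` and `is_intact` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str) -> bool:
    """Compare the file's current digest with the one stored in its metadata."""
    return sha256_of(path) == expected.lower()
```

A regular check would then amount to walking the archive and calling `is_intact` for every file against its stored digest, reporting any mismatch so the file can be restored from another copy.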
The development will be based on open source software and itself be made available as open source.
The solution must be deployable on Intel or Arm, and on Linux, BSD or OSX. The development world nowadays seems to be full of different build systems, which often have complex dependencies and poor support for non-Intel platforms, and are generally of poor quality.
When looking for an existing solution I found many that had grown out of multiple projects and were very complex, mixing different technologies with complicated build processes and dependencies. I want to avoid that with this solution.
Metadata is to be stored for each file, including the data used for verification, the location of copies, the number of copies and search data.
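To make this more concrete, here is a rough sketch of what one metadata record might hold. The `FileRecord` class, its field names, the placeholder digest and the example locations are all hypothetical; I have not settled on a real schema yet. It needs Python 3.9 or later.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FileRecord:
    """Hypothetical metadata record for one archived file."""
    path: str
    sha256: str                                          # data used for verification
    size_bytes: int
    keywords: list[str] = field(default_factory=list)    # search data
    locations: list[str] = field(default_factory=list)   # where the copies live

    @property
    def copy_count(self) -> int:
        # The number of copies is derived from the locations, not stored twice.
        return len(self.locations)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def search(records: list[FileRecord], keyword: str) -> list[FileRecord]:
    """Return the records that carry the keyword, ignoring case."""
    wanted = keyword.casefold()
    return [r for r in records if any(k.casefold() == wanted for k in r.keywords)]


if __name__ == "__main__":
    photo = FileRecord(
        path="photos/wales.jpg",
        sha256="0" * 64,  # placeholder, not a real digest
        size_bytes=2_400_000,
        keywords=["family", "holiday", "Wales", "Photograph"],
        locations=["local:/archive/photos/wales.jpg",
                   "cloud:example-bucket/photos/wales.jpg"],
    )
    print(photo.copy_count, [r.path for r in search([photo], "wales")])
```

Storing the record as JSON keeps it readable without any special tooling, which fits the aim of passing the archive on to later generations.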
Keep multiple copies in multiple locations. Reliability rests on the assumption that if one copy of a file is lost, another copy is available to replace it.
It must be possible for a file system to replicate itself to multiple locations, and for those locations to replicate to others in turn. For example, we could have a single replication to the cloud, with the cloud then replicating elsewhere to make multiple copies. It would be preferable to have simple rules that define how replication is configured.
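A simple rule format could be a table that lists, for each location, the locations it replicates to. The sketch below walks such a table to find every place that eventually receives a copy. The location names and the `replication_targets` function are made up for illustration, and this only models the rules; it does not copy any data.

```python
from collections import deque

# Hypothetical rules: each location lists the locations it replicates to.
RULES = {
    "local": ["cloud"],
    "cloud": ["offsite-a", "offsite-b"],
}


def replication_targets(rules: dict[str, list[str]], origin: str) -> list[str]:
    """Return every location that eventually receives a copy from `origin`,
    in breadth-first order. Loops in the rules are tolerated."""
    seen = {origin}
    order = []
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for target in rules.get(current, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    print(replication_targets(RULES, "local"))
    # → ['cloud', 'offsite-a', 'offsite-b']
```

With rules like these, the local file system only needs to know about the cloud, and the number of copies of a file is one plus the length of the returned list.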
Use standard file systems (local, network and remote), including distributed file systems, to provide high-availability storage.
Different replication solutions can be used.
The ability to read multiple file types, to convert files to a newer version, and to determine the file type by examining the content.
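Determining the file type from content usually means looking at the first few bytes, where many formats place a fixed signature. The sketch below checks a handful of well-known signatures; the table is deliberately tiny and the names `SIGNATURES`, `sniff_type` and `sniff_file` are hypothetical. A real solution would more likely lean on an existing open source library.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # also the container used by docx, xlsx and odt
]


def sniff_type(data: bytes) -> str:
    """Guess the file type from the leading bytes, ignoring the file name."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str, probe: int = 16) -> str:
    """Read only the first `probe` bytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(probe))


if __name__ == "__main__":
    print(sniff_type(b"%PDF-1.7\n"))      # → pdf
    print(sniff_type(b"plain old text"))  # → unknown
```

Plain text and CSV files have no signature, so they fall through to "unknown" here and would need a different check.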
To guard against file formats becoming obsolete, keep files in more than one format. For example, keep a Word document as plain text as well, a spreadsheet as a CSV file, and a PNG file as a JPEG.
The cost of storage, hardware, networking etc. should be affordable.
The backing file systems are expected to be read-only. However, if we want to allow files to be replaced or deleted, the solution should support version control.
Over the next couple of months I will look at the various storage solutions that I could possibly use and in Part 4 I will summarise the various storage solutions and file systems.