Long Term Data Archiving Formats, Storage, and Architecture

jzawodn

from Favicon Jeremy Zawodny's blog, posted 2 months ago

I'm thinking about ways to store archival data for the long term and wanted to solicit anyone who's been down this road for some input, advice, warnings, etc.

Background

Essentially I'm dealing with a system where "live" and "recently live" data is stored in a set of replicated MySQL servers and queried in real-time. As time goes on, however, older "expired" data is moved to a smaller set of replicated "archive" servers that also happen to be MySQL.

This is problematic for a few reasons, but rather than be all negative, I'll express what I'm trying to do in the form of some goals.

Goals

There are a few high-level things I'd like this archive to handle based on current and future needs:

  1. Be able to store data for the foreseeable future. That means hundreds of millions of records and, ultimately, billions.
  2. Fast access to a small set of records. In other words, I like having MySQL and indexes that'll get me what I want in a hurry without having to write a lot of code. The archive needs to be able to handle real-time queries quickly. It does this today and needs to continue to work.
  3. Future-proof file/data format(s). One problem with simply using MySQL is that there will be schema changes over time. A column may be added or dropped or renamed. That change can't easily be implemented retroactively on a larger data set in a big table or whatnot. But if you don't then code needs to be willing to deal with those changes, NULLs appearing, etc.
  4. Fault tolerance. In other words, the data has to live in more than once place.
  5. Support for large scans on the data. This can be to do full-text style searches, looking for patterns that can't easily be indexed, computing statistics, etc.
  6. (more)...

Open link in new window »