Wednesday, April 16, 2014

Preventing Bit Rot


Data Decay (sometimes referred to as "Data Rot") can corrupt your .jpgs as they idly sit on your hard drive.  How many of YOUR precious memories are deteriorating on your storage media right now?

Also in this issue:
  • Unobvious Things about the Sony A7 and A7r
  • New Ebooks out!
  • The Friedman Archives is Hiring!

General Announcements

We'll get to the part about Data Rot (an important topic!) in just a minute, but first, I have some general announcements:

New Ebooks are out!  The Complete Guide to Fujifilm's X100s by Tony Phillips, and The Complete Guide to Sony's A7 and A7r.  Early reviews on both have been outstanding, which is gratifying because our customers' expectations continue to increase with each new camera and book.

Also, "Ways to 'Wow!' with Wireless Flash" has been translated into Spanish and is available here.
Ebooks on the horizon cover the Olympus OM-D E-M1 and the Sony Alpha 6000.  Fire off an email to Gary at Friedman Archives dot com if you'd like to be notified of any of these releases.



Unobvious Things about the Alpha 7 and A7r
Some things are just easier to say and demonstrate... :-)




The Friedman Archives is (are?) Hiring!  (Well, sort of...)

Love photography?  Love to write and teach?  Have a thing for detail and curiosity?  (Have a few months of spare time? :-)  )  Then The Friedman Archives wants YOU!  

We're looking for a few people with whom to collaborate on a handful of upcoming projects.  Email me (Gary at Friedman Archives dot com) if you'd like more details.  We may be very picky about the skill set required of our writers, but on the other hand we also offer one of the most compelling compensation packages in the business.  "By Enthusiasts, For Enthusiasts" isn't just a motto, you know!

Data Decay

(Image caption: Another, more common .jpg corruption example - shifted image and colors.)
I worry about data decay.  A lot.  Already I have a few hundred corrupt .jpgs on my hard drive which standard file system check programs like CHKDSK (used in Microsoft Windows) have been unable to find or fix.  

The problem has two main causes: 1) physical media losing its magnetic orientation and strength (the kind of problem that CHKDSK and the underlying file system usually excel at finding and fixing), and 2) error-prone mass copying when transferring data from one hard drive to another, or from the camera to the computer.  Such copying usually employs no verification checks.

Error Detection and Correction codes (like the kind employed at the bit level every time your hard drive writes something to disk) are great at recovering from a one-bit error, but aren't designed to correct two-bit errors (which are statistically far less likely) - which is why there's a good chance you've never experienced the problem yourself.
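Real drives use much stronger codes than this (Reed-Solomon or LDPC at the sector level), but the single- versus double-bit behavior is easy to see in a toy Hamming(7,4) code. This is purely an illustration, not what any drive firmware actually runs:

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# It corrects any single-bit error, but a double-bit error produces
# a bogus syndrome and gets "corrected" into garbage.

def encode(nibble):
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def decode(codeword):
    c = [None] + codeword                  # 1-indexed for clarity
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s3 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s3        # 0 = clean, else bit position
    if syndrome:
        c[syndrome] ^= 1                   # flip the suspect bit
    return [c[3], c[5], c[6], c[7]]        # recovered data bits

data = [1, 0, 1, 1]
cw = encode(data)

one_err = cw[:]; one_err[2] ^= 1                      # flip one bit
assert decode(one_err) == data                        # recovered correctly

two_err = cw[:]; two_err[2] ^= 1; two_err[5] ^= 1     # flip two bits
assert decode(two_err) != data                        # silently mis-corrected
```

SECDED variants add one extra parity bit so that double errors are at least detected (if not corrected), which is closer to what ECC RAM does.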

Disk mirroring (and RAID configurations in general) won’t help you here, since if there’s an error in the file it will simply be copied onto the second disk.

What to do?  Well, there’s no automatic set-it-and-forget-it solution right now (I smell a business opportunity!), however there are some tools available to the serious archivist:

1) There are file integrity verification tools out there.  The way they work is you have the program scan a healthy directory, and it will generate what’s known as a “hash” (which you can think of as a complex checksum) for each file.  You can then run the same program in the future and it will re-calculate the hashes and compare them to the old ones, telling you if any of your original files have changed.  The tool doesn’t just look at date and time, but at the entirety of 1’s and 0’s in the file.  One popular tool is called ExactFile.
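The workflow these tools implement can be sketched in a few lines of Python. This is an illustration of the concept only - not ExactFile's actual format or algorithm, and the function names are made up:

```python
# Build a baseline of per-file hashes, then re-scan later and
# report any file whose contents no longer match the baseline.
import hashlib
import os

def hash_file(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def snapshot(root):
    """Hash every file under root (the baseline you'd save somewhere safe)."""
    return {os.path.join(dirpath, name): hash_file(os.path.join(dirpath, name))
            for dirpath, _, names in os.walk(root)
            for name in names}

def verify(root, baseline):
    """Return the files whose current hash differs from the baseline."""
    current = snapshot(root)
    return [path for path, digest in baseline.items()
            if current.get(path) != digest]
```

Note that a changed hash only tells you the bits moved - it can't tell you which copy is the good one.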

The downside to these programs is: once a problem is found, what do you do then?  It’s not clear how you can recover from the error.  The other downside is that every time you intentionally modify a file, you have to re-generate its hash.  (The THIRD downside is that you have to have the foresight to generate the hashes before the data rot begins.)

2) There are free tools you can download that search all directories for specific file types (such as image files) and check to see if they’re corrupted.  (Some also work on movie files, and other special-purpose files.  But I haven’t seen ANY tool that can work on a multitude of file types, including Microsoft Office files for example).  Here’s one I personally used to uncover hundreds of corrupt .jpgs on my hard drive, and there’s a version for Macs as well as PCs:  
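As a rough illustration of what such scanners look for: a structurally sound JPEG starts with the SOI marker (FF D8) and ends with the EOI marker (FF D9). A real checker decodes the entire image; this hypothetical sketch only catches gross damage like truncation:

```python
# Flag .jpg files that fail a basic structural check:
# missing start-of-image (FF D8) or end-of-image (FF D9) markers.
import os

def looks_intact(path):
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as f:
        if f.read(2) != b"\xff\xd8":       # SOI marker
            return False
        f.seek(-2, os.SEEK_END)
        return f.read(2) == b"\xff\xd9"    # EOI marker

def scan(root):
    """Return every .jpg under root that fails the structural check."""
    return [os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
            if name.lower().endswith((".jpg", ".jpeg"))
            and not looks_intact(os.path.join(dirpath, name))]
```

A bit-flip in the middle of the file would sail right past this check, which is why the dedicated tools do a full decode.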

Okay, so once you've found some corrupted image files, how can they be fixed?  In my experience, .JPG repair programs are about as effective and predictable as those programs that try to recover images from corrupted memory cards – it’s a crap shoot.  I've spent a few days going over countless website reviews of .jpg repair programs and I was either unimpressed by the success rate or unimpressed with the testing method of the article’s author.  So I can’t actually recommend anything because I haven’t had much success with the few that I've tried.  (And if something worked for me then it wouldn't necessarily work for you.)

However, in my research I was able to uncover a quirky website that offers a .jpg recovery service, and if their automatic tool can’t do it for you then they say they have an experienced staff that will go in and FIX THE .JPG BY HAND by analyzing the structure and doing a byte-level edit.  (Nothing beats the old fashioned way!)  If you really know what you’re doing this method holds the promise of the highest possible recovery rate.  Here’s their website and their other website.

Prevention
There are three things you should be doing NOW to help protect yourself from future data rot:
1) Remember that ON AVERAGE even the most durable storage medium will probably wear out after 5 years.  In addition to regular, daily backups (and keeping a rotating 3rd set offsite to guard against fire or theft), I strongly recommend implementing a data Replication / Refresh every 3 years or so.  This essentially means copying your entire data set over to a fresh hard drive every so often.  While this will help protect you against magnetic loss of certain bits, if you already have corrupted .jpgs due to sloppy copying then those corrupted files will get copied too.

2) Stop copying files using your computer’s file manager (Finder or Windows Explorer) and start using a file copy AND VERIFY program like TeraCopy (Windows) or Ultracopier (for OS X, Linux, and Windows).  These programs take twice as long, but they would have eliminated the primary source of my corrupted .jpgs had I been using them from Day 1.
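The copy-then-verify idea behind these programs is simple: after copying, re-read both files and compare checksums. Here's a minimal sketch of the concept (not TeraCopy's actual implementation):

```python
# Copy a file, then re-read source and destination and compare
# SHA-256 digests; raise if the copy doesn't match the original.
import hashlib
import shutil

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def copy_verified(src, dst):
    shutil.copy2(src, dst)                 # copies data plus metadata
    if sha256(src) != sha256(dst):
        raise IOError(f"verification failed copying {src} -> {dst}")
```

The extra read pass is what doubles the copy time - and it's also the whole point.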

3) Advanced users: industrial-strength file systems like ZFS (Unix / Linux), MacZFS, or Microsoft’s up-and-coming ReFS (Windows Server 2012, Windows 8.1) hold the promise of being more resilient and proactive about addressing these kinds of data rot problems.  They are the file system of the future (until our data sets get larger and older, that is. :-) ).

If you want to learn more about this essential subject, I can recommend a book that just came out called Data Protection for Photographers by Patrick Corrigan.  An evaluation copy crossed my desk a few days after I wrote this blog post and I was incredibly impressed by its thoroughness and easy-to-understandedness (that's a word!).  It's hard to make such a dry subject readable, but Patrick does a great job.

By the way, the tools casually mentioned above are by no means exhaustive.  If you have a tool or method that addresses this problem, please post it in the comments section below.  Everyone should learn from your experience.

==============

Seminars

The Friedman Archives High-Impact Photography Seminars are on hiatus this year because of some major projects I'm working on.  They will resume in 2015 starting with a tour of Australia (Melbourne and Sydney) and New Zealand (city TBD).  There is also ONE slot available in the September - October time frame for one camera club who wishes to bring me out.  Start submitting your applications now. :-)

Until next time...
Yours Truly, Gary Friedman


23 comments:

  1. +1 on the ZFS. It will probably be the easiest and most cost-effective solution of the bunch, since it is occurring at the block device level. The main downside is that it's not natively supported on Windows and Mac. OpenSolaris/Solaris have native support, and Linux has additional drivers/modules that will grant support. FreeBSD has ZFS support as well.

    In fact, you can make a ZFS-protected file server with FreeNAS (a FreeBSD-based NAS server), which will take your storage and create ZFS volumes from it. Use it as a network file server that is protected against bit rot.

    1. Yes, I should have mentioned FreeNAS in the blog. Great suggestion - the best of all worlds! FreeNAS can also host your own Dropbox work-alike so you don't have to pay anyone.

  2. Gary... Great post and a concern for all serious photographers. Curious to get your take on M-Disk for permanent archival storage, and do online photo storage companies like Shutterfly address this concern with back-office software? Appreciate any insights you might have. Keep up the great books and website!

    1. I talked (briefly) about M-disk in my October 2012 blog. My data sets are so large that just burning everything onto DVD could be a full-time job. I have no knowledge of what Shutterfly does.

  3. I like a variation on the periodic refresh idea: when I get a new backup HD (every couple of years), I retire the old HD that has a copy to that point (rather than erase/reuse/discard it), so I have a drawer of a few old HDs with everything up to their retirement dates. Not an organized solution, but bit rot won't get copied to the retired drives that are no longer being updated.

  4. I always go on the assumption that every drive will fail; it's just a matter of when. I always have at least two backups of my files. I back up my computers to a Windows Server Essentials 2012 box, and I back up the important stuff to an external hard drive that gets stored off site. Newer stuff is getting stored in the cloud as well.

    1. That's pretty much my approach. Now scan all of your .jpgs for corruption and tell us if any of them have corroded over time.

  5. I had a few JPG files silently corrupted by an old computer that was turned into a file server. The file system checked just fine; no errors were ever reported. Files were copied and recopied without any trouble. RAID did not help. Most probably bad RAM was the culprit.
    Fortunately, today we have filesystems that defend against such bit rot. Big props to folks who started ZFS at Sun and to FreeBSD and FreeNAS guys who made it available for free to the world. Now my pictures along with Lightroom metadata can stay error free for a long time for not a lot of money and effort on FreeNAS. Good thing Microsoft also woke up and decided to bring these features to mainstream with ReFS.

  6. Try checking out the MultiPar program. It creates some recovery files. Using these files, you can recreate the exact originals from damaged versions. I use it to create .PAR files for an entire directory at once. It uses the complete files and the .PAR files to repair damaged files - no matter what format. As a matter of fact, it doesn't recognize or check the format - it just makes it binary identical to the original. It can even help if you delete one of the files by accident.
    Just search for MultiPar or Parchive. The format is open source and there are several tools out there.

    1. Thanks for this heads-up! It's like MD5 except there's a path to recovery as well. Are PAR3 files still not recommended? Do you have to manually create a new PAR file every time you, say, edit an image or modify a Word document? GF

    2. PAR3 is still in development, so stay away from it for a while. Yes, you would have to manually recreate PAR files after such changes, or they would be rolled back. I use it after transferring a batch from the camera. I then create a set of PAR2 files, limiting each file to 20 MB if using RAW and 10 MB for jpegs. The reason is to guard against data rot in the PAR2 files as well. I then burn them in duplicate when I have enough for a DVD (and store one at my sister's flat).
      Incidentally, M-disk now has a Blu-ray 25 GB version. As usual, it is readable in a standard Blu-ray drive.

  7. To address the problem of decay on the physical hard drive media, I recommend you take a close look at SpinRite. The program's author (Steve Gibson) gives a thorough explanation of what his program does here: https://www.grc.com/sr/whatitdoes.htm

    P.S. I think Steve is a man after your own heart, Gary. You both have similar intellectual curiosity. :-)

    1. SpinRite saved one of my FAT32 hard drives in a previous life, and I'm glad to hear that Steve has been able to update it. (Yes, I agree he's a genius!) BUT SpinRite is designed to address a much more severe problem than the one I described. In fact, I'll go one further and say SpinRite wouldn't be able to sense or repair any of my corrupted .jpgs. GF

  8. Great video, Gary! Thanks for the great blog on data rot and what can be done to minimize its effects. I will be getting one of the new Sony cameras soon, and probably your ebook soon after.

  9. Not sure if you have looked into it, but there is a program called DiskFresh that I run regularly. I have yet to come across data decay, but it exists exactly to prevent such loss.
    http://www.puransoftware.com/DiskFresh.html

    1. I wasn't aware of this! It looks like SpinRite except you can use it while your system is still functioning. I also just downloaded Puran Utilities (same company) which looks very handy - and it's all FREE for home use! Thanks for this tip. Mind you, none of these tools will find or fix corrupted .jpgs...

  10. Gary, I appreciate you reminding us of this important issue, but I need to correct a few things:

    "Error Detection and Correction codes...are great at recovering from one bit error, but aren’t designed to handle two bit errors": This is not correct. ECC will detect any combination of failed bits in a disk sector. The ZIP file compression algorithm uses a similar method. You can validate this yourself by zipping a file and using a hex editor to change any single bit or any multiple combination of bits. Any change will cause the CRC check to fail upon unzip.
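The commenter's point is easy to check with the same CRC-32 that ZIP uses. (Strictly speaking, a 32-bit CRC can miss a random large change with probability about 2^-32, but it is guaranteed to catch all small error patterns at these data lengths.)

```python
# Flip one or several bits in a buffer and confirm CRC-32 notices.
import zlib

data = bytearray(b"some file contents" * 100)
original_crc = zlib.crc32(data)

data[5] ^= 0x01                        # single-bit flip
assert zlib.crc32(data) != original_crc

data[7] ^= 0x01; data[900] ^= 0x01     # additional multi-bit damage
assert zlib.crc32(data) != original_crc
```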

    "error-prone mass copying when transferring data from one hard drive to another, or from the camera to the computer. Usually such copying employs no verification checks." This is misleading. File copies are ultimately done by I/O calls, which will fail if any single or multiple bit combination on a referenced disk sector fails. It is true that file copies usually don't do an additional verification check on top of this (although that's easy to do), but any failure should cause an error. How that error is handled is up to the application (Windows Explorer, Finder, Terminal, Command Prompt "copy" command, etc). Typically (but not always) the copy operation is aborted.

    It is very easy to add a verification check. With Windows command prompt, just use copy /v or xcopy /v. For major copying of large directory trees via a GUI tool, you can verify this with a tool like Beyond Compare, probably the best available file/folder compare tool.

    "Disk mirroring (and RAID configurations in general) won’t help you here, since if there’s an error in the file it will simply be copied onto the second disk": This is not really right. Data isn't copied from a bad disk to a good disk, rather an error is detected during an I/O operation due to a CRC check or other h/w or OS I/O error. Then the failing disk is put off line. The very reason RAID works is because CRC checks during a transfer will detect any single or multiple bit failure, hence trigger the RAID failover. If this wasn't reliable, RAID would be of no use to anybody.

    "I have a few hundred corrupt .jpgs on my hard drive which standard file system check programs like CHKDSK (used in Microsoft Windows) have been unable to find or fix." This touches on the overall misconception: there are multiple conceptual layers of data involved -- each with their own internal data structures, each affected by software which only knows that layer. E.g, a corrupt .jpg file is like a corrupt LightRoom catalog or SQL database. The damage is at a higher level of abstraction than the file system. Hence no file system check will find that, prevent it, or fix it.

    File system checks and fixes -- whether CHKDSK or the latest ReFS -- only know about file system-level items. They cannot detect or correct higher-layer data corruption.

    Unfortunately the parties responsible for the higher-level data structures often do not provide integrity checks or poorly document these. But make no mistake -- those higher level structures are written by software which can contain errors and induce corruption at those levels. This will not be found by file system checks, nor can it be fixed by them.

    There is no good solution for this, but it's important to understand that verifying file copies or enhanced file systems will not prevent or repair such corruption.

    One solution is keeping multiple versions of backups, in hopes that when a problem is detected you can rewind to before it happened.

    Another solution is using specific integrity checks for each important data set. Unfortunately this is unique to the application-layer data. E.g, checking .jpg structure will not check the .CR2 or .NEF files -- those must be checked by routines familiar with the internal structure.

    1. No solution? Damn! I'm either going to switch to film or start a new business to directly address this unsolved problem. Now where is that spare time I was hiding away?

  11. I have been a professional software developer since the early 1990s. Silent data corruption on the hard disk that is not detected by the file system does not happen in my experience.

    Suppose bits would flip now and then. This would not only affect your JPG files but also all other files, including the executables of the operating system. Flip a single bit in a program and you will most likely find that the program crashes. In the case of core operating system files you will get a blue screen or the system hangs. The file system may not be able to correct 2-bit faults, but it would still detect the fault. And RAID systems do protect you against this.

    If you have lots of "corrupted" JPG files but other files are unaffected, I would assume that the program used to edit these files years ago was interpreting the JPG standard differently than modern programs do. I would try to open these images using mspaint.exe on a Windows NT 4.0 machine.

    Bad RAM is another way to get corrupted files. There are two possibilities: first, the file was damaged when it was copied or edited using a machine that had bad RAM. Second, the file is fine but your current machine has bad RAM.

    For long-term storage I would suggest two USB hard disks. Every 5 years or so you should move the complete archive to new, current disks. Standards sometimes change, or the common interpretation of a standard changes. A new hard- and software platform may not fully support media that were purchased and written more than 15 years ago.

    Regards,
    Paul Hoepping

    1. Hi, Paul. Yes, your recommendations pretty much match what I had mentioned early on in the article. Thanks for your input! GF

    2. "Silent data corruption on the hard disk that is not detected by the file system does not happen in my experience."

      Such corruption absolutely, positively happens. In fact for many years my specialty was fixing corrupt SQL databases at the lowest application layer. There was no indication whatsoever of the problem by file system checks. Similar corruption can happen with any database -- a LightRoom catalog, a FCP X library, etc. It will not be detected by any filesystem check, because those checks have no idea about the higher level data structures.

      How do such things happen? Bugs in software. All software has bugs, at all conceptual layers -- both file system and application layers. The goal is to avoid these, especially those which corrupt data, but perfect software is not possible.

      The fairly limited scope of experience of an end user or developer is not revealing in these cases. If you work at the highest level of escalation support for a file system or database product, you are exposed to dozens of such problems per year.

      There is an overlapping area between system and application layers where file system improvements can mitigate corruption. I/Os which are interrupted due to power or system failures can produce "torn writes", or incomplete logical I/Os. These don't cause a file system error but can prevent a higher-level data structure from being consistently modified. Currently torn-write detection must be coded at the application layer, which is difficult and often not implemented. However, if this were implemented at the file system layer (as supposedly done with ReFS), it would be much more consistent and would facilitate data reliability. But that's a single exception; even if implemented, various failure modes would still exist which could corrupt data at the application layer and would not be visible to file system checks.

      There is unfortunately no good comprehensive answer, but it's important to understand that RAID will not protect you, nor will improved file systems or file system checking utilities give complete protection.

      Each major data type and associated software must have its own integrity checks. E.g, LightRoom has a special integrity check.

      Ultimately the long term solution is for the file system itself to take on greater functionality, so programs interact with it more as a database than as a low-level file system. Unfortunately that is a long way off.

      In the meantime, having a sequence of chronological backups is a good approach. As already mentioned, don't rely 100% on file copies. Verify major copies with xcopy /v, copy /v, plus some utility such as Beyond Compare or others. If one datatype is vitally important to you, investigate whether verification utilities exist for that. E.g, here is one media validation utility. I know nothing about it and am not recommending it, it's just an example: http://www.mediavalid.com/en/products/view/7/
