
Dealing with duplicate backups


EricBall


I love making backups onto external drives. Plug it in and use EZBack-it-up (now I'm using robocopy) to do a simple copy of every file and every subdirectory onto the external drive. Then put that drive on the shelf until it's time to restore a file or do another backup. Not good for a bare-metal reinstall, but great for a basic file recovery solution. (Also useful when the internal drive gets full and you need to delete some large media files.) And when the drive gets full, buy another drive, copy the old backup onto it, and keep going.
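For reference, the robocopy invocation is nothing fancy; something like this (the drive letters and paths here are just placeholders):

    robocopy C:\Users\Eric F:\Backup\Users /E /R:1 /W:1 /LOG:F:\backup.log

/E copies all subdirectories (including empty ones), /R:1 /W:1 keep it from stalling forever on locked files, and /LOG writes a report I can skim afterwards.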

 

Last week I bought a 2TB drive (100,000 times larger than my first hard drive) for C$120+tax and made a backup of my PC (using Windows 7 Backup, which seems to be actually viable). Then I proceeded to dump all of the old backups onto it too. (And it's still half empty!) But now I want to take the time to eliminate some of the duplication which has accumulated over the years. I'm hoping I can get down to 500GB used so I can copy it onto the previous drive and stick it in the safety deposit box.

 

But doing it is harder than I thought. I've been playing around with a duplicate file finder and have run into a few challenges. The first is that the sheer number of subdirectories & files causes the tool to crash after 8+ hours. But after doing some more detection runs with a smaller set of subdirectories, I've made some interesting discoveries.

 

The first is that I've got a lot of duplicates, but not for the reasons I thought. There are four categories of duplicates:

 

1. Duplicate backups. The same file (or subdirectory of files) appears in multiple places because I backed it up at different times to different destinations/drives, or the subdirectory tree changed. What I'd like to do is delete as many of these as possible, but I also need to identify the unique files, ensure those aren't lost, and maybe try to consolidate them.

2. Files which have changed subdirectories - typically they've been recategorized, or appear in multiple subdirectories. Again, I'd like to delete all but one of these. (Or hard link in some special cases.)

3. Duplicate application files. For some reason the application has multiple copies of the same file in the same backup. This is where hard links are useful, since a hard link keeps only one copy of the data but with multiple directory entries. (See the script sketch after this list.)

4. Coincidental duplicates - typically because two applications use the same library. Again, hard links should handle this.
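For categories 3 and 4, once the duplicates are known the hard-linking itself is easy to script. Here's a rough Python sketch of the idea (not a tool I'm actually running; the paths are placeholders, and it assumes everything sits on one NTFS volume, since hard links can't cross volumes):

    import hashlib, os, sys

    def file_hash(path, chunk=1 << 20):
        # MD5 of the file contents, read in chunks so big files don't eat all the RAM
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def hardlink_duplicates(root):
        # Replace byte-identical files under root with hard links to the first copy seen.
        seen = {}  # (size, hash) -> path of the first copy
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                key = (os.path.getsize(path), file_hash(path))
                first = seen.get(key)
                if first is None:
                    seen[key] = path
                elif not os.path.samefile(path, first):
                    os.remove(path)        # drop the duplicate...
                    os.link(first, path)   # ...and hard link it back to the original

    if __name__ == "__main__":
        hardlink_duplicates(sys.argv[1])

Obviously anything like this needs a dry-run mode before being let loose on a backup drive.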

 

The problem is the duplicate file detector is focused on finding duplicate files, not on looking at the subdirectory tree and identifying subdirectories with duplicate contents. And even if I give it two subdirectory trees and tell it to find duplicates between them, it still looks for duplicates within each tree as well.
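What I really want is something that fingerprints whole subdirectories instead of individual files: hash every file, then roll those hashes up into one hash per directory, so two directories with identical contents get identical fingerprints. Something along these lines might work (again, just a sketch of the idea, not something I've written yet):

    import hashlib, os
    from collections import defaultdict

    def file_hash(path, chunk=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def dir_fingerprints(root):
        # {directory: fingerprint}, where the fingerprint covers the directory's
        # (filename, file hash) pairs plus its subdirectories' fingerprints.
        fingerprints = {}
        # walk bottom-up so subdirectories are fingerprinted before their parents
        for dirpath, dirnames, filenames in os.walk(root, topdown=False):
            h = hashlib.md5()
            for name in sorted(filenames):
                h.update(name.encode("utf-8"))
                h.update(file_hash(os.path.join(dirpath, name)).encode("ascii"))
            for name in sorted(dirnames):
                sub = os.path.join(dirpath, name)
                if sub in fingerprints:
                    h.update(name.encode("utf-8"))
                    h.update(fingerprints[sub].encode("ascii"))
            fingerprints[dirpath] = h.hexdigest()
        return fingerprints

    def duplicate_dirs(root):
        # group directories whose fingerprints match, i.e. candidates for deletion
        groups = defaultdict(list)
        for path, fp in dir_fingerprints(root).items():
            groups[fp].append(path)
        return [paths for paths in groups.values() if len(paths) > 1]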

 

I've tried to use windiff on the directory listings, but they are too large! So now I'm thinking I need to create a tool which could be used to identify possible duplicate subdirectories, e.g. two subdirectories with the same name might well have the same contents. I don't know how well it will work, but it's worth a try.
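As a first cut, even the dumb version might be good enough: group subdirectories by name, compare file counts and total sizes, and only spend the expensive hashing time on the groups that match. A quick sketch of that (the F:\Backups path is just an example):

    import os
    from collections import defaultdict

    def dir_summary(path):
        # cheap screen: (number of files, total bytes) for everything under path
        count, total = 0, 0
        for dirpath, _, filenames in os.walk(path):
            for name in filenames:
                count += 1
                total += os.path.getsize(os.path.join(dirpath, name))
        return count, total

    def candidate_duplicate_dirs(root):
        # group subdirectories by (name, file count, total size); same-name
        # directories with matching sizes deserve a closer (hash-based) look
        groups = defaultdict(list)
        for dirpath, dirnames, _ in os.walk(root):
            for name in dirnames:
                path = os.path.join(dirpath, name)
                groups[(name.lower(),) + dir_summary(path)].append(path)
        return [paths for paths in groups.values() if len(paths) > 1]

    for group in candidate_duplicate_dirs(r"F:\Backups"):
        print(group)

It rescans directories it has already visited, so it won't be fast, but for a one-time cleanup that probably doesn't matter.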

 

I already found one duplicate backup and got back 64GB.
