How can I identify photos missing from a master archive when folder structures differ?

Question

I’m consolidating about 15 years of photos from multiple hard drives into a single master archive on Windows. The files are spread across different folder structures, and there are many duplicates. Most duplicates should be exact, bit-identical copies rather than resized or edited variants.

What I need is a way to scan another drive and list the photo files that are not already present in the master archive, even if the folders are organized differently. I don’t want software that enforces its own library structure, and I’m not looking for visual duplicate detection—just a reliable way to find exact files that are missing from the archive so I can manually import and organize them.

Is there a script or tool that can do this?

user6508 · Answer

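Hashes are the key to doing this and getting it right: two files with the same hash are, for practical purposes, bit-identical, no matter what they are named or which folder they sit in.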

If you're up to getting your hands a little dirty, this would be an easy effort as a shell script:

Generate a list of hashes of every file in your master archive. Call this list MA ("master archive").
Generate a two-field list of hashes and paths to every file in your other archives. Call this list OA ("other archives").
Extract the list of hashes in OA. Call this list OH ("other hashes").
Pull out a list of every hash in OH that isn't in MA. (Off the top of my head, grep -Fxv -f MA OH.) Call this ML ("missing list").
Pull out a list of every file in OA that matches a line in ML. (grep -F -f ML OA | awk '{ print $2 }'. If your paths contain spaces, awk will cut them off at the first space; with sha1sum output, cut -c43- in place of the awk step keeps the whole path.)

The end result of the last command will be a list of the files in your other archives that don't match one in the master.

For Windows, there's a package called Cygwin that includes all of the utilities you'd need to do this (bash, find, sha1sum, grep and awk).
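
The steps above can be sketched as a single bash function. This is only a rough illustration, not a tested tool: the function name find_missing and the temporary file layout are made up for the example, and it assumes GNU find, sha1sum, grep and cut (as shipped with Cygwin) and file names without newlines or backslashes, since sha1sum escapes those and shifts the columns.

```shell
#!/usr/bin/env bash
# Sketch: print files under OTHER whose content hash is absent from MASTER.
# find_missing and the MA/OA/OH/ML temp files are illustrative names that
# mirror the lists described in the steps.

find_missing() {
    local master=$1 other=$2
    local tmp
    tmp=$(mktemp -d) || return 1

    # MA: one SHA-1 hash per line for every file in the master archive.
    find "$master" -type f -exec sha1sum {} + | cut -c1-40 | sort -u > "$tmp/MA"

    # OA: sha1sum's "hash, separator, path" lines for the other archive.
    find "$other" -type f -exec sha1sum {} + > "$tmp/OA"

    # OH: just the hashes from OA (first 40 characters of each line).
    cut -c1-40 "$tmp/OA" | sort -u > "$tmp/OH"

    # ML: hashes that occur in OH but not in MA.
    grep -Fxv -f "$tmp/MA" "$tmp/OH" > "$tmp/ML"

    # Paths in OA whose hash is in ML. The path starts at column 43
    # (40 hex digits plus a two-character separator), so cut keeps
    # paths with spaces intact where awk '{ print $2 }' would not.
    grep -F -f "$tmp/ML" "$tmp/OA" | cut -c43-

    rm -rf "$tmp"
}
```

Cygwin exposes Windows drives under /cygdrive, so after sourcing the function a call might look like find_missing /cygdrive/d/Photos /cygdrive/e > missing.txt (both paths are placeholders). Hashing every file means reading every byte on both drives, so expect a long first run on a large archive.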
