How can I identify photos missing from a master archive when folder structures differ?

Asked 1/7/2015

0

I’m consolidating about 15 years of photos from multiple hard drives into a single master archive on Windows. The files are spread across different folder structures, and there are many duplicates. Most duplicates should be exact, bit-identical copies rather than resized or edited variants.

What I need is a way to scan another drive and list the photo files that are not already present in the master archive, even if the folders are organized differently. I don’t want software that enforces its own library structure, and I’m not looking for visual duplicate detection—just a reliable way to find exact files that are missing from the archive so I can manually import and organize them.

Is there a script or tool that can do this?

Originally by Photography Stack Exchange contributor. Source · Licensed CC BY-SA 4.0

Photography Stack Exchange contributor

11y ago

2 Answers

3

Hashes are actually the key to doing this and getting it right.

If you're up to getting your hands a little dirty, this would be an easy effort as a shell script:

  • Generate a list of hashes of every file in your master archive. Call this list MA ("master archive").
  • Generate a two-field list of hashes and paths for every file in your other archives. Call this list OA ("other archives").
  • Extract the list of hashes from OA. Call this list OH ("other hashes").
  • Pull out a list of every hash in OH that isn't in MA. (Off the top of my head, fgrep -xv -f MA OH.) Call this ML ("missing list").
  • Pull out a list of every file in OA that matches a line in ML. (fgrep -f ML OA | awk '{ print $2 }'; note that $2 stops at the first space, so paths containing spaces need the rest of the line instead.)

The end result of the last command will be a list of the files in your other archives that don't match one in the master.
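As a rough sketch, the steps above could be wrapped in a bash function like the one below. The function name list_missing and its two directory arguments are made up for illustration; MA, OA, OH and ML are the lists named in the steps. It assumes GNU find, sha1sum, grep, cut and sort (as shipped with Cygwin), and it assumes file names contain no newlines or backslashes, since sha1sum escapes those. It uses cut rather than awk's $2 to pull out the path so that names with spaces survive.

```shell
# list_missing MASTER_DIR OTHER_DIR  (hypothetical helper, not a standard tool)
# Prints the paths under OTHER_DIR whose contents are not found under MASTER_DIR.
list_missing() {
    local master=$1 other=$2 work
    work=$(mktemp -d)

    # Step 1: MA = hashes of every file in the master archive.
    # A SHA-1 hash is 40 hex characters, so columns 1-40 are the hash.
    find "$master" -type f -exec sha1sum {} + | cut -c1-40 | sort -u > "$work/MA"

    # Step 2: OA = "hash  path" lines for every file in the other archive.
    find "$other" -type f -exec sha1sum {} + > "$work/OA"

    # Step 3: OH = just the hashes from OA.
    cut -c1-40 "$work/OA" | sort -u > "$work/OH"

    # Step 4: ML = hashes in OH that do not appear in MA.
    # grep exits non-zero when it selects nothing, which is not an error here.
    grep -Fxv -f "$work/MA" "$work/OH" > "$work/ML" || true

    # Step 5: lines of OA whose hash is in ML, with the hash and the
    # two-character separator stripped so only the path remains.
    grep -F -f "$work/ML" "$work/OA" | cut -c43-

    rm -rf "$work"
}
```

Under Cygwin, a call might look like list_missing /cygdrive/d/Photos /cygdrive/e > missing.txt, with both paths standing in for your own drives.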

For Windows, there's a package called Cygwin that includes all of the utilities you'd need to do this (bash, find, sha1sum, grep and awk).

Originally by user6508. Source · Licensed CC BY-SA 4.0

user6508

11y ago

0

AI Answer

Generated from our catalog & community — verify before relying on it.

The most reliable approach is to compare files by hash, not by folder structure or filename.

A practical workflow is:

  1. Generate a list of hashes for every file in the master archive.
  2. Generate a list of hash + file path for files on the other drive(s).
  3. Compare the hash lists.
  4. Any hash present on the other drive but not in the master archive represents a file missing from the archive.
  5. Output the matching file paths so you can manually review and import them.

This works even when folders are organized differently, and it avoids false matches from same-name files with different contents. Although hashing takes time, it’s the key to doing this correctly.
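For illustration, steps 3 to 5 of the workflow can be done in a single awk pass once the two hash lists exist. This is only a sketch: missing_paths is a hypothetical function name, and both arguments are assumed to be files of sha1sum output, one "hash, two separator characters, path" line per file, with SHA-1's 40-character hashes.

```shell
# missing_paths MASTER_LIST OTHER_LIST  (hypothetical helper)
# Both arguments are files of sha1sum output. Prints the path of every
# entry in OTHER_LIST whose hash never appears in MASTER_LIST.
missing_paths() {
    awk '
        # While reading the first file, remember each master hash.
        FILENAME == ARGV[1] { in_master[$1] = 1; next }
        # While reading the second file, print the path (column 43 onward,
        # after the 40-character hash and 2-character separator) when the
        # hash was not seen in the master list.
        !($1 in in_master) { print substr($0, 43) }
    ' "$1" "$2"
}
```

Taking the path from a fixed column rather than a whitespace-split field keeps file names with spaces intact. If the same missing file exists twice on the other drive, both copies are listed.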

If you prefer an off-the-shelf utility, rsync may also help: with the right options and a dry run, it can report files it would transfer. That said, rsync is generally better when you’re synchronizing two locations, while hash-based comparison is the better fit when the folder structures no longer match.

So: to find exact files that are missing from the archive, use a hash-based script or tool.

UniqueBot

AI

11y ago
