How can I identify photos missing from a master archive when folder structures differ?
Asked 1/7/2015
I’m consolidating about 15 years of photos from multiple hard drives into a single master archive on Windows. The files are spread across different folder structures, and there are many duplicates. Most duplicates should be exact, bit-identical copies rather than resized or edited variants.
What I need is a way to scan another drive and list the photo files that are not already present in the master archive, even if the folders are organized differently. I don’t want software that enforces its own library structure, and I’m not looking for visual duplicate detection—just a reliable way to find exact files that are missing from the archive so I can manually import and organize them.
Is there a script or tool that can do this?
Originally by Photography Stack Exchange contributor. Source · Licensed CC BY-SA 4.0
2 Answers
Hashes are actually the key to doing this and getting it right.
If you're up to getting your hands a little dirty, this would be an easy effort as a shell script:
- Generate a list of hashes of every file in your master archive. Call this list MA ("master archive").
- Generate a two-field list of hashes and paths for every file in your other archives. Call this list OA ("other archives").
- Extract the list of hashes in OA. Call this list OH ("other hashes").
- Pull out a list of every hash in OH that isn't in MA. (Off the top of my head, fgrep -xv -f MA OH.) Call this ML ("missing list").
- Pull out a list of every file in OA that matches a line in ML. (fgrep -f ML OA | awk '{ print $2 }'. Note that $2 only gives the whole path if the path contains no spaces.)
The end result of the last command will be a list of the files in your other archives that don't match one in the master.
For Windows, there's a package called Cygwin that includes all of the utilities you'd need to do this (bash, find, sha1sum, grep and awk).
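As a rough sketch, the five steps could be put together like this. The function names hash_tree and list_missing are made up for illustration, and the script assumes the GNU versions of find, sha1sum, grep and cut that Cygwin ships. It cuts the sha1sum output by column instead of using awk's $2, so paths containing spaces come through whole.

```shell
# hash_tree DIR: print "<sha1>  <path>" for every regular file under DIR.
hash_tree() {
    find "$1" -type f -exec sha1sum {} +
}

# list_missing MASTER_DIR OTHER_DIR: print the paths under OTHER_DIR whose
# content hash does not occur anywhere under MASTER_DIR.
# Assumes plain sha1sum lines: 40 hex digits, two separator characters, path.
# (File names containing backslashes or newlines are escaped by sha1sum and
# are not handled by this sketch.)
list_missing() {
    local tmp
    tmp=$(mktemp -d) || return 1

    hash_tree "$1" | cut -c1-40 | sort -u > "$tmp/MA"   # step 1: master hashes
    hash_tree "$2" > "$tmp/OA"                          # step 2: hash + path
    cut -c1-40 "$tmp/OA" | sort -u > "$tmp/OH"          # step 3: other hashes
    grep -Fxv -f "$tmp/MA" "$tmp/OH" > "$tmp/ML"        # step 4: missing hashes

    # step 5: paths in OA whose hash is in ML (skipped when nothing is missing)
    if [ -s "$tmp/ML" ]; then
        grep -F -f "$tmp/ML" "$tmp/OA" | cut -c43-
    fi

    rm -rf "$tmp"
}
```

With the functions loaded, something like list_missing /cygdrive/d/Photos /cygdrive/e > missing.txt would write the candidate files to a text file for manual review. Both paths are placeholders for your own drives.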
Originally by user6508. Source · Licensed CC BY-SA 4.0
Generated from our catalog & community — verify before relying on it.
The most reliable approach is to compare files by hash, not by folder structure or filename.
A practical workflow is:
- Generate a list of hashes for every file in the master archive.
- Generate a list of hash + file path for files on the other drive(s).
- Compare the hash lists.
- Any hash present on the other drive but not in the master archive represents a file missing from the archive.
- Output the matching file paths so you can manually review and import them.
This works even when folders are organized differently, and it avoids false matches from same-name files with different contents. Although hashing takes time, it’s the key to doing this correctly.
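Because hashing is the slow part, one option is to save the hash lists to files and compare those, so the master archive only needs to be hashed once. The sketch below assumes both lists are plain sha1sum output (40 hex digits, two separator characters, then the path). The function name missing_from_lists and the file names in the comments are placeholders, not part of any existing tool.

```shell
# Building the lists (run once per drive; paths are placeholders):
#   find /cygdrive/d/Photos -type f -exec sha1sum {} + > master.sha1
#   find /cygdrive/e        -type f -exec sha1sum {} + > other.sha1

# missing_from_lists MASTER_LIST OTHER_LIST
# Print every path in OTHER_LIST whose hash does not appear in MASTER_LIST.
missing_from_lists() {
    awk '
        # First file: remember every master hash.
        FILENAME == ARGV[1] { seen[substr($0, 1, 40)] = 1; next }
        # Second file: print the path when its hash was never seen.
        !(substr($0, 1, 40) in seen) { print substr($0, 43) }
    ' "$1" "$2"
}
```

Calling missing_from_lists master.sha1 other.sha1 then prints one path per line. If several files on the other drive share the same missing hash, each of them is listed.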
If you prefer an off-the-shelf utility, rsync may also help: with the right options and a dry run, it can report files it would transfer. That said, rsync is generally better when you’re synchronizing two locations, while hash-based comparison is the better fit when the folder structures no longer match.
So, to match exact duplicates and list the files that are missing from the master archive, use a hash-based script or tool.
UniqueBot
AI11y ago