find and remove duplicates in a directory

I have a directory with multiple image files, and some of them are identical but they all have different names. I need to remove the duplicates, but with no external tools, only with a bash script. I’m a beginner in Linux. I tried a nested for loop to compare md5 sums and remove a file depending on the result, but something is wrong with the syntax and it doesn’t work. Any help?

What I’ve tried is…

for i in directory_path; do
    sum1='find $i -type f -iname "*.jpg" -exec md5sum '{}' \;'
    for j in directory_path; do
        sum2='find $j -type f -iname "*.jpg" -exec md5sum '{}' \;'
        if test $sum1=$sum2 ; then rm $j ; fi
    done
done

I get: test: too many arguments

Anthon

There are quite a few problems in your script.

  • First, in order to assign the result of a command to a variable, you need to enclose it either in backticks (`command`) or, preferably, $(command). You have it in single quotes ('command'), which, instead of assigning the result of your command to your variable, assigns the command itself as a string. Therefore, your test is actually:
    $ echo "test $sum1=$sum2"
    test find $i -type f -iname "*.jpg" -exec md5sum {} \;=find $j -type f -iname "*.jpg" -exec md5sum {} \;
    
  • The next issue is that the command md5sum returns more than just the hash:
    $ md5sum /etc/fstab
    46f065563c9e88143fa6fb4d3e42a252  /etc/fstab
    

    You only want to compare the first field, so you should parse the md5sum output by passing it through a command that only prints the first field:

    find $i -type f -iname "*.png" -exec md5sum '{}' ; | cut -f 1 -d ' '
    

    or

    find $i -type f -iname "*.png" -exec md5sum '{}' ; | awk '{print $1}' 
    
  • Also, the find command will return many matches, not just one, and each of those matches will be duplicated by the second find. This means that at some point you will be comparing each file to itself; since its md5sum will be identical, you will end up deleting all your files (I ran this on a test dir containing a.jpg and b.jpg):
    for i in $(find . -iname "*.jpg"); do
      for j in $(find . -iname "*.jpg"); do
         echo "i is: $i and j is: $j"
      done
    done   
    i is: ./a.jpg and j is: ./a.jpg   ## BAD, will delete a.jpg
    i is: ./a.jpg and j is: ./b.jpg
    i is: ./b.jpg and j is: ./a.jpg
    i is: ./b.jpg and j is: ./b.jpg   ## BAD, will delete b.jpg
    
    
  • You don’t want to run for i in directory_path unless you are passing an array of directories. If all these files are in the same directory, you would run something like for i in $(find directory_path -iname "*.jpg") to go through all the files (but see the next point).
  • It is a bad idea to use for loops with the output of find. You should use while loops or globbing:
    find . -iname "*.jpg" | while read i; do [...] ; done
    

    or, if all your files are in the same directory:

    for i in *jpg; do [...]; done
    

    Depending on your shell and the options you have set, you can use globbing even for files in subdirectories, but let’s not get into that here (there is a short sketch right after this list).

  • Finally, you should also quote your variables, otherwise directory and file paths with spaces will break your script.
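
As promised above, here is a minimal sketch of recursive globbing, assuming bash 4 or later for the globstar option (directory_path is a placeholder):

shopt -s globstar nullglob   ## ** now matches recursively; nullglob drops patterns that match nothing
for i in directory_path/**/*.jpg; do
  echo "would process: $i"   ## replace with your md5sum logic
done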

File names can contain spaces, newlines, backslashes and other weird characters; to deal with those correctly in a while loop you’ll need to add some more options. What you want to write is something like:

find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' i; do
  ## skip files that an earlier iteration has already deleted
  [ -e "$i" ] || continue
  find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' j; do
    ## never compare a file to itself
    if [ "$i" != "$j" ]
    then
      sum1=$(md5sum "$i" | cut -f 1 -d ' ' )
      sum2=$(md5sum "$j" | cut -f 1 -d ' ' )
      [ "$sum1" = "$sum2" ] && rm "$j"
    fi
  done
done
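
To check what would be removed before actually deleting anything, turn the rm into an echo for a dry run:

[ "$sum1" = "$sum2" ] && echo rm "$j"   ## prints the rm command instead of running it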

An even simpler way would be:

find directory_path -name "*.jpg" -exec md5sum '{}' + | 
 perl -ane '$k{$F[0]}++; system("rm $F[1]") if $k{$F[0]}>1'

A better version that can deal with spaces in file names:

find directory_path -name "*.jpg" -exec md5sum '{}' + | 
 perl -ane '$k{$F[0]}++; system("rm \"@F[1 .. $#F]\"") if $k{$F[0]}>1'

This little Perl script runs through the results of the find command (i.e. the md5sum and file name). The -a option for perl splits input lines at whitespace and saves the fields in the @F array, so $F[0] will be the md5sum and the remaining fields the file name. Each md5sum is counted in the hash %k, and the script checks whether that sum has already been seen (if $k{$F[0]}>1) and deletes the file if it has (the system("rm …") call).
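
Since the question asked for a bash-only script, the same idea (remember every checksum and delete any later file whose checksum was already seen) can be written in plain bash as well. A minimal sketch, assuming bash 4 or later for associative arrays:

declare -A seen                          ## maps checksum -> first file seen with it
while IFS= read -r -d '' f; do
  sum=$(md5sum "$f" | cut -f 1 -d ' ')   ## first field of md5sum output is the hash
  if [ -n "${seen[$sum]}" ]; then
    rm -- "$f"                           ## checksum already seen: this file is a duplicate
  else
    seen[$sum]=$f                        ## first file with this checksum: keep it
  fi
done < <(find directory_path -type f -iname "*.jpg" -print0)

The < <(find …) process substitution keeps the while loop in the current shell so the seen array survives between iterations, and each file is hashed only once, which is much faster than the nested loops above.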


While that will work, it will be very slow for large image collections, and it gives you no control over which copy of a file to keep. There are programs that handle this in a more elegant way; one nifty tool is fdupes, which simplifies the whole process and prompts the user before deleting duplicates. I think it is worth checking:

$ fdupes --delete DIRECTORY_WITH_DUPLICATES
[1] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz        
[2] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1

Set 1 of 1, preserve files [1 - 2, all]: 1

   [+] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
   [-] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1

Basically, it prompted me for which file to keep, I typed 1, and it removed the second.

Other interesting options are:

-r --recurse
    for every directory given follow subdirectories encountered within

-N --noprompt
    when used together with --delete, preserve the first file in each set of duplicates and delete the others without prompting the user

From your example, you probably want to run it as:

fdupes --recurse --delete --noprompt DIRECTORY_WITH_DUPLICATES
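
If you would rather inspect the duplicate sets before removing anything, run fdupes without --delete and it will simply list them:

fdupes --recurse DIRECTORY_WITH_DUPLICATES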

See man fdupes for all options available.
