I’m producing a text file (in zsh on MacOS) containing pathnames and their associated checksums.

# finding all the files in a directory and checksumming them
find . -type f -exec md5 -r {} \; > file1.txt

# sorting the file by the first field (checksum)
LC_ALL=C sort -k 1,1 file1.txt > file2.txt

# using awk to keep all/only lines with duplicated first/checksum fields
# (i.e., duplicate files in the directory)
# I found this awk on the net and it works
# yes, the input file is read twice
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' file2.txt file2.txt > file3.txt

You can produce a sample file by executing the three commands above on the directory of your choice. Here’s a short sample:

0c1fe4bd35f263f1eb3944c3bd6036e7 ./photoshop-conversion/pano-work-02.psb
0c1fe4bd35f263f1eb3944c3bd6036e7 ./photoshop-conversion1/pano-work-02.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion3/pano-03.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion4/pano-03.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion5/pano-03.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion6/pano-03.psb
101e5579acc8389796d0155461ef5183 ./photoshop-conversion5/pano-01.psb
101e5579acc8389796d0155461ef5183 ./photoshop-conversion6/pano-01.psb

At this point, file3.txt lists every checksum and pathname whose checksum is duplicated, but there is no white-space (blank lines) between the groups. I want to add a blank line between the groups of 2 or more lines that share a first field, to make the listing human-readable. This can be done either by another discrete stage (producing file4.txt from file3.txt) or by modifying the prior awk stage to insert newlines between lines that have different first fields (as file3.txt is produced).

This would do something like:

if (first-field-of-current-line != first-field-of-next-line)
then insert new-line after end-of-current-line

In the sample above, this would insert a blank line between the 2nd and 3rd lines and between the 6th and 7th lines.

I don’t care how it’s done — awk, sed, grep — so long as it’s available for zsh in MacOS.

Extra points if you can count how many groups there are (i.e., how many new-lines get inserted).

I’ve tried to change the awk line herein, but I don’t understand it well enough not to break it.

2 Answers


  1. One option is to do almost everything in zsh. This uses a fairly conventional shell loop, which can sometimes be easier to understand:

    #!/usr/bin/env zsh
    
    # create array with md5 values for all regular files
    #   - **/*: recursive file glob
    #   - (.): limit to regular files
    #   - (f): split by newlines
    #   - (o): sort the lines
    typeset -a md5sums=(${(fo)"$(md5 -r **/*(.))"})
    
    count=0
    inGroup=false
    for entry in $md5sums ''; do
        # split line using equals expansion, get sum
        sum=${${=entry}[1]}
        if [[ $sum == $priorSum ]]; then
            inGroup=true
            print $priorEntry
        elif $inGroup; then
            inGroup=false
            print $priorEntry
            print
            ((++count))
        fi
        priorEntry=$entry
        priorSum=$sum
    done
    print "Count: $count"
    

    This script should do everything in the process you outlined, from generating the md5 values to creating the final output. It uses some zsh-isms that may not be familiar; most are documented in the zshexpn man page and in the online guide.

  2. The temporary files seem ugly and unnecessary.

    find . -type f -exec md5 -r {} + |
    LC_ALL=C sort -k 1,1 |
    awk 'a[$1]++ { if(prev) print rs prev;
        prev=""; print; rs="\n";
        if(a[$1] == 2) n++; next }
      { prev=$0 }
      END { print "=== total", n }' >file.txt
    

    Because the input is already sorted, you only need one pass through the Awk script. We keep track of the previous line in prev and if we see another line with the same prefix, we print that too.

    In some more detail, a keeps track of whether we have seen this MD5 before. If we have, print prev and then make sure we don’t print it again if we see additional instances of the same MD5. The variable rs will be empty originally, but contain a newline after we have printed the first group (so we avoid printing an empty line before the first group, but print one before each subsequent group). Finally, we increment n if this was a new group.

    If we fall through to the next unconditional action, that means we are looking at a new unique MD5. Keep track of this line in prev in case it would turn out to be a duplicate when the next line is read in.

    Finally, in the END block, print how many groups we counted.
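
    To see the walkthrough above on concrete input, here is a minimal sketch that wraps the same Awk program in a shell function and feeds it a few lines from the question. The function name group_dups and the extra unique entry are made up for illustration; the input is assumed to be sorted by checksum already.

```shell
#!/bin/sh
# group_dups is a hypothetical wrapper name; the awk program is the one from
# the answer, reformatted and commented. Input must be sorted by checksum.
group_dups() {
    awk '
        a[$1]++ {                      # true from the second sighting onward
            if (prev) print rs prev    # emit the held-back first member
            prev = ""                  # and make sure it is printed only once
            print                      # emit the current duplicate
            rs = "\n"                  # later groups get a leading blank line
            if (a[$1] == 2) n++        # count each group exactly once
            next
        }
        { prev = $0 }                  # first sighting: hold the line back
        END { print "=== total", n }
    '
}

# First four sample lines from the question plus one made-up unique entry.
printf '%s\n' \
    '0c1fe4bd35f263f1eb3944c3bd6036e7 ./photoshop-conversion/pano-work-02.psb' \
    '0c1fe4bd35f263f1eb3944c3bd6036e7 ./photoshop-conversion1/pano-work-02.psb' \
    '0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion3/pano-03.psb' \
    '0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion4/pano-03.psb' \
    'ffffffffffffffffffffffffffffffff ./unique-example.psb' |
    group_dups
```

    This prints the two pairs separated by one blank line, drops the unique entry, and ends with "=== total 2".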

    As a minor additional efficiency hack, I changed the \; to +, which causes find -exec to pass many pathnames to each md5 invocation, much as xargs would.
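
    The difference between the two -exec terminators can be made visible with a small experiment. This is only an illustration: echo stands in for md5 so that each invocation shows up as one output line, and the scratch directory and file names are arbitrary.

```shell
#!/bin/sh
# Illustration only: echo stands in for md5 so the batching is visible.
demo=$(mktemp -d)                 # scratch directory, removed at the end
touch "$demo/a" "$demo/b" "$demo/c"

# \; runs the command once per file: three invocations, three lines.
per_file=$(find "$demo" -type f -exec echo {} \; | wc -l)

# + appends as many pathnames as fit to one invocation: one line here.
batched=$(find "$demo" -type f -exec echo {} + | wc -l)

# $(( )) strips the padding some wc implementations add.
echo "per-file invocations: $((per_file))  batched invocations: $((batched))"
rm -rf "$demo"
```

    With only three files everything fits into a single batch; on a large tree find may still start the command several times, just far less often than once per file.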

    The total line at the end is slightly ugly, too; perhaps drop it and instead simply run grep -cxF '' file.txt and add 1.
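
    A rough sketch of that blank-line count follows. The helper name count_groups is hypothetical, and the sketch assumes the listing holds only groups separated by single blank lines (that is, the END block has been dropped). It spells the pattern as '^$', which should select the same empty lines as -xF '', and it treats an empty file as zero groups, a case where "blank lines plus one" would otherwise report 1.

```shell
#!/bin/sh
# count_groups: hypothetical helper, not part of the answer.
# Assumes groups are separated by exactly one blank line and that the
# file has no "=== total" trailer.
count_groups() {
    [ -s "$1" ] || { echo 0; return; }     # empty or missing file: no groups
    blanks=$(grep -c '^$' "$1" || true)    # grep exits 1 when it counts 0
    echo $((blanks + 1))                   # N separators mean N + 1 groups
}

# Made-up listing with two groups.
listing=$(mktemp)
printf '%s\n' 'aaa ./x/1' 'aaa ./y/1' '' 'ccc ./x/3' 'ccc ./y/3' > "$listing"
count_groups "$listing"
rm -f "$listing"
```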
