I’m producing a text file (in zsh on MacOS) containing pathnames and their associated checksums.
# finding all the files in a directory and checksumming them
find . -type f -exec md5 -r {} \; > file1.txt
# sorting the file by the first field (checksum)
LC_ALL=C sort -k 1,1 file1.txt > file2.txt
# using awk to keep all/only lines with duplicated first/checksum fields
# (i.e., duplicate files in the directory)
# I found this awk on the net and it works
# yes, the input file is read twice
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' file2.txt file2.txt > file3.txt
You can produce a sample file by executing the three commands above on the directory of your choice. Here’s a short sample:
0c1fe4bd35f263f1eb3944c3bd6036e7 ./photoshop-conversion/pano-work-02.psb
0c1fe4bd35f263f1eb3944c3bd6036e7 ./photoshop-conversion1/pano-work-02.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion3/pano-03.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion4/pano-03.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion5/pano-03.psb
0d47004b36229ed68a7c1d820bc7bfa3 ./photoshop-conversion6/pano-03.psb
101e5579acc8389796d0155461ef5183 ./photoshop-conversion5/pano-01.psb
101e5579acc8389796d0155461ef5183 ./photoshop-conversion6/pano-01.psb
At this point, file3.txt lists every checksum and pathname that has a duplicated checksum, but there is no white-space (blank lines). I want to add blank lines between the groupings of 2 or more lines with duplicate first fields (in order to make the listing human-readable). This can be done either by another discrete stage (producing file4.txt from file3.txt) or by modifying the prior awk stage to insert new-lines between lines that have different first fields (as file3.txt is produced).
This would do something like:
if (first-field-of-current-line ^= first-field-of-next-line)
then insert new-line after end-of-current-line
In the sample above, this would insert a blank line between the 2nd and 3rd lines and between the 6th and 7th lines.
I don’t care how it’s done — awk, sed, grep — so long as it’s available for zsh in MacOS.
Extra points if you can count how many groups there are (i.e., how many new-lines get inserted).
I’ve tried to change the awk line herein, but I don’t understand it well enough not to break it.
2 Answers
One option is to do almost everything in `zsh`. This uses a fairly conventional shell loop, which can sometimes be easier to understand.

This script should do everything in the process you outlined, from generating the `md5` values to creating the final output. It uses some `zsh`-isms that may not be familiar; most are documented in the `zshexpn` man page and in the online guide.

The temporary files seem ugly and unnecessary.
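A minimal sketch of a one-pass pipeline without temporary files, reconstructed from the description in this answer rather than quoted from it. The function name `group_dups` is made up for illustration; the variables `a`, `prev`, `rs` and `n` follow the names used in the explanation.

```shell
# group_dups: hypothetical helper name. Reads "checksum path" lines that are
# already sorted by checksum, prints only the duplicated ones with a blank
# line between groups, then the number of groups.
group_dups() {
  awk '
    a[$1]++ {                        # checksum already seen: this is a duplicate
      if (prev != "") {              # first repeat, so a new group starts here
        printf "%s%s\n", rs, prev    # rs is empty before the first group only
        prev = ""; rs = "\n"; n++
      }
      print
      next
    }
    { prev = $0 }                    # first sighting: hold the line, print nothing yet
    END { printf "%d groups\n", n }
  '
}

# Assumed full pipeline on macOS (md5 -r prints "checksum path"):
# find . -type f -exec md5 -r {} + | LC_ALL=C sort -k 1,1 | group_dups > file3.txt
```

Lines whose checksum occurs only once are held in `prev` and then overwritten, so they never reach the output.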
Because the input is already sorted, you only need one pass through the Awk script. We keep track of the previous line in `prev`, and if we see another line with the same prefix, we print that too.

In some more detail, `a` keeps track of whether we have seen this MD5 before. If we have, print `prev` and then make sure we don’t print it again if we see additional instances of the same MD5. The variable `rs` will be empty originally, but contain a newline after we have printed the first group (so we avoid printing an empty line before the first group, but print one before each subsequent group). Finally, we increment `n` if this was a new group.

If we fall through to the next unconditional action, that means we are looking at a new unique MD5. Keep track of this line in `prev` in case it would turn out to be a duplicate when the next line is read in.

Finally, in the `END` block, print how many groups we counted.

As a minor additional efficiency hack, I changed the `;` to `+`, which causes `find -exec` to behave like `xargs`.

The addition of the total at the end is slightly ugly, too; perhaps instead simply `grep -cxF '' file.txt` and add 1.
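That count could be wrapped up roughly as follows. The name `count_groups` is hypothetical, and the sketch assumes the listing contains blank lines only as separators between groups, so an empty file is treated as zero groups.

```shell
# count_groups FILE: hypothetical helper. Counts groups as
# (number of completely empty lines) + 1, or 0 when the file is empty.
count_groups() {
  if [ -s "$1" ]; then
    # -x matches whole lines, -F takes the pattern literally, -c prints the count
    echo $(( $(grep -cxF '' "$1") + 1 ))
  else
    echo 0
  fi
}
```

Usage would be something like `count_groups file3.txt`.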