I have 4466 .tsv files with this structure:
file_structure
I want to compare the 4466 files to see how many IDs (the first column) match.
I only found bash commands that compare two files, using "comm". Could you tell me how I could do that with all of them?
Thank you
2 Answers
The question sounds quite vague. So, assuming that you want to extract the IDs that all 4466 files have in common, i.e. IDs that occur at least once in every one of the `*.tsv` files, you can do this (e.g.) in pure Bash using associative arrays and calculating “set intersections” on them.
I read your question as:
If that’s true, we want the intersection of the sets of first-column IDs from all files. We can use the join command to get the intersection of any two files, and we can use the algebraic properties of an intersection to effectively join all files.
Consider the intersection of the IDs in these three files:
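As a purely hypothetical illustration (the file names and values below are made up, not taken from the question), three small files in which 3 is the only shared ID could be created like this:

```shell
# Hypothetical sample data: a header line "ID" followed by a few IDs.
dir=$(mktemp -d)
printf '%s\n' ID 1 2 3 > "$dir/file1.tsv"
printf '%s\n' ID 3 4 5 > "$dir/file2.tsv"
printf '%s\n' ID 3 6 7 > "$dir/file3.tsv"

# Show the three files side by side, one column per file.
paste "$dir/file1.tsv" "$dir/file2.tsv" "$dir/file3.tsv"
```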
"3" is the only ID shared between all three. We can only join two files together at a time, so we need some way to effectively get,
`join (join file1.tsv file2.tsv) file3.tsv`. Fortunately for us, intersections are idempotent and associative, so we can apply join iteratively in a loop over all the files, like so:
When I run that it prints the following:
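A minimal sketch of such a loop, with its output shown in the trailing comments. The three sample files are hypothetical, and sorting each file under `LC_ALL=C` before joining is an assumption added here, because `join` expects its inputs in sorted order.

```shell
# Hypothetical sample data: "3" is the only ID present in all three files.
dir=$(mktemp -d)
printf '%s\n' ID 1 2 3 > "$dir/file1.tsv"
printf '%s\n' ID 3 4 5 > "$dir/file2.tsv"
printf '%s\n' ID 3 6 7 > "$dir/file3.tsv"

export LC_ALL=C   # make sort and join agree on the collation order

common="$dir/common"   # running intersection of everything seen so far
first=1
for f in "$dir"/file*.tsv; do
    if [ "$first" -eq 1 ]; then
        # The first file seeds the running intersection.
        sort "$f" > "$common"
        first=0
    else
        # Intersect the running result with the next (sorted) file.
        sort "$f" | join "$common" - > "$common.new"
        mv "$common.new" "$common"
    fi
done

result=$(cat "$common")
printf '%s\n' "$result"
# prints:
# 3
# ID
```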
Technically, join considers the string "ID" to be one of the things to evaluate… it doesn’t know what a header line is, or what an ID is… it just knows to look in some number of fields for common values. In that example we didn’t specify a field so it defaulted to the first field, and it always found "ID" and it always found "3".
For your files, we need to tell join to:
Here’s my full implementation:
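The listing below is a sketch of one way to do this under stated assumptions, not a definitive implementation: it assumes every file has exactly one header line, tab-separated columns, and the ID in the first column. The function name `common_ids` and the temporary file names are hypothetical. It extracts the ID column before joining, so `join` only ever sees one sorted, de-duplicated column per file.

```shell
# Sketch: print the IDs (first column) that occur in every file given.
# Assumptions: one header line per file, tab-separated, ID in column 1.
export LC_ALL=C          # consistent ordering for sort and join
tab="$(printf '\t')"     # literal tab character

# Usage: common_ids FILE...
# The ( ... ) body runs in a subshell, so its variables stay local.
common_ids() (
    [ "$#" -gt 0 ] || exit 0
    tmp=$(mktemp -d) || exit 1
    first=1
    for f in "$@"; do
        # Drop the header, keep column 1, sort and de-duplicate.
        tail -n +2 "$f" | cut -f1 | sort -u > "$tmp/cur"
        if [ "$first" -eq 1 ]; then
            mv "$tmp/cur" "$tmp/acc"      # seed the running intersection
            first=0
        else
            # Keep only the IDs present in both the running result and this file.
            join -t "$tab" "$tmp/acc" "$tmp/cur" > "$tmp/new"
            mv "$tmp/new" "$tmp/acc"
        fi
    done
    cat "$tmp/acc"
    rm -rf "$tmp"
)
```

Run from the directory holding the files, `common_ids ./*.tsv` would list the shared IDs, and `common_ids ./*.tsv | wc -l` would count them.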
For an explanation of why I use `"$(printf '\t')"`, check out the following for POSIX compliance:
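In short, `printf '\t'` emits a literal tab in any POSIX shell, whereas the `$'\t'` shorthand is not understood by every `sh`. A small sketch of the idiom (the variable names are illustrative only):

```shell
# Capture a literal tab; command substitution strips only trailing newlines,
# so the tab itself survives.
tab="$(printf '\t')"

# The variable holds exactly one character.
len=${#tab}

# Use it as an explicit field delimiter, here with cut.
second=$(printf 'a\tb\n' | cut -d "$tab" -f 2)
printf '%s\n' "$second"   # prints: b
```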