skip to Main Content

I have 4466 .tsv files with this structure:
file_structure

I want to compare the 4466 files to see how many IDs (the first column) matches.
I only found bash commands with two files with "comm". Could you tell me how I could do that?

Thank you

2

Answers


  1. The question sounds quite vague. So, assuming that you want to extract IDs that all 4466 files have in common, i.e. IDs such that each of them occurs at least once in all of the *.tsv files, you can do this (e.g.) in pure Bash using associative arrays and calculating “set intersections” on them.

    #!/bin/bash
    
    # removes all IDs from array $1 that do not occur in array $2.
    intersect_ids() {
      local -n acc="$1"
      local -rn operand="$2"
      local id
      for id in "${!acc[@]}"; do
        ((operand["$id"])) || unset "acc['${id}']"
      done
    }
    
    # prints IDs that occur in all files called *.tsv in directory $1.
    get_ids_intersection() (
      shopt -s nullglob
      local -ar files=("${1}/"*.tsv)
      local -Ai common_ids next_ids
      local file id _
      if ((${#files[@]})); then
        while read -r id _; do ((++common_ids["$id"])); done < "${files[0]}"
        for file in "${files[@]:1}"; do
          while read -r id _; do ((++next_ids["$id"])); done < "$file"
          intersect_ids common_ids next_ids
          next_ids=()
        done
      fi
      for id in "${!common_ids[@]}"; do printf '%sn' "$id"; done
    )
    
    get_ids_intersection /directory/where/tsv/files/are
    
    Login or Signup to reply.
  2. I read your question as:

    Amongst all TSV files, which column IDs are found in every file?

    If that’s true, we want the intersection of all the sets of column IDs from all files. We can use the join command to get the intersection of any two files, and we can use the algebraic properites of an intersection to effectively join all files.

    Consider the intersection of ID for these three files:

    file1.tsv   file2.tsv   file3.tsv
    ---------   ---------   ---------
    ID          ID          ID       
    1           1           2        
    2           3           3        
    3                                
    

    "3" is the only ID shared between all three. We can only join two files together at a time, so we need some way to effectively get, join (join file1.tsv file2.tsv) file3.tsv. Fortunately for us intersections are idempotent and associative, so we can apply join iteratively in a loop over all the files, like so:

    # "Prime" the common file
    cp file1.tsv common.tsv
    
    for TSV in file*.tsv; do
        join "$TSV" common.tsv > myTmp
        mv myTmp common.tsv
    
        echo "After joining $TSV, common IDs are:"
        cat common.tsv
    done
    

    When I run that it prints the following:

    After joining file1.tsv, common IDs are:
    ID
    1
    2
    3
    After joining file2.tsv, common IDs are:
    ID
    1
    3
    After joining file3.tsv, common IDs are:
    ID
    3
    
    1. The first iteration joins file1 with itself (because we primed common with file1); this is where we intersection to be idempotent
    2. The second iteration joins in file2, cutting out ID "2"
    3. The third iteration joins in file3, cutting ID down to just "3"

    Technically, join considers the string "ID" to be one of the things to evaluate… it doesn’t know what a header line is, or an what an ID is… it just knows to look in some number of fields for common values. In that example we didn’t specify a field so it defaulted to the first field, and it always found "ID" and it always found "3".

    For your files, we need to tell join to:

    • separate on a tab character, with -t <TAB-CHAR>
    • only output the join field (which, by default, is the first field), with -o 0

    Here’s my full implementation:

    #!/bin/sh
    TAB="$(printf 't')"
    
    # myJoin joins tsvX with the previously-joined common on 
    # the first field of both files; saving the the first field
    # of the joined output back into common
    myJoin() {
        tsvX="$1"
    
        join -t "$TAB" -o 0 common.tsv "$tsvX" > myTmp.tsv
        mv myTmp.tsv common.tsv
    }
    
    # "Prime" common
    cp input1.tsv common.tsv
    
    for TSV in input*.tsv; do
        myJoin "$TSV"
    done
    
    echo "The common IDs are:"
    tail -n -1 common.tsv
    

    For an explanation of why "$(printf 't')", check out the following for POSIX compliance:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search