
I am using awk 4.1.4 on CentOS 7.6 (x86_64) with 250 GB RAM to transform a row-wide csv file into a column-wide csv based on the last column (Sample_Key). Here is a small example row-wide csv:

Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387

This is what the correct output looks like for the above small csv example

7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111

Here is the awk code (based on https://unix.stackexchange.com/questions/522046/how-to-convert-a-3-column-csv-file-into-a-table-or-matrix) to achieve the row-wide to column-wide transformation:

BEGIN{
printf "Probe_Key,ind_beta,Sample_Keyn";
}

NR > 1{
    ks[$3 $1] = $2; # save the second column using the first and third as index
    k1[$1]++;       # save the first column
    k2[$3]++;       # save the third column
}

END {
                              # After processing input
   for (i in k2)              # loop over third column
      {
        printf "%s,", i ;            # print it as first value in the row
        for (j in k1)                # loop over the first column (index)
        {   
            if ( j < length(k1) )
            {
                printf "%s,",ks[i j];  #and print values ks[third_col first_col]
            }
            else
                printf "%s",ks[i j]; #print last value
        }
        print "";                    # newline
    }
}

However, when I input a relatively large row-wide csv file (5 GB in size), I get long runs of values without any commas in the output, then values with commas start to appear, then values without commas again, and so on. Here is a small excerpt from the portion without commas:

0.04510.03580.81470.57690.8020.89630.90950.10880.66560.92240.05060.78130.86910.07330.03080.0590.06440.80520.05410.91280.16010.19420.08960.0380.95010.7950.92760.9410.95710.2830.90790.94530.69330.62260.90520.1070.95480.93220.01450.93390.92410.94810.87380.86920.9460.93480.87140.84660.33930.81880.94740.71890.11840.05050.93760.94920.06190.89280.69670.03790.8930.84330.9330.9610.61760.04640.09120.15520.91850.76760.94840.61340.02310.07530.93660.86150.79790.05090.95130.14380.06840.95690.04510.75220.03150.88550.82920.11520.11710.5710.94340.50750.02590.97250.94760.91720.37340.93580.84730.81410.95510.93080.31450.06140.81670.04140.95020.73390.87250.93680.20240.05810.93660.80870.04480.8430.33120.88170.92670.92050.71290.01860.93260.02940.91820

and when I use the largest row-wide csv file (126 GB in size), I get the following error:

ERROR (EXIT CODE 255) Unknow error code

How do I debug these two situations, given that the code works for small input sizes?

2 Answers


  1. If your data is already grouped by field 3 and sorted by field 1, you can simply do

    $ awk -F, 'NR==1 {next} 
                     {if(p!=$3) 
                        {if(p) print v; v=$3 FS $2; p=$3} 
                      else v=v FS $2} 
               END{print v}' file
    
    7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
    7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
    

    If not, pre-sorting is a better idea than caching all the data in memory, which will blow up for large input files.
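    For instance, a minimal sketch of that pre-sorting variant (reusing the same awk grouping logic; it assumes the header sits on the first line of file, as in the example above, and that the sample and probe keys sort numerically):

    $ tail -n +2 file | sort -t, -k3,3n -k1,1n |
        awk -F, '{if(p!=$3)
                    {if(p) print v; v=$3 FS $2; p=$3}
                  else v=v FS $2}
             END {print v}'

    7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
    7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111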

  2. Instead of trying to hold all 5 GB (or 126 GB) of data in memory at once and printing everything out together at the end, here’s an approach using sort and GNU datamash to group each set of values together as they come through its input:

    $ datamash --header-in -t, -g3 collapse 2 < input.csv | sort -t, -k1,1n
    7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
    7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
    

    This assumes your file is already grouped with all identical third-column values together in blocks, and the first/second columns already sorted in the appropriate order, like your sample input. If that’s not the case, use the slower:

    $ tail -n +2 input.csv | sort -t, -k3,3n -k1,1n | datamash -t, -g3 collapse 2
    7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
    7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
    

    If you can get rid of that header line so sort can be passed the file directly instead of in a pipe, it might be able to pick a more efficient sorting method knowing the full size in advance.
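    For example, a minimal sketch of that (input_noheader.csv is just a hypothetical name for a headerless copy written out first, at the cost of some extra disk space):

    $ tail -n +2 input.csv > input_noheader.csv
    $ sort -t, -k3,3n -k1,1n input_noheader.csv | datamash -t, -g3 collapse 2

    Here sort is handed a regular file rather than a pipe, so it knows the full input size up front, as described above.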
