I am using Awk 4.1.4 on Centos 7.6 (x86_64) with 250 GB RAM to transform a row-wide csv file into a column-wide csv based on the last column (Sample_Key). Here is an example small row-wide csv
Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387
This is what the correct output looks like for the above small csv example
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
Here is the awk code (based on https://unix.stackexchange.com/questions/522046/how-to-convert-a-3-column-csv-file-into-a-table-or-matrix) to achieve the row to column wide transformation
BEGIN{
printf "Probe_Key,ind_beta,Sample_Keyn";
}
NR > 1{
ks[$3 $1] = $2; # save the second column using the first and third as index
k1[$1]++; # save the first column
k2[$3]++; # save the third column
}
END {
# After processing input
for (i in k2) # loop over third column
{
printf "%s,", i ; # print it as first value in the row
for (j in k1) # loop over the first column (index)
{
if ( j < length(k1) )
{
printf "%s,",ks[i j]; #and print values ks[third_col first_col]
}
else
printf "%s",ks[i j]; #print last value
}
print ""; # newline
}
}
However, when I input a relatively large row-wide csv file (5 GB size), I get tons of values without any commas in the output and then values start to appear with commas and then values without commas. This keeps on going. Here is small excerpt from the portion without comma
0.04510.03580.81470.57690.8020.89630.90950.10880.66560.92240.05 060.78130.86910.07330.03080.0590.06440.80520.05410.91280.16010.19420.08960.0380.95010.7950.92760.9410.95710.2830.90790 .94530.69330.62260.90520.1070.95480.93220.01450.93390.92410.94810.87380.86920.9460.93480.87140.84660.33930.81880.94740 .71890.11840.05050.93760.94920.06190.89280.69670.03790.8930.84330.9330.9610.61760.04640.09120.15520.91850.76760.94840. 61340.02310.07530.93660.86150.79790.05090.95130.14380.06840.95690.04510.75220.03150.88550.82920.11520.11710.5710.94340 .50750.02590.97250.94760.91720.37340.93580.84730.81410.95510.93080.31450.06140.81670.04140.95020.73390.87250.93680.20240.05810.93660.80870.04480.8430.33120.88170.92670.92050.71290.01860.93260.02940.91820
and when I use the largest row-wide csv file (126 GB size), I get the following Error
ERROR (EXIT CODE 255) Unknow error code
How do I debug the two situations when the code works for small input size?
2
Answers
if your data is already grouped in fields 3 and sorted 1, you can just simply do
if not, pre-sorting is better idea instead of caching all the data in memory which will blow up for large input files.
Instead of trying to hold all 5GB’s (Or 126GB’s) worth of data in memory at once and printing out everything all together at the end, here’s an approach using
sort
and GNU datamash to group each set of values together as they come through its input:This assumes your file is already grouped with all the identical third column values together in blocks, and the first/second columns sorted in the appropriate order already, like your sample input. If that’s not the case, the slower:
If you can get rid of that header line so
sort
can be passed the file directly instead of in a pipe, it might be able to pick a more efficient sorting method knowing the full size in advance.