
I have a large number of CSV files which should be filtered and eventually combined, sorted, and stripped of duplicate rows.
Usually this is no big deal with sed and sort, and I would achieve it with something like…

sed -E -e '/KEYWORD/I d' *.csv | sort -bf -ru -t ',' -k 5

But I noticed that one column contains JSON, pretty-printed with line breaks, which causes sed to choke. Not to mention that the commas within the JSON also make sort impossible, as the columns of the CSV are likewise separated by commas.


Fortunately I can tell that, except for the header, a CSV row starts with a 9-digit number (which represents the first column) and ends with an ISO-formatted date (YYYY-MM-DDTHH:MM:SSZ) in column 5, which should be used for sorting. The KEYWORD appears in column 3 (and additionally within the JSON part), and the JSON part appears in column 4.

Unfortunately I have no idea whether it is at all possible to tell sed and sort (or anything else) to ignore line breaks (and commas within the JSON part) and to use regex-like definitions for determining CSV rows and columns.

It doesn’t have to be sed and/or sort; I’m grateful for every suggestion on how this could be achieved!


Each of the CSV files has slightly fewer than 2,500 rows and is usually about 600 kB in size, and I get at least one additional file every day, so the combining, sorting, and de-duplication would be an ongoing process. The files look something like this:

ID,"SHORT, DESCRIPTION",NAME,JSON,TIME
...
123456789,ABC:XYZ,NAME: KEYWORD,"{
    ""name"": ""keyword"",
    ""name1"": 0,
    ""name2"": true,
    ""name3"": [""value""],
    ""name4"": {
        ""name5"": ""keyword""
    }
}",2000-01-01T00:00:00Z
...
123456789,"ABC:XYZ, DEF",NAME: OTHERWORD,"{
    ""name"": ""otherword"",
    ""name1"": 0,
    ""name2"": false,
    ""name3"": [""value1"", ""value2"", ""value3""],
    ""name4"": {
        ""name5"": ""otherword""
    }
}",2000-01-01T01:00:00Z
...
123456789,"ABC:XYZ, GHI",NAME: OTHERWORDS,"{
    ""name"": ""otherwords"",
    ""name1"": 0,
    ""name2"": false,
    ""name3"": [""value11"", ""value12"", ""value13""],
    ""name4"": {
        ""name5"": ""otherwords""
    }
}",2000-01-01T02:00:00Z

3 Answers


  1. gawk 5.3.1 and newer can parse CSV directly and can sort.

    So your sed/sort code might become something like:

    gawk --csv '
        NR==1 { print }                              # print the header once
        FNR==1 { next }                              # skip the header line of every file
        index(toupper($3), "KEYWORD") { next }       # drop rows whose 3rd column matches, like sed "/KEYWORD/I d"
        { rows[$0] = $5 }                            # keying on the whole record de-duplicates; the value is column 5
        END {
            PROCINFO["sorted_in"] = "@val_str_desc"  # iterate by value, i.e. by date, descending
            for (row in rows) print row
        }
    ' *.csv
    

    I didn’t try to work out why you gave -b (ignore leading blanks) and -f (ignore case) options to sort.
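
    Note that the filter above checks only column 3; since you mention the KEYWORD also appears inside the JSON in column 4, a whole-record match like your original sed would test $0 instead. The same sketch with that one line changed:

    gawk --csv '
        NR==1 { print }
        FNR==1 { next }
        index(toupper($0), "KEYWORD") { next }       # $0 is the whole record, so the JSON column is checked too
        { rows[$0] = $5 }
        END {
            PROCINFO["sorted_in"] = "@val_str_desc"
            for (row in rows) print row
        }
    ' *.csv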

  2. It doesn’t have to be sed and/or sort; I’m grateful for every
    suggestion on how this could be achieved!

    I suggest trying csvkit, which is able to cope with newlines in fields. Consider the following example: let file.csv's content be

    number,action,data
    uno,ignorethis,"{""A"":""Able"",
    ""B"":""Baker"",
    ""C"":""Charlie""}"
    dos,keepthis,"{""D"":""Dog"",
    ""E"":""Easy""}"
    tres,keepthistoo,"{""F"":""Fox""}"
    

    then

    csvsql --query 'SELECT * FROM file WHERE action NOT LIKE "%ignore%"' file.csv > filtered.csv
    

    creates filtered.csv with the following content

    number,action,data
    dos,keepthis,"{""D"":""Dog"",
    ""E"":""Easy""}"
    tres,keepthistoo,"{""F"":""Fox""}"
    

    Explanation: I instruct csvsql to select rows where action (the 2nd column) does not contain the substring ignore (case-insensitively). Keep in mind this looks for the keyword solely in the 2nd column, whereas your sed approach looks for the keyword in all columns.
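
    If you do want the sed-like behaviour of matching any column, one way (a sketch, spelling out the toy file's columns explicitly) is

    csvsql --query 'SELECT * FROM file WHERE NOT (number LIKE "%ignore%" OR action LIKE "%ignore%" OR data LIKE "%ignore%")' file.csv > filtered.csv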

    After you manage to make the filtering work as intended, you might add sorting using ORDER BY. For example, to sort by the 1st column in descending order, one might change the query above to

    SELECT * FROM file WHERE action NOT LIKE "%ignore%" ORDER BY 1 DESC
    

    As I am not a sort ninja, I do not know what

    sort -bf -ru -t ',' -k 5
    

    does, and I am therefore unable to translate it into SQL. I can only suggest summoning a coven of sort ninjas, making them express the above sort call in natural language, then summoning a coven of SQL ninjas and making them convert what was prepared into SQL.

    (tested in csvsql 2.0.1)
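
    For what it is worth, a rough and untested translation of that sort call: -u de-duplicates (SELECT DISTINCT), -r reverses the order (DESC), -f folds case (COLLATE NOCASE), and -k 5 sorts on the 5th column, which the question's header names TIME; -b (ignore leading blanks) has no real SQL counterpart. Assuming the combined input were a single file named combined.csv, that would give

    csvsql --query 'SELECT DISTINCT * FROM combined WHERE NAME NOT LIKE "%keyword%" ORDER BY TIME COLLATE NOCASE DESC' combined.csv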

  3. When you know SQL, you can use SQLite to import the CSV.

    After that, sorting and filtering should be easy; even some JSON functions are available:

    sqlite3 -csv %TEMP%\temp.sqlite ".import so79183230.csv mytable"
    sqlite3 %TEMP%\temp.sqlite
    SQLite version 3.47.0 2024-10-21 16:30:22
    Enter ".help" for usage hints.
    sqlite> .header on 
    sqlite> .mode table
    sqlite> SELECT * FROM mytable;
    
    ID         SHORT, DESCRIPTION  NAME              JSON                                        TIME
    ---------  ------------------  ----------------  ------------------------------------------  --------------------
    123456789  ABC:XYZ             NAME: KEYWORD     {                                           2000-01-01T00:00:00Z
                                                         "name": "keyword",
                                                         "name1": 0,
                                                         "name2": true,
                                                         "name3": ["value"],
                                                         "name4": {
                                                             "name5": "keyword"
                                                         }
                                                     }
    123456789  ABC:XYZ, DEF        NAME: OTHERWORD   {                                           2000-01-01T01:00:00Z
                                                         "name": "otherword",
                                                         "name1": 0,
                                                         "name2": false,
                                                         "name3": ["value1", "value2", "value3"],
                                                         "name4": {
                                                             "name5": "otherword"
                                                         }
                                                     }
    123456789  ABC:XYZ, GHI        NAME: OTHERWORDS  {                                           2000-01-01T02:00:00Z
                                                         "name": "otherwords",
                                                         "name1": 0,

    P.S. In the above example, so79183230.csv is the source CSV file.
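
    As an untested sketch of the remaining filtering, de-duplication, and sorting (using the column names from the question's header as imported above):

    sqlite> SELECT DISTINCT * FROM mytable
       ...>  WHERE NAME NOT LIKE '%keyword%'
       ...>  ORDER BY TIME DESC;

    And one of the JSON functions mentioned above, for example extracting a key from the JSON column:

    sqlite> SELECT ID, json_extract(JSON, '$.name') FROM mytable;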
