
I have a large number of CSV files which should be filtered and eventually combined, sorted, and stripped of duplicate rows.
Usually this is no big deal with sed and sort, and I would achieve it with something like…

sed -E -e '/KEYWORD/I d' *.csv | sort -bf -ru -t ',' -k 5

But I noticed that one column contains JSON, pretty-printed with line breaks, which causes sed to choke. Not to mention that the commas within the JSON also make sort impossible, as the columns of the CSV are likewise separated by commas.


Fortunately I can tell that, except for the header, a CSV row starts with a 9-digit number (which represents the first column) and ends with an ISO-formatted date (YYYY-MM-DDTHH:MM:SSZ) in column 5, which should be used for sorting. The KEYWORD appears in column 3 (and additionally within the JSON part), and the JSON part appears in column 4.

Unfortunately I have no idea whether it is at all possible to tell sed and sort (or anything else) to ignore line breaks (and commas within the JSON part) and to use regex-like definitions for determining CSV rows and columns.

It doesn’t have to be sed and/or sort; I’m grateful for every suggestion on how this could be achieved!


Each of the CSV files has slightly fewer than 2,500 rows and is usually about 600 kB in size, and I get at least one additional file every day, so the combining, sorting, and de-duplication would be an ongoing process. The files look something like this:

ID,"SHORT, DESCRIPTION",NAME,JSON,TIME
...
123456789,ABC:XYZ,NAME: KEYWORD,"{
    ""name"": ""keyword"",
    ""name1"": 0,
    ""name2"": true,
    ""name3"": [""value""],
    ""name4"": {
        ""name5"": ""keyword""
    }
}",2000-01-01T00:00:00Z
...
123456789,"ABC:XYZ, DEF",NAME: OTHERWORD,"{
    ""name"": ""otherword"",
    ""name1"": 0,
    ""name2"": false,
    ""name3"": [""value1"", ""value2"", ""value3""],
    ""name4"": {
        ""name5"": ""otherword""
    }
}",2000-01-01T01:00:00Z
...
123456789,"ABC:XYZ, GHI",NAME: OTHERWORDS,"{
    ""name"": ""otherwords"",
    ""name1"": 0,
    ""name2"": false,
    ""name3"": [""value11"", ""value12"", ""value13""],
    ""name4"": {
        ""name5"": ""otherwords""
    }
}",2000-01-01T02:00:00Z

3 Answers


  1. gawk 5.3.1 and newer can parse CSV directly and can sort.

    So your sed/sort code might become something like:

    gawk --csv '
        NR==1 { print }                              # print the header once
        FNR==1 { next }                              # skip the header line of every file
        index(toupper($3), "KEYWORD") { next }       # drop rows whose 3rd column matches, like sed "/KEYWORD/I d"
        { rows[$0] = $5 }                            # keying on the whole record de-duplicates; the value is column 5
        END {
            PROCINFO["sorted_in"] = "@val_str_desc"  # iterate by value, i.e. by date, descending
            for (row in rows) print row
        }
    ' *.csv
    

    I didn’t try to work out why you gave -b (ignore leading blanks) and -f (ignore case) options to sort.
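
    Note that the filter above checks only column 3; since you mention the KEYWORD also appears inside the JSON in column 4, a whole-record match like your original sed would test $0 instead. The same sketch with that one line changed:

    gawk --csv '
        NR==1 { print }
        FNR==1 { next }
        index(toupper($0), "KEYWORD") { next }       # $0 is the whole record, so the JSON column is checked too
        { rows[$0] = $5 }
        END {
            PROCINFO["sorted_in"] = "@val_str_desc"
            for (row in rows) print row
        }
    ' *.csv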

  2. It doesn’t have to be sed and/or sort; I’m grateful for every
    suggestion on how this could be achieved!

    I suggest trying csvkit, which is able to cope with newlines in fields. Consider the following example: let file.csv's content be

    number,action,data
    uno,ignorethis,"{""A"":""Able"",
    ""B"":""Baker"",
    ""C"":""Charlie""}"
    dos,keepthis,"{""D"":""Dog"",
    ""E"":""Easy""}"
    tres,keepthistoo,"{""F"":""Fox""}"
    

    then

    csvsql --query 'SELECT * FROM file WHERE action NOT LIKE "%ignore%"' file.csv > filtered.csv
    

    creates filtered.csv with the following content

    number,action,data
    dos,keepthis,"{""D"":""Dog"",
    ""E"":""Easy""}"
    tres,keepthistoo,"{""F"":""Fox""}"
    

    Explanation: I instruct csvsql to select rows where action (the 2nd column) does not contain the substring ignore (case-insensitively). Keep in mind this looks for the keyword solely in the 2nd column, whereas your sed approach looks for the keyword in all columns.
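
    If you do want the sed-like behaviour of matching any column, one way (a sketch, spelling out the toy file's columns explicitly) is

    csvsql --query 'SELECT * FROM file WHERE NOT (number LIKE "%ignore%" OR action LIKE "%ignore%" OR data LIKE "%ignore%")' file.csv > filtered.csv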

    After you manage to make the filtering work as intended, you might add sorting using ORDER BY. For example, to sort by the 1st column in descending order, one might change the query above to

    SELECT * FROM file WHERE action NOT LIKE "%ignore%" ORDER BY 1 DESC
    

    As I am not a sort ninja, I do not know what

    sort -bf -ru -t ',' -k 5
    

    does, and I am therefore unable to translate it into SQL. I can only suggest summoning a coven of sort ninjas, making them express the above sort call in natural language, then summoning a coven of SQL ninjas and making them convert what was prepared into SQL.

    (tested in csvsql 2.0.1)
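
    For what it is worth, a rough and untested translation of that sort call: -u de-duplicates (SELECT DISTINCT), -r reverses the order (DESC), -f folds case (COLLATE NOCASE), and -k 5 sorts on the 5th column, which the question's header names TIME; -b (ignore leading blanks) has no real SQL counterpart. Assuming the combined input were a single file named combined.csv, that would give

    csvsql --query 'SELECT DISTINCT * FROM combined WHERE NAME NOT LIKE "%keyword%" ORDER BY TIME COLLATE NOCASE DESC' combined.csv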

  3. When you know SQL, you can use SQLite to import the CSV.

    After that, sorting and filtering should be easy; even some JSON functions are available:

    sqlite3 -csv %TEMP%\temp.sqlite ".import so79183230.csv mytable"
    sqlite3 %TEMP%\temp.sqlite
    SQLite version 3.47.0 2024-10-21 16:30:22
    Enter ".help" for usage hints.
    sqlite> .header on 
    sqlite> .mode table
    sqlite> SELECT * FROM mytable;
    
    ID         SHORT, DESCRIPTION  NAME              JSON                                        TIME
    ---------  ------------------  ----------------  ------------------------------------------  --------------------
    123456789  ABC:XYZ             NAME: KEYWORD     {                                           2000-01-01T00:00:00Z
                                                         "name": "keyword",
                                                         "name1": 0,
                                                         "name2": true,
                                                         "name3": ["value"],
                                                         "name4": {
                                                             "name5": "keyword"
                                                         }
                                                     }
    123456789  ABC:XYZ, DEF        NAME: OTHERWORD   {                                           2000-01-01T01:00:00Z
                                                         "name": "otherword",
                                                         "name1": 0,
                                                         "name2": false,
                                                         "name3": ["value1", "value2", "value3"],
                                                         "name4": {
                                                             "name5": "otherword"
                                                         }
                                                     }
    123456789  ABC:XYZ, GHI        NAME: OTHERWORDS  {                                           2000-01-01T02:00:00Z
                                                         "name": "otherwords",
                                                         "name1": 0,

    P.S. In the above example, so79183230.csv is the source CSV file.
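
    As an untested sketch of the remaining filtering, de-duplication, and sorting (using the column names from the question's header as imported above):

    sqlite> SELECT DISTINCT * FROM mytable
       ...>  WHERE NAME NOT LIKE '%keyword%'
       ...>  ORDER BY TIME DESC;

    And one of the JSON functions mentioned above, for example extracting a key from the JSON column:

    sqlite> SELECT ID, json_extract(JSON, '$.name') FROM mytable;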
