I have a big number of CSV files which should be filtered and eventually combined, sorted, and have their duplicate rows removed. Usually that is no big deal with sed and sort, and I would achieve it with something like…
sed -E -e '/KEYWORD/I d' *.csv | sort -bf -ru -t ',' -k 5
But I had to notice that one column contains JSON content, formatted as pretty-print with line breaks, which causes sed to choke. Not to mention that the commas within the JSON also make sort impossible, as the columns of the CSV are separated by a comma as well.
Fortunately I can tell that – except for the header – a CSV row starts with a 9-digit number (which represents the first column) and ends with an ISO-formatted date (YYYY-MM-DDTHH:MM:SSZ) in column 5, which should be used for sorting. The KEYWORD appears in column 3 (and additionally within the JSON part), and the JSON part appears in column 4. Unfortunately I have no idea whether it is at all possible to tell sed and sort, or anything else, to ignore line breaks (and the commas within the JSON part) and to use regex-like definitions for determining CSV rows and columns. It doesn't have to be sed and/or sort; I'm grateful for every suggestion how this could be achieved!

Each of the CSV files has slightly fewer than 2500 rows and is usually about 600 kB big, and I get at least one additional file every day, hence the combining, sorting, and de-duplication would be an ongoing process. They look something like this:
ID,"SHORT, DESCRIPTION",NAME,JSON,TIME
...
123456789,ABC:XYZ,NAME: KEYWORD,"{
""name"": ""keyword"",
""name1"": 0,
""name2"": true,
""name3"": [""value""],
""name4"": {
""name5"": ""keyword""
}
}",2000-01-01T00:00:00Z
...
123456789,"ABC:XYZ, DEF",NAME: OTHERWORD,"{
""name"": ""otherword"",
""name1"": 0,
""name2"": false,
""name3"": [""value1"", ""value2"", ""value3""],
""name4"": {
""name5"": ""otherword""
}
}",2000-01-01T01:00:00Z
...
123456789,"ABC:XYZ, GHI",NAME: OTHERWORDS,"{
""name"": ""otherwords"",
""name1"": 0,
""name2"": false,
""name3"": [""value11"", ""value12"", ""value13""],
""name4"": {
""name5"": ""otherwords""
}
}",2000-01-01T02:00:00Z

3 Answers

gawk 5.3.1 and newer can parse CSV directly and can sort. So, the sed/sort code you showed might become something like:
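(a rough sketch, assuming gawk 5.3 or newer with --csv, that quoted newlines are kept inside a single record in CSV mode, and that checking column 3 for the keyword is sufficient)

gawk --csv '
BEGIN { IGNORECASE = 1 }                     # match the keyword case-insensitively, like sed .../I d
FNR == 1 { if (NR == 1) print; next }        # keep the header of the first file, skip the others
$3 ~ /KEYWORD/ { next }                      # drop rows whose 3rd column contains the keyword
!seen[$0]++ { out[$5 "," NR] = $0 }          # de-duplicate whole rows, key them by the date in column 5
END {
    PROCINFO["sorted_in"] = "@ind_str_desc"  # ISO dates sort correctly as plain strings; newest first
    for (k in out) print out[k]
}' *.csv > combined.csv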
I didn’t try to work out why you gave the -b (ignore leading blanks) and -f (ignore case) options to sort.

I suggest trying csvkit, which is able to cope with newlines in fields. Consider the following example.
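For instance (an illustrative sketch; the column names and values below are made up, note the quoted field with an embedded newline), let file.csv contain

id,action
1,"keep
me"
2,please ignore me
3,IGNORE me too
4,keep me as well

then

csvsql --query "SELECT * FROM file WHERE action NOT LIKE '%ignore%'" file.csv > filtered.csv

creates filtered.csv with roughly the following content:

id,action
1,"keep
me"
4,keep me as well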
Explanation: I instruct csvsql to select the rows where action (the 2nd column) does not contain the substring ignore (case-insensitive). Keep in mind that this looks for the keyword solely in the 2nd column, unlike your sed approach, which looks for the keyword in all columns.
After you manage to make the filtering work as intended, you might add sorting using ORDER BY. For example, to sort on the 1st column in descending order, one might change the query above to something like the following.
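A sketch, again using the hypothetical file.csv from above (its 1st column is id):

csvsql --query "SELECT * FROM file WHERE action NOT LIKE '%ignore%' ORDER BY id DESC" file.csv > filtered.csv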
As I am not a sort ninja, I do not know what your sort invocation does, therefore I am unable to translate it into SQL. I can only suggest summoning a coven of sort ninjas, making them express the above sort call in natural language, then summoning a coven of SQL ninjas and making them convert what was prepared into SQL.

(tested in csvsql 2.0.1)

When you know SQL, you can use SQLite to import the CSV. After that, sorting and filtering should be easy; even some JSON functions are available.
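A rough sketch with the sqlite3 command-line shell (assuming a reasonably recent SQLite, whose .import --csv copes with quoted fields containing commas and newlines; the table name data and the output file name are just examples):

sqlite3 :memory: <<'SQL' > combined.csv
.import --csv so79183230.csv data
.headers on
.mode csv
-- drop rows whose NAME column contains the keyword (LIKE is case-insensitive),
-- de-duplicate whole rows and sort by the TIME column, newest first;
-- json_extract(JSON, '$.name') and friends are available for the JSON column
SELECT DISTINCT * FROM data
WHERE NAME NOT LIKE '%KEYWORD%'
ORDER BY TIME DESC;
SQL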
P.S. In the above example, so79183230.csv is the source CSV file.