I have a column in my dataset with the following format:
hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]
I have managed to count the hashtags like so:
SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST
But I’m not sure if it’s possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:
hashtags count
christmas 3
newyears 2
The datatype is just varchar, but I’m a bit confused on how I should approach this. Any help would be appreciated!
3
Answers
I think you should split all the data in Array to record and then count it with Group by. Something like this query
Hopefully, it will match your request!
That’s a bad idea to store this data. It’s risky because we don’t know whether the text will always be stored in exactly this form. Better save the different strings in separate columns.
Anyway, if you can’t improve that and must deal with this structure, we could basically use a combination of
UNNEST
,STRING_TO_ARRAY
andGROUP BY
to split the hashtags and count them.So the general idea is something like this:
Due to the braces and spaces within your column, this will not produce the correct result.
So we could additionaly use
TRIM
andTRANSLATE
to get rid of all other things except the hashtags.With your sample data, following construct will produce the intended outcome:
See here
But as already said, this is unpleasant and risky.
So if possible, find out which hashtags are possible (it seems these are all special days) and then create columns or a mapping table for them.
This said, store 0 or 1 in the column to indicate whether the hashtag appears or not and then sum the values per column.
You can do it as follows :
Result :
REGEXP_REPLACE
to remove any special characters.string_to_array
to generate an arrayunnest
to expand an array to a set of rowsDemo here