skip to Main Content

I have a column in my dataset with the following format:

hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]

I have managed to count the hashtags like so:

SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST

But I’m not sure if it’s possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:

hashtags     count
christmas     3
newyears      2

The datatype is just varchar, but I’m a bit confused on how I should approach this. Any help would be appreciated!

3

Answers


  1. I think you should split all the data in Array to record and then count it with Group by. Something like this query

    SELECT hashtag, count(*) as hashtag_count
    FROM full_data, unnest(hashtags) s(hashtag)
    GROUP BY hashtag
    ORDER BY hashtag_count DESC
    

    Hopefully, it will match your request!

    Login or Signup to reply.
  2. That’s a bad idea to store this data. It’s risky because we don’t know whether the text will always be stored in exactly this form. Better save the different strings in separate columns.

    Anyway, if you can’t improve that and must deal with this structure, we could basically use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.

    So the general idea is something like this:

    WITH unnested AS
    (SELECT
    UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
    FROM full_data)
    SELECT hashtag, COUNT(hashtag) 
    FROM unnested
    GROUP BY hashtag
    ORDER BY COUNT(hashtag) DESC;
    

    Due to the braces and spaces within your column, this will not produce the correct result.

    So we could additionaly use TRIM and TRANSLATE to get rid of all other things except the hashtags.

    With your sample data, following construct will produce the intended outcome:

    WITH unnested AS
    (SELECT
    TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')),'#,[,]','')) AS hashtag
    FROM full_data)
    SELECT hashtag, COUNT(hashtag) 
    FROM unnested
    GROUP BY hashtag
    ORDER BY COUNT(hashtag) DESC;
    

    See here

    But as already said, this is unpleasant and risky.

    So if possible, find out which hashtags are possible (it seems these are all special days) and then create columns or a mapping table for them.

    This said, store 0 or 1 in the column to indicate whether the hashtag appears or not and then sum the values per column.

    Login or Signup to reply.
  3. You can do it as follows :

    select unnest(string_to_array(REGEXP_REPLACE(hashtags,'[^w,]+','','g'), ',')) as tags, count(1)
    from full_data
    group by tags
    order by count(1) desc
    

    Result :

    tags         count
    christmas      3
    newyears       2
    easter         2
    fourthofjuly   1
    valentines     1
    

    REGEXP_REPLACE to remove any special characters.

    string_to_array to generate an array

    unnest to expand an array to a set of rows

    Demo here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search