Count the most popular occurrences of a hashtag in a string column postgreSQL

counterculture
January 30, 2023
160 views
2 votes
3 Answers

I have a column in my dataset with the following format:

hashtags
1 [#newyears, #christmas, #christmas]
2 [#easter, #newyears, #fourthofjuly]
3 [#valentines, #christmas, #easter]

I have managed to count the hashtags like so:

SELECT hashtags, (LENGTH(hashtags) - LENGTH(REPLACE(hashtags, ',', '')) + 1) AS hashtag_count
FROM full_data
ORDER BY hashtag_count DESC NULLS LAST

But I’m not sure if it’s possible to count the occurrences of each hashtag. Is it possible to return the count of the most popular hashtags in the following format:

hashtags     count
christmas     3
newyears      2

The datatype is just varchar, but I’m a bit confused on how I should approach this. Any help would be appreciated!

Tags: postgresql sql

Answers

- MabuKloesen
- January 30, 2023 at 5:27 am
- 0 votes
0
I think you should split all the data in Array to record and then count it with Group by. Something like this query
```
SELECT hashtag, count(*) as hashtag_count
FROM full_data, unnest(hashtags) s(hashtag)
GROUP BY hashtag
ORDER BY hashtag_count DESC
```
Hopefully, it will match your request!
Login or Signup to reply.

- JonasMetzler
- January 30, 2023 at 8:49 am
- 0 votes
0
That’s a bad idea to store this data. It’s risky because we don’t know whether the text will always be stored in exactly this form. Better save the different strings in separate columns.

Anyway, if you can’t improve that and must deal with this structure, we could basically use a combination of UNNEST, STRING_TO_ARRAY and GROUP BY to split the hashtags and count them.

So the general idea is something like this:
```
WITH unnested AS
(SELECT
UNNEST(STRING_TO_ARRAY(hashtags, ',')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag) 
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
```
Due to the braces and spaces within your column, this will not produce the correct result.

So we could additionaly use TRIM and TRANSLATE to get rid of all other things except the hashtags.

With your sample data, following construct will produce the intended outcome:
```
WITH unnested AS
(SELECT
TRIM(TRANSLATE(UNNEST(STRING_TO_ARRAY(hashtags, ',')),'#,[,]','')) AS hashtag
FROM full_data)
SELECT hashtag, COUNT(hashtag) 
FROM unnested
GROUP BY hashtag
ORDER BY COUNT(hashtag) DESC;
```
See here

But as already said, this is unpleasant and risky.

So if possible, find out which hashtags are possible (it seems these are all special days) and then create columns or a mapping table for them.

This said, store 0 or 1 in the column to indicate whether the hashtag appears or not and then sum the values per column.
Login or Signup to reply.

- SelVazi
- January 30, 2023 at 10:03 am
- 0 votes
0
You can do it as follows :
```
select unnest(string_to_array(REGEXP_REPLACE(hashtags,'[^w,]+','','g'), ',')) as tags, count(1)
from full_data
group by tags
order by count(1) desc
```
Result :
```
tags         count
christmas      3
newyears       2
easter         2
fourthofjuly   1
valentines     1
```
REGEXP_REPLACE to remove any special characters.

string_to_array to generate an array

unnest to expand an array to a set of rows

Demo here
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.