Given a row in an example table as:
Column A |
---|
abcccaaaaddeeaabbaa |
How can we get the following output using MYSQL?
sequence | occurrences |
---|---|
aaaa | 1 |
ccc | 1 |
dd | 1 |
ee | 1 |
aa | 2 |
bb | 1 |
a | 1 |
b | 1 |
Tried to compute this using MySQL. Cannot get the right logic for the same.
Edit: My latest attempt with BigQuery:
WITH Input AS (
SELECT ‘abcccaaaaddeeaabbaa’ AS str
),
Exploded AS (
SELECT
SUBSTR(str, pos, 1) AS char,
pos,
LAG(SUBSTR(str, pos, 1)) OVER (ORDER BY pos) AS prev_char
FROM Input,
UNNEST(GENERATE_ARRAY(1, LENGTH(str))) AS pos
),
Grouped AS (
SELECT
char,
pos,
IF(char = prev_char, 0, 1) AS new_group
FROM Exploded
),
Sequenced AS (
SELECT
char,
pos,
SUM(new_group) OVER (ORDER BY pos) AS sequence_id
FROM Grouped
),
Aggregated AS (
SELECT
sequence_id,
STRING_AGG(char, ” ORDER BY pos) AS sequence
FROM Sequenced
GROUP BY sequence_id
),
FinalCounts AS (
SELECT
sequence,
COUNT(*) AS occurrences
FROM Aggregated
GROUP BY sequence
)
select * from FinalCounts
2
Answers
Split it up into one row per character, then identify runs of characters and summarize them:
fiddle
here is a shorter query
result:
sample dbfiddle