I have a large table with around 75K points. It holds time series associated with countries, in the following format:
| country | value | date |
I am using a FastAPI app to serve this data to my front end. Since the chart library I use will not display more than 300 points, I’d like to downsample each country’s time series to 300 equally spaced points, including the first and last point.
So far, I have come up with this SQL solution:
WITH E AS (
    SELECT *,
           PERCENT_RANK() OVER w AS rw
    FROM my_table
    WINDOW w AS (PARTITION BY country ORDER BY date)
)
SELECT *, FLOOR(300 * rw) AS sel
FROM E
GROUP BY E.country, sel;
Is there a better or more optimized way to do this?
I’m already using indexes, and I’m unsure whether partitioning the table would have any impact.
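For context, the index that lines up with the window’s PARTITION BY country ORDER BY date would be a composite one on both columns, along the lines of the sketch below (the name is illustrative; my actual index definitions aren’t shown here):

-- Illustrative composite index covering (country, date),
-- matching the window's PARTITION BY / ORDER BY.
CREATE INDEX idx_country_date ON my_table (country, date);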
Example
If I downsample to 5 points per country, I would like to go from this:
+---------+--------+------------+
| country | value | date |
+---------+--------+------------+
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-04 |
| ar | 1.3 | 2010-02-05 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.2 | 2010-02-07 |
| ar | 1.4 | 2010-02-08 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.7 | 2010-02-10 |
| ar | 1.4 | 2010-02-11 |
| ar | 1.6 | 2010-02-12 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.5 | 2010-02-14 |
| ar | 1.3 | 2010-02-15 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.3 | 2010-02-04 |
| fr | 1.4 | 2010-02-05 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.9 | 2010-02-07 |
| fr | 1.3 | 2010-02-08 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.2 | 2010-02-10 |
| fr | 1.3 | 2010-02-11 |
| fr | 1.5 | 2010-02-12 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-14 |
| fr | 1.5 | 2010-02-15 |
| fr | 1.3 | 2010-02-16 |
+---------+--------+------------+
To this:
+---------+--------+------------+
| country | value | date |
+---------+--------+------------+
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-16 |
+---------+--------+------------+
Edit 2
With @Rick James’s answer, I’ve come up with this:
SELECT country,
       FLOOR(300 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g,
       date, value
FROM my_table
JOIN ( SELECT TO_DAYS(MIN(date)) AS da,
              TO_DAYS(MAX(date)) AS dd,
              country AS cc
       FROM my_table
       GROUP BY country
     ) AS x ON my_table.country = x.cc
GROUP BY country, g;
I can tell this is faster, but as I am no expert, I can’t tell whether this is the best I can get or whether it is an absurd way of doing it.
2 Answers
I would develop a formula that maps the date to a number such that MIN(date) maps to 0.5 and MAX(date) maps to 299.5.
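Spelled out for 300 points, the multiplier is 299 (that is, 300 - 1); the day numbers below are made up just to check the two endpoints:

-- Hypothetical day numbers: 733000 stands in for TO_DAYS(MIN(date)),
-- 734000 for TO_DAYS(MAX(date)). The first date maps to 0.5, the last to 299.5.
SELECT 0.5 + 299 * (733000 - 733000) / (734000 - 733000) AS first_point,  -- 0.5
       0.5 + 299 * (734000 - 733000) / (734000 - 733000) AS last_point;   -- 299.5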
Note that AVG will effectively smooth out the curve rather than picking a particular entry (which might be especially high or low). Putting the pieces together:
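Here is a sketch of the resulting query for the 5-point example, assuming MySQL, the question’s my_table schema, and TO_DAYS (the fiddle linked below may differ in details):

-- Bucket each row into 0..4 by mapping MIN(date) to 0.5 and MAX(date) to 4.5,
-- then average the values that fall into each bucket.
SELECT t.country,
       FLOOR(0.5 + 4 * (TO_DAYS(t.date) - x.da) / (x.dd - x.da)) AS bucket,
       MIN(t.date)  AS bucket_date,
       AVG(t.value) AS value
FROM my_table AS t
JOIN ( SELECT country,
              TO_DAYS(MIN(date)) AS da,
              TO_DAYS(MAX(date)) AS dd
       FROM my_table
       GROUP BY country
     ) AS x ON x.country = t.country
GROUP BY t.country, bucket
ORDER BY t.country, bucket;

Each country comes back with at most 5 rows (buckets 0 through 4), and AVG smooths each bucket rather than picking one row out of it.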
The hard-coded * 4 should be * (sample_size - 1).
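For instance, with a MySQL user variable standing in for the target sample size (the fiddle itself hard-codes the number):

-- Keep the target sample size in a user variable ...
SET @sample_size := 300;
-- ... and use it in the bucket expression instead of the literal 4:
--   FLOOR(0.5 + (@sample_size - 1) * (TO_DAYS(t.date) - x.da) / (x.dd - x.da)) AS bucket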
Demo: https://dbfiddle.uk/llmch7nm
EDIT:
The above behaves badly when the starting data set size is smaller than the target sample size.
This demo has two different fudges to work around that: https://dbfiddle.uk/J0Nb_CNb