I have a large table with around 75K points. It holds time series associated with countries, in the following format:
| country | value | date |
I am using a FastAPI app to serve this data to my front end. Since the chart library I use will not display more than 300 points, I’d like to downsample each country’s time series to 300 equally spaced points, including the first and last point.
So far, I have come up with this SQL solution:
WITH E AS (
    SELECT *,
           PERCENT_RANK() OVER w AS rw
    FROM my_table
    WINDOW w AS (PARTITION BY country ORDER BY date)
)
SELECT *, FLOOR(300 * rw) AS sel
FROM E
GROUP BY E.country, sel;
Is there a better or more optimized way to do this?
I’m already using indexes, and I’m unsure whether partitioning the table would have any impact.
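For context, the index that lines up with the window’s PARTITION BY country ORDER BY date would be a composite one on both columns, along the lines of the sketch below (the name is illustrative; my actual index definitions aren’t shown here):

-- Illustrative composite index covering (country, date),
-- matching the window's PARTITION BY / ORDER BY.
CREATE INDEX idx_country_date ON my_table (country, date);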
Example
If I downsample to 5 points per country, I would like to go from this:
+---------+--------+------------+
| country | value | date |
+---------+--------+------------+
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-04 |
| ar | 1.3 | 2010-02-05 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.2 | 2010-02-07 |
| ar | 1.4 | 2010-02-08 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.7 | 2010-02-10 |
| ar | 1.4 | 2010-02-11 |
| ar | 1.6 | 2010-02-12 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.5 | 2010-02-14 |
| ar | 1.3 | 2010-02-15 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.3 | 2010-02-04 |
| fr | 1.4 | 2010-02-05 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.9 | 2010-02-07 |
| fr | 1.3 | 2010-02-08 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.2 | 2010-02-10 |
| fr | 1.3 | 2010-02-11 |
| fr | 1.5 | 2010-02-12 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-14 |
| fr | 1.5 | 2010-02-15 |
| fr | 1.3 | 2010-02-16 |
+---------+--------+------------+
To this:
+---------+--------+------------+
| country | value | date |
+---------+--------+------------+
| ar | 1.4 | 2010-02-03 |
| ar | 1.4 | 2010-02-06 |
| ar | 1.5 | 2010-02-09 |
| ar | 1.4 | 2010-02-13 |
| ar | 1.2 | 2010-02-16 |
| fr | 1.3 | 2010-02-03 |
| fr | 1.6 | 2010-02-06 |
| fr | 1.3 | 2010-02-09 |
| fr | 1.3 | 2010-02-13 |
| fr | 1.3 | 2010-02-16 |
+---------+--------+------------+
Edit 2
With @Rick James’s answer, I’ve come up with this:
SELECT country,
       FLOOR(300 * (TO_DAYS(date) - x.da) / (x.dd - x.da)) AS g,
       date, value
FROM my_table
JOIN ( SELECT TO_DAYS(MIN(date)) AS da,
              TO_DAYS(MAX(date)) AS dd,
              country AS cc
       FROM my_table
       GROUP BY country
     ) AS x ON my_table.country = x.cc
GROUP BY country, g;
I can tell this is faster, but as I am no expert, I can’t tell whether this is the best I can get or whether it is an absurd way of doing it.
2 Answers
I would develop a formula that maps the date to a number such that MIN(date) maps to 0.5 and MAX(date) maps to 299.5.
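Spelled out for 300 points, the multiplier is 299 (that is, 300 - 1); the day numbers below are made up just to check the two endpoints:

-- Hypothetical day numbers: 733000 stands in for TO_DAYS(MIN(date)),
-- 734000 for TO_DAYS(MAX(date)). The first date maps to 0.5, the last to 299.5.
SELECT 0.5 + 299 * (733000 - 733000) / (734000 - 733000) AS first_point,  -- 0.5
       0.5 + 299 * (734000 - 733000) / (734000 - 733000) AS last_point;   -- 299.5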
Note that AVG will effectively smooth out the curve rather than picking a particular entry (which might be especially high or low). Putting the pieces together:
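Here is a sketch of the resulting query for the 5-point example, assuming MySQL, the question’s my_table schema, and TO_DAYS (the fiddle linked below may differ in details):

-- Bucket each row into 0..4 by mapping MIN(date) to 0.5 and MAX(date) to 4.5,
-- then average the values that fall into each bucket.
SELECT t.country,
       FLOOR(0.5 + 4 * (TO_DAYS(t.date) - x.da) / (x.dd - x.da)) AS bucket,
       MIN(t.date)  AS bucket_date,
       AVG(t.value) AS value
FROM my_table AS t
JOIN ( SELECT country,
              TO_DAYS(MIN(date)) AS da,
              TO_DAYS(MAX(date)) AS dd
       FROM my_table
       GROUP BY country
     ) AS x ON x.country = t.country
GROUP BY t.country, bucket
ORDER BY t.country, bucket;

Each country comes back with at most 5 rows (buckets 0 through 4), and AVG smooths each bucket rather than picking one row out of it.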
The hard-coded * 4 should be * (sample_size - 1).
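For instance, with a MySQL user variable standing in for the target sample size (the fiddle itself hard-codes the number):

-- Keep the target sample size in a user variable ...
SET @sample_size := 300;
-- ... and use it in the bucket expression instead of the literal 4:
--   FLOOR(0.5 + (@sample_size - 1) * (TO_DAYS(t.date) - x.da) / (x.dd - x.da)) AS bucket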
Demo: https://dbfiddle.uk/llmch7nm
EDIT:
The above behaves badly when the starting data set size is smaller than the target sample size.
This demo has two different fudges to work around that: https://dbfiddle.uk/J0Nb_CNb