skip to Main Content

I have a large database with around 75K points. This database represents a time series associated with countries with the following format:

|country | value | date |

I am using a fastapi API to serve this data on my front. Since the chart library I use will not use more than 300 points, I’d like to downsample the time series so that I have 300 points equally spaced, including first point and last point for each country.

As for now, I have come up with this SQL solution:

WITH
    E AS (
        SELECT *,
        PERCENT_RANK() OVER w as rw
        FROM my_table
        WINDOW w AS (PARTITION BY country ORDER BY date)
    )
SELECT *, FLOOR(300*rw) as sel FROM E GROUP BY E.country, sel;

I was wondering if there was a better way to do this / more optimized way?

I’m already using indexes, and I’m unsure if partitions on the SQL database has any impact

Example

If I downsample to 5 points per country, I would like to go from this:

+---------+--------+------------+
| country | value  | date       |
+---------+--------+------------+
| ar      |   1.4  | 2010-02-03 |
| ar      |   1.4  | 2010-02-04 |
| ar      |   1.3  | 2010-02-05 |
| ar      |   1.4  | 2010-02-06 |
| ar      |   1.2  | 2010-02-07 |
| ar      |   1.4  | 2010-02-08 |
| ar      |   1.5  | 2010-02-09 |
| ar      |   1.7  | 2010-02-10 |
| ar      |   1.4  | 2010-02-11 |
| ar      |   1.6  | 2010-02-12 |
| ar      |   1.4  | 2010-02-13 |
| ar      |   1.5  | 2010-02-14 |
| ar      |   1.3  | 2010-02-15 |
| ar      |   1.2  | 2010-02-16 |
| fr      |   1.3  | 2010-02-03 |
| fr      |   1.3  | 2010-02-04 |
| fr      |   1.4  | 2010-02-05 |
| fr      |   1.6  | 2010-02-06 |
| fr      |   1.9  | 2010-02-07 |
| fr      |   1.3  | 2010-02-08 |
| fr      |   1.3  | 2010-02-09 |
| fr      |   1.2  | 2010-02-10 |
| fr      |   1.3  | 2010-02-11 |
| fr      |   1.5  | 2010-02-12 |
| fr      |   1.3  | 2010-02-13 |
| fr      |   1.3  | 2010-02-14 |
| fr      |   1.5  | 2010-02-15 |
| fr      |   1.3  | 2010-02-16 |
+---------+--------+------------+

To this:

+---------+--------+------------+
| country | value  | date       |
+---------+--------+------------+
| ar      |   1.4  | 2010-02-03 |
| ar      |   1.4  | 2010-02-06 |
| ar      |   1.5  | 2010-02-09 |
| ar      |   1.4  | 2010-02-13 |
| ar      |   1.2  | 2010-02-16 |
| fr      |   1.3  | 2010-02-03 |
| fr      |   1.6  | 2010-02-06 |
| fr      |   1.3  | 2010-02-09 |
| fr      |   1.3  | 2010-02-13 |
| fr      |   1.3  | 2010-02-16 |
+---------+--------+------------+

Edit 2

With @Rick James answer, I’ve come up with this:

SELECT country, FLOOR(300*(TO_DAYS(date) - x.da) /(x.dd - x.da)) as g, date, value
    FROM table
    JOIN ( SELECT TO_DAYS(MIN(date)) AS da, TO_DAYS(MAX(date)) as dd, country as cc
              FROM table GROUP BY country
         ) AS x
         ON table.country = x.cc
    GROUP BY country, g;

I can tell this is faster, but as I am no expert, I can’t tell if it is the best I can get / if it is not an absurd way of doing this

2

Answers


  1. I would develop a formula that maps the date a number such that MIN(date) maps to 0.5 and MAX(date) maps to 299.5.

    SELECT country, ROUND(...that formula...), AVG(value)
        FROM t
        GROUP BY 1,2;
    

    Note that AVG will effectively smooth out the curve rather than picking a particular entry (which might be especially high or low.).

    Putting the pieces together:

     SELECT country,
           ROUND((TO_DAYS(date) - x.da) / x.span),
           AVG(value)
        FROM my_table
        JOIN ( SELECT TO_DAYS(MIN(date)) AS da,
                      ROUND((TO_DAYS(MAX(date)) -
                             TO_DAYS(MIN(date))) / 6) AS span
                  FROM my_table
             ) AS x
        GROUP BY 1,2;
    
    Login or Signup to reply.
  2. WITH
      sorted AS
    (
      SELECT
        *,
        ROW_NUMBER() OVER (
          PARTITION BY country
              ORDER BY date     -- choose a different column name!
        ) - 1
          AS row_id,
        COUNT(*) OVER (
          PARTITION BY country
        ) - 1
          AS step
      FROM
        my_table
    )
    SELECT
      sorted.*
    FROM
      sorted
    WHERE
      mod( row_id             * 4, step)
      <
      mod((row_id + step - 1) * 4, step)
    ORDER BY
      country, row_id
    

    The hard-coded * 4 should be * (sample_size - 1)

    Demo: https://dbfiddle.uk/llmch7nm

    EDIT:

    The above behaves badly when the starting data set size is smaller than the target sample size.

    This demo has two different fudges to work around that : https://dbfiddle.uk/J0Nb_CNb

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search