I’m looking for assistance with writing a PostgreSQL query in Apache Superset.
I have metered consumption data, where the meters have unique IDs. Normally, a meter is read at the start of every hour, and the difference is calculated in the consumption column. There are times, however, when an error occurs and the meter is not read, leading to missing hours in the data. I would like to add new rows and create a new column that would say "Missing Data" in those cases.
Here is an example of what the data looks like. The two read_time_local columns are of type timestamp without time zone:
meter_id | start_read_time_local | end_read_time_local | start_read | end_read | consumption |
---|---|---|---|---|---|
1111 | 9/10/2024 0:00 | 9/10/2024 1:00 | 79118.91 | 79118.93 | 0.02 |
1111 | 9/10/2024 3:00 | 9/10/2024 4:00 | 79122.47 | 79123.99 | 1.52 |
2222 | 9/10/2024 0:00 | 9/10/2024 1:00 | 1937135 | 1937174 | 39 |
2222 | 9/10/2024 1:00 | 9/10/2024 2:00 | 1937174 | 1937191 | 17 |
2222 | 9/10/2024 2:00 | 9/10/2024 3:00 | 1937191 | 1937197 | 6 |
2222 | 9/10/2024 3:00 | 9/10/2024 4:00 | 1937197 | 1937202 | 5 |
What I would like is to create a query that would do this:
meter_id | start_read_time_local | end_read_time_local | start_read | end_read | consumption | Notes |
---|---|---|---|---|---|---|
1111 | 9/10/2024 0:00 | 9/10/2024 1:00 | 79118.91 | 79118.93 | 0.02 | |
1111 | 9/10/2024 1:00 | 9/10/2024 2:00 | | | null | Missing Data |
1111 | 9/10/2024 2:00 | 9/10/2024 3:00 | | | null | Missing Data |
1111 | 9/10/2024 3:00 | 9/10/2024 4:00 | 79122.47 | 79123.99 | 1.52 | |
2222 | 9/10/2024 0:00 | 9/10/2024 1:00 | 1937135 | 1937174 | 39 | |
2222 | 9/10/2024 1:00 | 9/10/2024 2:00 | 1937174 | 1937191 | 17 | |
2222 | 9/10/2024 2:00 | 9/10/2024 3:00 | 1937191 | 1937197 | 6 | |
2222 | 9/10/2024 3:00 | 9/10/2024 4:00 | 1937197 | 1937202 | 5 | |
I have looked at and tried various solutions, but have not been able to do this successfully. My initial attempts used a combination of the min/max of `end_read_time_local` and `generate_series()` to create a sequence of timestamps that I would then right join to the `water_consumption` table, so that the missing hours would have no consumption/reading values.
I was using this command to create the series and just make a list of 10 hours:
select (date_trunc('hour',current_timestamp) + ((a-1)||' hour')::interval)::timestamptz
from generate_series(1, 10, 1) as a
I’m struggling to join it, and I have yet to figure out how I will join these for each meter ID. In other words, I need the join to check to see if meter 1111 has a complete record as well as meter 2222.
I have a table `account` with distinct devices, where `device_id` matches `water_consumption.meter_id`.
After feedback I tried:
SELECT m.device_id
, date_trunc('hour', LOCALTIMESTAMP) + (h - 1) * interval '1 hour' AS start_read_time_local
, date_trunc('hour', LOCALTIMESTAMP) + h * interval '1 hour' AS end_read_time_local
, w.start_read
, w.end_read
, w.consumption
, CASE WHEN w.meter_id IS NULL THEN 'Missing Data' END AS notes
FROM account m
CROSS JOIN generate_series(1, 10) h
LEFT JOIN water_consumption w USING (meter_id, start_read_time_local)
But I got this Redshift error:
column "meter_id" specified in USING clause does not exist in left table
2 Answers
Build a Cartesian product (`CROSS JOIN`) of all relevant meters and hours, then `LEFT JOIN` to the existing data. This uses your table `account`, which contains a distinct set of relevant devices, where `account.device_id` equals `water_consumption.meter_id`.
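A minimal sketch of what such a query could look like in PostgreSQL, using the table and column names from the question. The choice of time range (first to last `start_read_time_local` across all meters) is an assumption for illustration, and the join uses `ON` because `account` has no column named `meter_id`:

```sql
-- Sketch only: the hour range is derived from the data itself (an assumption).
SELECT a.device_id AS meter_id
     , h.start_read_time_local
     , h.start_read_time_local + interval '1 hour' AS end_read_time_local
     , w.start_read
     , w.end_read
     , w.consumption
       -- no matching reading for this meter and hour: flag the generated row
     , CASE WHEN w.meter_id IS NULL THEN 'Missing Data' END AS notes
FROM   account a
CROSS  JOIN (
   -- one row per hour between the first and the last reading in the table
   SELECT generate_series(min(start_read_time_local)
                        , max(start_read_time_local)
                        , interval '1 hour') AS start_read_time_local
   FROM   water_consumption
   ) h
LEFT   JOIN water_consumption w
       ON  w.meter_id = a.device_id
       AND w.start_read_time_local = h.start_read_time_local
ORDER  BY a.device_id, h.start_read_time_local;
```

With the sample data, meter 1111 gets generated rows for the 1:00 and 2:00 hours with empty readings and 'Missing Data' in `notes`. This is written for PostgreSQL; Redshift handles `generate_series()` differently, so the hour list may need to be produced another way there.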
The `USING` clause I had in my initial answer builds on equal column names. Use `ON` instead, if that shortcut is not possible.
To fill in the gaps in the hours of a date, create a grid of all hours per date and meter_id, left join your readings data, and use the `COALESCE()` function to show every hour. Then define the notes column as 'Missing Data' for hours without readings.
… create a CTE grid that will have a row for every hour of every date with a reading, for every meter_id …
… if you want the result as presented in your question, filter the data with a WHERE clause. Without the WHERE clause you will get all hours of a date per meter_id (see it in the fiddle, link is below)
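The approach described above could be sketched roughly like this in PostgreSQL. Table and column names are taken from the question; the alias names (`grid`, `first_hour`, `last_hour`, `read_date`) and the exact WHERE condition (keep only hours between each meter's first and last reading) are assumptions, since the description does not spell them out:

```sql
WITH grid AS (
   -- one row per meter, per date that has any reading, per hour of that date
   SELECT m.meter_id
        , m.first_hour
        , m.last_hour
        , d.read_date + h * interval '1 hour' AS start_read_time_local
   FROM  (
      SELECT meter_id
           , min(start_read_time_local) AS first_hour
           , max(start_read_time_local) AS last_hour
      FROM   water_consumption
      GROUP  BY meter_id
      ) m
   CROSS  JOIN (
      SELECT DISTINCT start_read_time_local::date AS read_date
      FROM   water_consumption
      ) d
   CROSS  JOIN generate_series(0, 23) AS h
   )
SELECT g.meter_id
     , g.start_read_time_local
       -- use the stored end time if there is a reading, else the grid hour + 1
     , COALESCE(w.end_read_time_local
              , g.start_read_time_local + interval '1 hour') AS end_read_time_local
     , w.start_read
     , w.end_read
     , w.consumption
     , CASE WHEN w.meter_id IS NULL THEN 'Missing Data' END AS notes
FROM   grid g
LEFT   JOIN water_consumption w
       ON  w.meter_id = g.meter_id
       AND w.start_read_time_local = g.start_read_time_local
-- drop this WHERE clause to get all 24 hours of every date per meter
WHERE  g.start_read_time_local BETWEEN g.first_hour AND g.last_hour
ORDER  BY g.meter_id, g.start_read_time_local;
```

The WHERE clause only references columns of `grid`, so it trims the generated hours without turning the left join into an inner join.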
See the fiddle here.
NOTE:
This is tested with PostgreSQL, but it should work on Redshift with one adjustment in the grid CTE: adding hours to a date uses a different syntax there.