I have some partially columnar data like this:
"hello", "2024 JAN", "2024 FEB"
"a", 0, 1
If it were purely columnar, it would look like:
"hello", "year", "month", "value"
"a", 2024, "JAN", 0
"a", 2024, "FEB", 1
Suppose the data is in the form of a numpy array, like this:
import numpy as np
data = np.array([["hello", "2024 JAN", "2024 FEB"], ["a", "0", "1"]], dtype="<U")
data
array([['hello', '2024 JAN', '2024 FEB'],
       ['a', '0', '1']], dtype='<U8')
Imagine also that I created a table:
import duckdb as ddb
conn = ddb.connect("hello.db")
conn.execute("CREATE TABLE columnar (hello VARCHAR, year UINTEGER, month VARCHAR, value VARCHAR);")
How could I go about efficiently inserting data into the DuckDB table columnar?
The naive/easy way would be to brute-force transform the data into a columnar format in-memory, in Python, before inserting it into the DuckDB table.
Here I mean specifically:
import re
data_header = data[0]
data_proper = data[1:]
date_pattern = re.compile(r"(?P<year>\d+) (?P<month>JAN|FEB)")
common_labels: list[str] = []
known_years: set[int] = set()
known_months: set[str] = set()
header_to_date: dict[str, tuple[int, str]] = dict()
for header in data_header:
    if matches := date_pattern.match(header):
        year, month = int(matches["year"]), str(matches["month"])
        known_years.add(year)
        known_months.add(month)
        header_to_date[header] = (year, month)
    else:
        common_labels.append(header)
# hello, year, month, value
new_rows_per_old_row = len(known_years) * len(known_months)
new_headers = ["year", "month", "value"]
purely_columnar = np.empty(
    (
        1 + data_proper.shape[0] * new_rows_per_old_row,
        len(common_labels) + len(new_headers),
    ),
    dtype=np.object_,
)
purely_columnar[0] = common_labels + ["year", "month", "value"]
for rx, row in enumerate(data_proper):
    common_data = []
    ym_data = []
    for header, element in zip(data_header, row):
        if header in common_labels:
            common_data.append(element)
        else:
            year, month = header_to_date[header]
            ym_data.append([year, month, element])
    for yx, year_month_value in enumerate(ym_data):
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, : len(common_labels)
        ] = common_data
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, len(common_labels) :
        ] = year_month_value
print(f"{purely_columnar=}")
purely_columnar=array([[np.str_('hello'), 'year', 'month', 'value'],
       [np.str_('a'), 2024, 'JAN', np.str_('0')],
       [np.str_('a'), 2024, 'FEB', np.str_('1')]], dtype=object)
Now it is easy enough to store this data in DuckDB:
# DuckDB's Python client can scan an in-scope numpy array by variable name
# (a "replacement scan"); a 2-D array is read as one column per entry along
# its first axis, hence the transpose of the data rows.
purely_columnar_data = np.transpose(purely_columnar[1:])
conn.execute(
    """INSERT INTO columnar
    SELECT * FROM purely_columnar_data
    """
)
conn.sql("SELECT * FROM columnar")
┌─────────┬────────┬─────────┬─────────┐
│ hello │ year │ month │ value │
│ varchar │ uint32 │ varchar │ varchar │
├─────────┼────────┼─────────┼─────────┤
│ a │ 2024 │ JAN │ 0 │
│ a │ 2024 │ FEB │ 1 │
└─────────┴────────┴─────────┴─────────┘
But is there any other way in which I can insert the data into the DuckDB table in a purely columnar form, apart from brute-forcing the data into a purely columnar form first?
Note: I have tagged this question with postgresql
because DuckDB’s SQL dialect closely follows that of PostgreSQL.
2 Answers
Note: I added another row ("b", 1, 0) to make the data a bit more substantive so that it's easier to see what's happening.

Essentially what you have is a "pivoted" dataset. So UNPIVOT it:
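A minimal sketch of the idea, assuming the wide rows have first been loaded into a hypothetical table named wide (the table name is illustrative, and the extra "b" row is included):

-- Hypothetical wide table mirroring the question's layout, plus the extra row:
CREATE TABLE wide ("hello" VARCHAR, "2024 JAN" VARCHAR, "2024 FEB" VARCHAR);
INSERT INTO wide VALUES ('a', '0', '1'), ('b', '1', '0');

-- Unpivot every column except hello into (name, value) pairs:
UNPIVOT wide
ON COLUMNS(* EXCLUDE (hello))
INTO NAME ym VALUE value;

This yields one row per (hello, ym, value) triple, with ym still holding strings like '2024 JAN'.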
Following which you could use one of the string functions to parse the month further.

EDIT: I had a crack at splitting the string; it's not the most efficient, but I figured the query optimiser probably sees the code duplication and optimises it:
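A sketch of that splitting step, against the same hypothetical wide table; note that string_split() appears twice, which is the duplication referred to:

INSERT INTO columnar
SELECT
    hello,
    string_split(ym, ' ')[1]::UINTEGER AS year,  -- first part: the year
    string_split(ym, ' ')[2] AS month,           -- second part: the month name
    value
FROM (
    UNPIVOT wide
    ON COLUMNS(* EXCLUDE (hello))
    INTO NAME ym VALUE value
);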
EDIT2: Okay, EXPLAIN confirms that the duplication is optimised away.

A slight variant on the answer from @suvayu:
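A minimal sketch of the variant, again against the hypothetical wide table; the '%Y %b' format string and the COLUMNS regex are assumptions (it's worth checking that strptime() accepts the upper-case month names):

INSERT INTO columnar
SELECT
    hello,
    year(d) AS year,
    upper(strftime(d, '%b')) AS month,  -- back to the 'JAN'/'FEB' form
    value
FROM (
    SELECT hello, strptime(ym, '%Y %b') AS d, value
    FROM (
        -- COLUMNS with a regex picks up every "<year> <month>" column dynamically:
        UNPIVOT wide
        ON COLUMNS('\d{4} .*')
        INTO NAME ym VALUE value
    )
);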
The two differences are:
- strptime() to parse the date.
- a COLUMNS expression to make the query a little more dynamic (I assume the real data set has more dates in it).