skip to Main Content

I have some partially columnar data like this:

"hello", "2024 JAN", "2024 FEB"
"a", 0, 1

If it were purely columnar, it would look like:

"hello", "year", "month", "value"
"a", 2024, "JAN", 0
"a", 2024, "FEB", 1

Suppose the data is in the form of a numpy array, like this:

import numpy as np

data = np.array([["hello", "2024 JAN", "2024 FEB"], ["a", "0", "1"]], dtype="<U")
data
array([['hello', '2024 JAN', '2024 FEB'],
       ['a', '0', '1']], dtype='<U8')

Imagine also that I created a table:

import duckdb as ddb

conn = ddb.connect("hello.db")
conn.execute("CREATE TABLE columnar (hello VARCHAR, year UINTEGER, month VARCHAR, value VARCHAR);")

How could I go about efficiently inserting data into the DuckDB table columnar?

The naive/easy way would be to brute-force transform the data into a columnar format in-memory, in Python, before inserting it into the DuckDB table.

Here I mean specifically:

import re

data_header = data[0]
data_proper = data[1:]

date_pattern = re.compile(r"(?P<year>[d]+) (?P<month>JAN|FEB)")
common_labels: list[str] = []
known_years: set[int] = set()
known_months: set[str] = set()
header_to_date: Dict[str, tuple[int, str]] = dict()
for header in data_header:
    if matches := date_pattern.match(header):
        year, month = int(matches["year"]), str(matches["month"])
        known_years.add(year)
        known_months.add(month)
        header_to_date[header] = (year, month)
    else:
        common_labels.append(header)

# hello, year, month, value
new_rows_per_old_row = len(known_years) * len(known_months)
new_headers = ["year", "month", "value"]
purely_columnar = np.empty(
    (
        1 + data_proper.shape[0] * new_rows_per_old_row,
        len(common_labels) + len(new_headers),
    ),
    dtype=np.object_,
)
purely_columnar[0] = common_labels + ["year", "month", "value"]
for rx, row in enumerate(data_proper):
    common_data = []
    ym_data = []
    for header, element in zip(data_header, row):
        if header in common_labels:
            common_data.append(element)
        else:
            year, month = header_to_date[header]
            ym_data.append([year, month, element])

    for yx, year_month_value in enumerate(ym_data):
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, : len(common_labels)
        ] = common_data
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, len(common_labels) :
        ] = year_month_value

print(f"{purely_columnar=}")
purely_columnar=
array([[np.str_('hello'), 'year', 'month', 'value'],
       [np.str_('a'), 2024, 'JAN', np.str_('0')],
       [np.str_('a'), 2024, 'FEB', np.str_('1')]], dtype=object)

Now it is easy enough to store this data in DuckDB:

purely_columnar_data = np.transpose(purely_columnar[1:])
conn.execute(
    """INSERT INTO columnar
    SELECT * FROM purely_columnar_data
  """
)
conn.sql("SELECT * FROM columnar")
┌─────────┬────────┬─────────┬─────────┐
│  hello  │  year  │  month  │  value  │
│ varchar │ uint32 │ varchar │ varchar │
├─────────┼────────┼─────────┼─────────┤
│ a       │   2024 │ JAN     │ 0       │
│ a       │   2024 │ FEB     │ 1       │
└─────────┴────────┴─────────┴─────────┘

But is there any other way in which I can insert the data into a DuckDB in a purely columnar form, apart from brute-forcing the data into a purely columnar form first?

Note: I have tagged this question with postgresql because DuckDB’s SQL dialect closely follows that of PostgreSQL.

2

Answers


  1. Note: I added another row ("b",1,0) to make the data a bit more substantive so that it’s easier to see what’s happening.

    Essentially what you have is a "pivoted" dataset:

    D SELECT * FROM 'pivoted-data.csv';
    ┌─────────┬──────────┬──────────┐
    │  hello  │ 2024 JAN │ 2024 FEB │
    │ varchar │  int64   │  int64   │
    ├─────────┼──────────┼──────────┤
    │ a       │        0 │        1 │
    │ b       │        1 │        0 │
    └─────────┴──────────┴──────────┘
    

    So UNPIVOT it:

    D SELECT hello, string_split(month, ' ')[1]::INTEGER AS year, string_split(month, ' ')[2] AS month, value
      FROM (UNPIVOT 'pivoted-data.csv' ON '2024 JAN', '2024 FEB' INTO NAME month VALUE value);
    ┌─────────┬───────┬─────────┬───────┐
    │  hello  │ year  │  month  │ value │
    │ varchar │ int32 │ varchar │ int64 │
    ├─────────┼───────┼─────────┼───────┤
    │ a       │  2024 │ JAN     │     0 │
    │ a       │  2024 │ FEB     │     1 │
    │ b       │  2024 │ JAN     │     1 │
    │ b       │  2024 │ FEB     │     0 │
    └─────────┴───────┴─────────┴───────┘
    

    Following which you could use one of the string functions to parse the month further.

    EDIT: I had a crack at splitting the string, it’s not the most efficient, but I figured the query optimiser probably sees the code duplication, and optimises it.

    EDIT2: Okay, EXPLAIN confirms that the duplication is optimised away:

    ┌─────────────────────────────┐
    │┌───────────────────────────┐│
    ││       Physical Plan       ││
    │└───────────────────────────┘│
    └─────────────────────────────┘
    ┌───────────────────────────┐
    │         PROJECTION        │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │           hello           │
    │            year           │
    │           month           │
    │           value           │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         PROJECTION        │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │           hello           │
    │  string_split(month, ' ') │
    │           value           │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         PROJECTION        │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │             #0            │
    │             #3            │
    │             #4            │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │           FILTER          │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │    (value IS NOT NULL)    │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │           EC: 1           │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │           UNNEST          │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │       READ_CSV_AUTO       │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │           hello           │
    │          2024 JAN         │
    │          2024 FEB         │
    │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    │           EC: 3           │
    └───────────────────────────┘
    
    Login or Signup to reply.
  2. A slight variant on the answer from @suvayu:

    SELECT hello, CAST(strptime(month, '%Y %b') AS DATE) AS month, value
    FROM (
      UNPIVOT 'pivoted-data.csv'
      ON COLUMNS(* EXCLUDE (hello)) 
      INTO NAME month VALUE value);
    

    The two differences are:

    1. Use of strptime() to parse the date.
    2. Use of a COLUMNS expression to make the query a little more dynamic (I assume the real data set has more dates in it).
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search