
Let's assume I have two Parquet files, main and updates, with some data as shown below. I need to implement an upsert-style operation in DuckDB.

parquet file main: id, name, city

data:
   id     name     city 
    1      a         p
    2      b         q
    3      c         r
    

parquet file updates: id, name, city

data:
   id     name     city
    1      a         m
    4      b         q

Desired output:

   id     name     city
    1      a         m  <----update city
    2      b         q
    3      c         r
    4      b         q  <----insert

Currently I am doing this using the queries below:

CREATE TABLE main AS SELECT * FROM '/tmp/main.parquet';
CREATE TABLE stage AS SELECT * FROM '/tmp/updates.parquet';
DELETE FROM main USING stage WHERE main.id = stage.id;
INSERT INTO main SELECT * FROM stage;
COPY main TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

The only problem is that CREATE TABLE loads and keeps all the data in memory, which I don't want, as the main file may contain 8-10 million records. Is there any way I can achieve the same result using joins only and avoid creating the main and stage tables?
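
As a side note, if the table-based approach is otherwise acceptable: DuckDB can cap its memory use and spill larger-than-memory operators to disk. A minimal sketch, assuming a recent DuckDB version (the 2GB limit and spill path are placeholder values, not recommendations):

SET memory_limit = '2GB';                   -- cap how much memory DuckDB may use
SET temp_directory = '/tmp/duckdb_spill';   -- where larger-than-memory operators spill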


Answers


  1. Chosen as BEST ANSWER

    Here is the query. One point here is that it consumes less memory and runs faster than creating DuckDB tables as in the original question, but it uses more CPU cores.

    COPY (
        SELECT
            CASE WHEN a.id IS NULL THEN b.id ELSE a.id END AS id,
            CASE WHEN b.id IS NOT NULL THEN b.name ELSE a.name END AS name,
            CASE WHEN b.id IS NOT NULL THEN b.city ELSE a.city END AS city
        FROM (SELECT * FROM 'main.parquet') a
        FULL OUTER JOIN (SELECT * FROM 'updates.parquet') b ON a.id = b.id
    )
    TO '/tmp/output.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
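
    A possible simplification (a sketch under the same assumptions, not benchmarked): since a row from updates should win whenever the ids match, the full outer join can be replaced with the updates rows plus the main rows that have no match:

    COPY (
        SELECT * FROM 'updates.parquet'
        UNION ALL
        SELECT m.*
        FROM 'main.parquet' m
        WHERE NOT EXISTS (SELECT 1 FROM 'updates.parquet' u WHERE u.id = m.id)
    )
    TO '/tmp/output.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

    Both files must have the same column order for SELECT * to line up here.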
    
