Assume I have two Parquet files, main and updates, with data as shown below. I need to implement an upsert-style operation in DuckDB.
Parquet file main (columns: id, name, city):

id name city
1 a p
2 b q
3 c r
Parquet file updates (columns: id, name, city):

id name city
1 a m
4 b q
Desired output:
id name city
1 a m <----update city
2 b q
3 c r
4 b q <----insert
Currently I am doing this with the queries below:
create table main as select * from '/tmp/main.parquet';
create table stage as select * from '/tmp/updates.parquet';
delete from main using stage where main.id=stage.id;
insert into main select * from stage;
COPY main TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
The problem is that CREATE TABLE loads and keeps all the data in memory, which I want to avoid since the main file may contain 8–10 million records. Is there a way to achieve the same result using joins only, reading the Parquet files directly and avoiding the main and stage tables?
2 Answers
Here is the query. One point: this will consume less memory and run faster than creating DuckDB tables as in the original question, but it will use more CPU cores.
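A sketch of the join-only approach this answer describes, querying the Parquet files directly with no intermediate tables (file paths are the ones from the question; the exact query in the original answer is not shown, so this is an assumed reconstruction):

```sql
COPY (
    -- Keep rows from main whose id does NOT appear in updates
    SELECT m.*
    FROM '/tmp/main.parquet' AS m
    WHERE NOT EXISTS (
        SELECT 1 FROM '/tmp/updates.parquet' AS u WHERE u.id = m.id
    )
    UNION ALL
    -- Then take every row from updates (both replacements and new ids)
    SELECT * FROM '/tmp/updates.parquet'
) TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
```

Because the Parquet files are scanned inside a single COPY statement, nothing is materialized as a persistent table; the anti-join plus UNION ALL reproduces the delete-then-insert semantics of the original queries.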
DuckDB has upsert support, see the docs: https://duckdb.org/docs/sql/statements/insert#on-conflict-clause
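For completeness, a minimal sketch of that ON CONFLICT form, assuming a primary key on id (ON CONFLICT requires a unique or primary-key constraint, so the table must be created with an explicit schema rather than CREATE TABLE AS; column types are assumed from the sample data):

```sql
-- Explicit schema so id carries a PRIMARY KEY constraint
CREATE TABLE main (id INTEGER PRIMARY KEY, name VARCHAR, city VARCHAR);
INSERT INTO main SELECT * FROM '/tmp/main.parquet';

-- Upsert: insert new ids, update existing ones from the staged rows
INSERT INTO main
SELECT * FROM '/tmp/updates.parquet'
ON CONFLICT (id) DO UPDATE SET
    name = EXCLUDED.name,
    city = EXCLUDED.city;

COPY main TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
```

Note that this still materializes main as a table, so it does not address the memory concern in the question; it is the idiomatic route when a table with a key constraint is acceptable.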