I have a large dataset containing over 10 million JSON records. Could someone advise on the most efficient way to insert these into a table with a single String column? Currently I am using clickhouse_driver, and it takes approximately one minute to load 100,000 records.
current steps:
- read data from the source db (each row as JSON)
- load into a dataframe
- use insert_dataframe or the execute method to insert into the ClickHouse table (the target will just have one column holding the JSON rows)
code:
from clickhouse_driver import Client
import pandas as pd

# pass connection details as keyword arguments (the positional args after host are port, database, user, ...)
client = Client(host, user=user, password=password, database=database)

for chunk in pd.read_sql(src_sql, conn1, chunksize=300000):
    df = pd.DataFrame(chunk, columns=bq_column_names)
    df_list = df['JsonData'].tolist()
    # build the rows tuple by tuple in Python
    bulk_insert_data = [(item, 1) for item in df_list]
    client.execute(
        "INSERT INTO schema.table (JsonData, batchseq) "
        "SETTINGS async_insert=1, wait_for_async_insert=1 VALUES",
        bulk_insert_data,
    )
Answers:
async_insert is not meant for bulk inserts; it is for handling lots of small inserts. Most of the time in your code is spent just converting the pandas DataFrame into a Python list. Look at the client insert method in the clickhouse-connect docs (https://clickhouse.com/docs/en/integrations/python#client-insert-method) and the insert_df=true parameter.

You can use client.insert_df, which performs bulk insertions efficiently. A further improvement in the data prepping is to set batchseq with a vectorized assignment instead of the list comprehension; see the sketch below.
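A minimal sketch of that approach, assuming the clickhouse-connect package from the linked docs; the placeholder names host, user, password, database, src_sql, and conn1 are carried over from the question, and the constant batchseq of 1 mirrors the original tuples:

import clickhouse_connect
import pandas as pd

# assumes clickhouse-connect is installed (pip install clickhouse-connect)
client = clickhouse_connect.get_client(host=host, username=user, password=password, database=database)

for chunk in pd.read_sql(src_sql, conn1, chunksize=300000):
    # each chunk is already a DataFrame; add batchseq with one vectorized assignment
    chunk['batchseq'] = 1
    # insert the whole DataFrame in one call instead of building a list of tuples
    client.insert_df('table', chunk[['JsonData', 'batchseq']], database='schema')

With insert_df the driver handles the column conversion internally, so the per-row Python loop disappears.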