I have a large dataset containing over 10 million JSON records. Could someone advise on the most efficient way to insert these into a table with a single String column? Currently I am using clickhouse_driver, and it takes approximately one minute to load 100,000 records.
current steps:
- read data from the source db (each row as JSON)
- load into a dataframe
- use insert_dataframe or the execute method to insert into the ClickHouse table (the target will just have one column holding the JSON rows)
code:
from clickhouse_driver import Client
import pandas as pd

# pass connection details as keyword arguments (the positional args after host are port, database, user, ...)
client = Client(host, user=user, password=password, database=database)

for chunk in pd.read_sql(src_sql, conn1, chunksize=300000):
    df = pd.DataFrame(chunk, columns=bq_column_names)
    df_list = df['JsonData'].tolist()
    # build the rows tuple by tuple in Python
    bulk_insert_data = [(item, 1) for item in df_list]
    client.execute(
        "INSERT INTO schema.table (JsonData, batchseq) "
        "SETTINGS async_insert=1, wait_for_async_insert=1 VALUES",
        bulk_insert_data,
    )
Answers:
async_insert is not meant for bulk inserts; it is for handling lots of small inserts. Most of the time in your code is spent just converting the pandas DataFrame into a Python list. Look at the client insert method in the clickhouse-connect docs (https://clickhouse.com/docs/en/integrations/python#client-insert-method) and the insert_df=true parameter.

You can use client.insert_df, which performs bulk insertions efficiently. A further improvement in the data prepping is to set batchseq with a vectorized assignment instead of the list comprehension; see the sketch below.
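A minimal sketch of that approach, assuming the clickhouse-connect package from the linked docs; the placeholder names host, user, password, database, src_sql, and conn1 are carried over from the question, and the constant batchseq of 1 mirrors the original tuples:

import clickhouse_connect
import pandas as pd

# assumes clickhouse-connect is installed (pip install clickhouse-connect)
client = clickhouse_connect.get_client(host=host, username=user, password=password, database=database)

for chunk in pd.read_sql(src_sql, conn1, chunksize=300000):
    # each chunk is already a DataFrame; add batchseq with one vectorized assignment
    chunk['batchseq'] = 1
    # insert the whole DataFrame in one call instead of building a list of tuples
    client.insert_df('table', chunk[['JsonData', 'batchseq']], database='schema')

With insert_df the driver handles the column conversion internally, so the per-row Python loop disappears.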