
I have a large dataset containing over 10 million JSON records. Could someone advise on the most efficient way to insert these into a table with a single String column? Currently I am using clickhouse_driver, but it takes approximately one minute to load 100,000 records.

Current steps:

  1. read data from the source database (each row as JSON)
  2. load it into a pandas DataFrame
  3. use insert_dataframe or the execute method to insert into the ClickHouse table (the target will just have one column holding the JSON rows)

Code:

import pandas as pd
from clickhouse_driver import Client

client = Client(host=host, user=user, password=password, database=database)

for chunk in pd.read_sql(src_sql, conn1, chunksize=300000):
    # each chunk is already a DataFrame; select/order the expected columns
    df = pd.DataFrame(chunk, columns=bq_column_names)
    df_list = df['JsonData'].tolist()

    # build (JsonData, batchseq) row tuples
    bulk_insert_data = [(item, 1) for item in df_list]

    client.execute(
        "INSERT INTO schema.table (JsonData, batchseq) SETTINGS async_insert=1, wait_for_async_insert=1 VALUES",
        bulk_insert_data)

2 Answers


  1. async_insert is not for bulk inserts; it is meant for a lot of small inserts.

    Most of the time, your code is just converting the pandas DataFrame into a list.

    Look at the insert method and the insert_df option here:
    https://clickhouse.com/docs/en/integrations/python#client-insert-method
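
    For example, a minimal sketch of a DataFrame insert with the clickhouse-connect client documented at that link (the connection details and sample rows below are placeholders, not values from the question):

    import clickhouse_connect
    import pandas as pd

    # placeholder connection details
    client = clickhouse_connect.get_client(
        host='localhost', username='default', password='', database='schema')

    # the DataFrame is passed as-is, so there is no row-by-row conversion in Python
    df = pd.DataFrame({'JsonData': ['{"id": 1}', '{"id": 2}'],
                       'batchseq': [1, 1]})
    client.insert_df('table', df)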

  2. You can use client.insert_df, which performs bulk inserts efficiently.

    An improvement in the data prepping is to set batchseq with a vectorized assignment instead of building a list of tuples:

    prepped_df = df[['JsonData']].copy()   # keep a DataFrame (not a Series) so a new column can be added
    prepped_df['batchseq'] = 1             # vectorized assignment

    client.insert_df(
        table='table',
        df=prepped_df,
        database='schema',
        column_names=['JsonData', 'batchseq'])
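
    Putting this together with the chunked read from the question (src_sql, conn1, the chunk size and the schema/table/column names come from the original post; client is assumed to be a clickhouse-connect client), the loop might look roughly like this:

    import pandas as pd

    for chunk in pd.read_sql(src_sql, conn1, chunksize=300000):
        prepped_df = chunk[['JsonData']].copy()
        prepped_df['batchseq'] = 1   # vectorized, no Python-level tuple loop
        client.insert_df(
            table='table',
            df=prepped_df,
            database='schema',
            column_names=['JsonData', 'batchseq'])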