
I have a huge 23 GB CSV file I am trying to load, and I've found that instead of converting it to INSERT statements, which takes a while, it's possible to load it directly into the DB.
So I tried the following syntax:

LOAD DATA LOCAL INFILE client_report.csv into table client_report fields terminated by ',' optionally enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

mysql> LOAD DATA LOCAL INFILE client_report.csv into table
client_report fields terminated by ',' optionally enclosed by '"' lines
terminated by '\r\n' ignore 1 lines;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual
that corresponds to your MySQL server version for the right syntax to use
near 'client_report.csv into table client_report fields terminated
by ',' opti' at line 1

I am at a loss; I seem to be following the documentation to the letter, and I checked
SHOW GLOBAL VARIABLES LIKE 'local_infile'; it's ON.
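
For reference, LOCAL loading has to be allowed on the client as well as the server; the check and (if needed) the fix look roughly like this, assuming the stock mysql command-line client (someuser and somedatabase below are placeholders):

SHOW GLOBAL VARIABLES LIKE 'local_infile';  -- server side; should report ON
SET GLOBAL local_infile = 1;                -- turn it on if it is OFF (needs admin privileges)
-- the client must allow LOCAL as well, e.g.:
-- mysql --local-infile=1 -u someuser -p somedatabase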

2 Answers


  1. Chosen as BEST ANSWER

    The final solution that worked best is two steps:

    1. Split the file into chunks of 10M rows each using split -l 10000000 file.csv, which will generate files named xaa, xab, xac, ...
    2. Load each chunk using the query below (note that the filename is in quotes):

    LOAD DATA LOCAL INFILE 'xaa' into table table_name fields terminated by ',' optionally enclosed by '"' lines terminated by '\r\n' ignore 1 lines (field1,field2,field3,field4,fieldN) SET ID = NULL;

    Note that if you don't explicitly list the fields and set ID = NULL, it will try to insert the first CSV column into the auto-increment ID column, shifting all content by one column and leaving the last one empty.

    Also note that starting from file xab you need to remove the ignore 1 lines part; only xaa contains the header, so keeping it would skip the first data row of every other file.
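
    For example, keeping the placeholder names table_name and field1 ... fieldN from above, the first chunk and the remaining chunks would be loaded roughly like this:

    LOAD DATA LOCAL INFILE 'xaa' into table table_name
        fields terminated by ',' optionally enclosed by '"'
        lines terminated by '\r\n'
        ignore 1 lines
        (field1,field2,field3,field4,fieldN)
        SET ID = NULL;

    -- xab, xac, ... carry no header row, so the ignore 1 lines clause is dropped
    LOAD DATA LOCAL INFILE 'xab' into table table_name
        fields terminated by ',' optionally enclosed by '"'
        lines terminated by '\r\n'
        (field1,field2,field3,field4,fieldN)
        SET ID = NULL;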


  2. The only way I know how to do this is using Python. This example uses SQLite, but the concept is the same.

    It will murder your RAM, but with a large enough pagefile you should be fine (a chunked variant that avoids loading the whole file at once is sketched after the code). It will also take some time (up to a few hours).

    import pandas
    import sqlite3
    import time

    start_time = time.process_time()  # CPU time, not wall-clock time

    db = sqlite3.connect("-database.db")

    def import_csv():
        # read_csv loads the whole file into memory; to_sql then writes it
        # out in batches of 100,000 rows
        df = pandas.read_csv('csv_file.csv')
        df.to_sql('test', con=db, if_exists='append', chunksize=100000)

    # When this condition is true the CSV is imported; flip it to skip the
    # import and run only the SELECT below. SQLite auto-creates the database
    # file and the table, with an index column added by to_sql.
    if 1 == 1:
        import_csv()

    # print(pandas.read_sql_query('select * from test limit 100000000', db))

    end_time = time.process_time()

    print(end_time - start_time)
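
    As a rough lower-memory sketch (keeping the csv_file.csv, -database.db, and test names above, which are just this answer's example names), pandas can also stream the file in chunks so only one chunk sits in RAM at a time:

    import pandas
    import sqlite3

    db = sqlite3.connect("-database.db")

    # Stream the CSV in 1,000,000-row chunks and append each chunk to the
    # table as it is read; index=False skips pandas' row-number column.
    for chunk in pandas.read_csv('csv_file.csv', chunksize=1000000):
        chunk.to_sql('test', con=db, if_exists='append', index=False)

    db.close()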
    