Been struggling with this one. Long story short, I need to read a file, provided by the client, at different times throughout the month. The file can be a few hundred lines or over a million lines. I've searched and nothing quite addresses the stored-procedure-call issue (I've seen suggestions to use executemany, etc.).
For each line, I need to read the delimited contents and write them to a database. I currently have ALL of it working; the issue is timing. I took this out of an interface engine because it took 1 hour to read roughly 5000 lines. I created this application in Python in the hope of significantly speeding up the process.
My ask is: what would be the best way to "multi-thread" this to increase the speed? Basically, I need to read the file and get the data inserted into the DB as fast as possible. The section marked with comments in the code below is the part I'd like to somehow multi-thread so that multiple calls execute at a time (roughly along the lines of the sketch after the function).
import os
import csv
import mysql.connector

def ReadMetaDataFile(srcFile, fileDelimiter, clientID, mysqlCnx):
    functionResponse = ''
    path, filename = os.path.split(srcFile)
    try:
        with open(srcFile, 'r') as delimitedFile:
            fileReader = csv.DictReader(delimitedFile, delimiter=fileDelimiter)
            #next(fileReader, None)
            if mysqlCnx and mysqlCnx.is_connected():
                cursor = mysqlCnx.cursor()
                for row in fileReader:
                    # --- part I'd like to multi-thread: one callproc and commit per row ---
                    rowArgs = (clientID, row['FILENAME'], '', row['DATA_FORMAT'], 1)
                    cursor.callproc('sp_InsertMetaData', rowArgs)
                    mysqlCnx.commit()
                    # --- end of the part to multi-thread ---
                cursor.close()
                mysqlCnx.close()
                functionResponse = "MetaData File Loaded Successfully"
    except mysql.connector.Error as err:
        functionResponse = "Error processing MySQL SPROC call: " + str(err)
    except IOError as err:
        functionResponse = "Error reading the source file in function: " + str(err)
    except Exception as err:
        functionResponse = "Exception Found in function: " + str(err)
    finally:
        if mysqlCnx and mysqlCnx.is_connected():
            mysqlCnx.close()
    return functionResponse
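To show the shape of what I have in mind: split the rows into chunks and let a small pool of workers call the stored procedure in parallel, each worker with its own connection. This is only a rough sketch, not code I actually have; make_connection, CHUNK_SIZE, and NUM_WORKERS are placeholders.

from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import csv
import mysql.connector

CHUNK_SIZE = 1000    # placeholder: rows handed to each worker task
NUM_WORKERS = 4      # placeholder: number of parallel connections

def make_connection():
    # placeholder: however the real connection actually gets built
    return mysql.connector.connect(host='...', user='...', password='...', database='...')

def insert_chunk(clientID, rows):
    # each worker gets its own connection and cursor
    cnx = make_connection()
    try:
        cursor = cnx.cursor()
        for row in rows:
            rowArgs = (clientID, row['FILENAME'], '', row['DATA_FORMAT'], 1)
            cursor.callproc('sp_InsertMetaData', rowArgs)
        cnx.commit()    # one commit per chunk instead of one per row
        cursor.close()
    finally:
        cnx.close()

def chunks(reader, size):
    # yield lists of up to 'size' rows from the csv reader
    it = iter(reader)
    while True:
        block = list(islice(it, size))
        if not block:
            break
        yield block

def ReadMetaDataFileThreaded(srcFile, fileDelimiter, clientID):
    with open(srcFile, 'r') as delimitedFile:
        fileReader = csv.DictReader(delimitedFile, delimiter=fileDelimiter)
        with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
            futures = [pool.submit(insert_chunk, clientID, block)
                       for block in chunks(fileReader, CHUNK_SIZE)]
            for future in futures:
                future.result()    # re-raise any exception from a worker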
UPDATE:
I've tried several different things, and at this point I'm only setting up the sproc calls (not actually executing them), yet it's taking over 20 minutes to read 20k rows from a file. With some files having millions of rows, this is simply too long. Any suggestions, please?
def ReadMetaDataFile(srcFile, fileDelimiter, clientID, engine):
    functionResponse = ''
    path, filename = os.path.split(srcFile)
    try:
        # note: 'query' is built but never used below; the loop calls the stored procedure directly
        query = 'CALL sp_InsertMetaData (' + ', '.join(['%s'] * 39) + ')'
        with open(srcFile, 'r') as delimitedFile:
            fileReader = csv.DictReader(delimitedFile, delimiter=fileDelimiter)
            conn = engine.raw_connection()
            cursor = conn.cursor()
            for row in fileReader:
                rowArgs = (
                    clientID, row['FILENAME'], '', row['DATA_FORMAT'], row['SOURCE_SYSTEM_NAME'], row['CONFIDENTIALITY_CODE'],
                    row['STATUS'], row['DOCUMENT_TYPE_SOURCE_SYSTEM'], row['DOCUMENT_TYPE_ID'], row['DOCUMENT_TYPE_DESCRIPTION'], row['DATE_OF_SERVICE'], row['DOCUMENT_ID'],
                    row['SOURCE_CREATED_DATE'], row['SOURCE_LAST_MODIFIED_DATE'], row['TIMEZONE'], row['MRNSOURCE_SYSTEM'], row['PATIENT_MRN'], row['MEMBER_NBR'],
                    row['PATIENT_LAST_NAME'], row['PATIENT_FIRST_NAME'], row['PATIENT_MIDDLE_NAME'], row['GENDER'], row['PATIENT_DATE_OF_BIRTH'], row['ENCOUNTER_SOURCE'],
                    row['ENCOUNTER_ID'], row['ENCOUNTER_TYPE'], row['ADMIT_TIME'], row['DISCHARGE_TIME'], row['FACILITY_NAME'], row['FACILITY_SOURCE_SYSTEM'],
                    row['PROVIDER_TYPE'], row['PROVIDER_SOURCE_SYSTEM'], row['PROVIDER_IDENTIFIER'], row['PROVIDER_LAST_NAME'], row['PROVIDER_FIRST_NAME'], row['PROVIDER_MIDDLE_NAME'],
                    row['PROVIDER_CREDENTIAL'], row['PROVIDER_SPECIALTY'], 1
                )
                cursor.callproc("sp_InsertMetaData", rowArgs)
                #conn.commit()
            cursor.close()
            functionResponse = "MetaData File Loaded Successfully"
    except Exception as err:
        functionResponse = "Exception Found in function: " + str(err)
    return functionResponse
2 Answers
To reproduce your issue I prepared a file with 5000 lines and tried to read it line by line. It took 0.07 s, so I'd assume that your bottleneck is inserting the data into the database. Try using a batch insert for that kind of operation; I'm pretty sure it will be enough for your task.
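For example, something along these lines (just a sketch: metadata_staging and its three columns are placeholders for whatever table you actually load into; the point is that executemany() sends many rows per statement instead of making one callproc() round-trip per row):

import csv
import mysql.connector

def load_file_batched(srcFile, fileDelimiter, clientID, cnx, batch_size=5000):
    # placeholder table and columns; adjust to the real schema
    insert_sql = ("INSERT INTO metadata_staging (client_id, filename, data_format) "
                  "VALUES (%s, %s, %s)")
    cursor = cnx.cursor()
    batch = []
    with open(srcFile, 'r') as delimitedFile:
        for row in csv.DictReader(delimitedFile, delimiter=fileDelimiter):
            batch.append((clientID, row['FILENAME'], row['DATA_FORMAT']))
            if len(batch) >= batch_size:
                cursor.executemany(insert_sql, batch)   # one statement per batch
                batch = []
        if batch:
            cursor.executemany(insert_sql, batch)
    cnx.commit()    # single commit at the end
    cursor.close()

MySQL Connector/Python rewrites executemany() on a plain INSERT ... VALUES statement into a single multi-row INSERT, so each batch costs roughly one round-trip instead of thousands.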
Your code is slow because you are calling .callproc() once for each row to insert, and each .callproc() results in two round-trips to the database (one to set the parameter values and another to actually CALL the stored procedure).

You will probably get much better performance if you upload all rows to a temporary table and then use another stored procedure to loop through the temporary table and call the existing stored procedure. After the temporary table has been populated, all subsequent processing happens on the server. If the temporary table is called tmp_tbl, the new stored procedure would look something like this:
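(A minimal sketch only: the procedure name and the two columns shown are placeholders, and the real version would declare a variable for, and pass, every one of the 39 parameters in the order sp_InsertMetaData expects, just as your Python rowArgs does.)

DELIMITER //
CREATE PROCEDURE sp_InsertMetaDataFromTmp(IN p_clientID INT)   -- placeholder name
BEGIN
    DECLARE done INT DEFAULT FALSE;
    -- placeholder variables: the real version declares one per tmp_tbl column
    DECLARE v_filename VARCHAR(255);
    DECLARE v_data_format VARCHAR(50);
    DECLARE cur CURSOR FOR SELECT filename, data_format FROM tmp_tbl;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_filename, v_data_format;
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- placeholder call: the real one passes all 39 arguments
        CALL sp_InsertMetaData(p_clientID, v_filename, '', v_data_format, 1);
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;

On the client side, tmp_tbl can be populated in bulk with executemany() or LOAD DATA LOCAL INFILE, after which a single CALL runs the whole load on the server.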