(Microsoft Surface Pro 7, Windows 11 Home, Python 3.10.1, MySQL 8.0.33)
The goal:
- Load a netcdf-file with sizes up too about 400MB into the SQL-Database in under 15 minutes,
- the data in the necdf is given in the form of multiple two dimensional arrays,
- items at same position within the structure of different arrays belong together,
- the data needs to be transformed from multiple two dimensional arrays into a list of rows,
- RAM usage is as limitted as is time.
relevant Dependencies:
import mysql.connector
import os
import netCDF4 as nc
from os.path import dirname
import sys
import time
import numbers
import numpy as np
The SQL scheme was created from the following query:
USE s5p_aer;
CREATE TABLE IF NOT EXISTS `sources` (
`sID` integer NOT NULL AUTO_INCREMENT,
`fileHash` char(64) UNIQUE NOT NULL,
`fileName` varchar(150) UNIQUE NOT NULL,
`length` integer NOT NULL,
`count` integer NOT NULL,
PRIMARY KEY(`sID`)
) DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `measurements` (
`source` integer,
`coordinate` POINT NOT NULL,
`time` timestamp NOT NULL,
`cID` bigint NOT NULL,
`probability_of_ice` FLOAT,
`probability_of_water` FLOAT,
`sea_ice_fraction` FLOAT,
`sea_surface_temperature` FLOAT,
`surface_temperature` FLOAT,
`quality_level` FLOAT,
SPATIAL INDEX `SPATIAL` (`coordinate`),
FOREIGN KEY (`source`) REFERENCES sources(`sID`),
PRIMARY KEY (`time`,`cID`)
) DEFAULT CHARSET=utf8;
The query succesfully executes resulting in the correct schema within the database.
The Insert to be performed ("insertDataQuery"):
INSERT IGNORE INTO measurements(`source`,`coordinate`,`time`,`cID`,
`probability_of_ice`,`probability_of_water`,`sea_ice_fraction`,
`sea_surface_temperature`,`surface_temperature`,`quality_level`)
VALUES((SELECT sID FROM `sources`
WHERE fileHash = %s),
point(%s,%s), from_unixtime(%s),
FLOOR(%s * POWER(10,8)+ %s * POWER(10,16)),
%s5,%s6,%s7,%s8,%s9,%s10)
ON DUPLICATE KEY UPDATE cID = cID
Performing the insert manually via the MySQL Workbench with a line of data from below succeeds.
The data to be inserted is stored in a list of tuples named "dat" and result of print(dat)
is:
[
('0af547118346a9ee9e2f955a9a0365ea33bf279323b430dbc86b71fe562dc6a5', '71.453', '173.49', '1416366180', '71.453', '173.49', '-100', '-100', '96.0', '-32768.0', '-32768.0', '0'),
('0af547118346a9ee9e2f955a9a0365ea33bf279323b430dbc86b71fe562dc6a5', '71.457', '173.44402', '1416366180', '71.457', '173.44402', '-100', '-100', '96.0', '-32768.0', '-32768.0', '0'),
('0af547118346a9ee9e2f955a9a0365ea33bf279323b430dbc86b71fe562dc6a5', '71.484', '173.12201', '1416366180', '71.484', '173.12201', '-100', '-100', '96.0', '-32768.0', '-32768.0', '0'),
('0af547118346a9ee9e2f955a9a0365ea33bf279323b430dbc86b71fe562dc6a5', '71.48801', '173.07701', '1416366180', '71.48801', '173.07701', '-100', '-100', '96.0', '-32768.0', '-32768.0', '0')
]
How the data is produced:
ds = nc.Dataset(os.path.join(dirname(__file__),file_path),'r')
time = int(ds['/time'][0]) #single value
lats = np.ravel(ds['/lat']) #ds['/lat'] is a two dimensional list of floats, hence lats is a single dimensional list of floats
dim = len(lats)
times = [time] * (dim )
lons = np.ravel(ds['/lon'])
sources = [source] * (dim)
data = [sources[:10],lats[:10],lons[:10],times[:10],lats[:10],lons[:10]]
for it in cnfg["dataScheme"]:
data.append((np.ravel(ds[it["name"]]))[:10])
ds.close()
#here the variable "dat" to be inserted is created
The origin of all evil:
db = mysql.connector.connect(host=hostname, user=username, password=password, database=databaseName)
cursor = db.cursor()
cursor.executemany(insertDataQuery, dat)
db.commit()
cursor.close()
db.close()
The big, fat, ugly error:
Traceback (most recent call last):
File "C:Python310libsite-packagesmysqlconnectorconnection_cext.py", line 611, in cmd_query
self._cmysql.query(
_mysql_connector.MySQLInterfaceError: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '5,'-100'6,'96.0'7,'-32768.0'8,'-32768.0'9,'0'10),((SELECT sID FROM `sources` WHE' at line 1
Observe that the Query seems somehow distorted.
Any suggestion what I am doing wrong here?
The above queries are already verified by executing them via the MySQL Workbench.
In a previous version I created the list of tuples without NumPys transpose and ravel by itering through the two dimensional arrays in nested loops. This worked but was insufficiently performant.
(It took longer than 2 hours for the large files)
Assuming, that NumPys data types are causing the troubles, in the above script I use the .item() to extract the standard types. Previously I had: dat = [tuple(l) for l in np.transpose(data)]
I also tried the INSERT without IGNORE, but this did not change anything.
I suspect that some unwanted side-effects may be occuring between the sql connector and numpy but do not know, how to hunt down these bugs.
2
Answers
Resolution: spurious query:
The insert-query should be:
That is, the last 6 placeholders "%s,%s,%s,%s,%s,%s" have been wrong: "%s5,%s6,%s7,%s8,%s9,%s10"
(finding such trivial things always takes the longest :-/ )
Thanks for the help, despite that!
The query looks like
I would expect it to look like
(They are perhaps equivalent.)
May I suggest
1E8
instead ofPOWER(10,8)
.