This is what I am currently doing to insert data into a SQLite db from a CSV file that has ~2,000,000 rows. Is there a more memory-efficient way? When I try to insert data using knex.js inside the .on('data', ...) handler, I get all sorts of errors partway through processing.
async function insertData(filePath: string, tableName: string) {
  const data: CsvRow[] = [];
  await new Promise<void>((resolve, reject) => {
    fs.createReadStream(filePath)
      .pipe(fastcsv.parse({ headers: true }))
      .on("data", (row: CsvRow) => {
        data.push(row);
      })
      .on("end", () => {
        // Insert all buffered rows once the whole file has been read
        console.log("BATCH INSERTING DATA");
        db.batchInsert(tableName, data, 100)
          .then(() => {
            console.log("Batch Insert Complete!");
            resolve();
          })
          .catch((error) => {
            console.log("Error in batch insert: ", error);
            reject(error);
          });
      })
      .on("error", (error) => {
        console.log("Error in reading csv: ", error);
        reject(error);
      });
  });
}
I've tried calling insert inside .on('data', ...), but I keep getting errors, which I think come from concurrency issues, though I'm not sure.
2 Answers
Are you applying to some YC startup and is this a take-home coding challenge? XD I found the solution: use .pause() and .resume() on your CSV stream between each batchInsert. That way there aren't multiple async inserts firing at once and clogging up your single sqlite3 connection.
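For reference, a minimal sketch of that pause/resume batching, assuming the same fast-csv and knex setup as in the question (the db instance, CsvRow type, database filename, and the 1000-row batch size are all placeholders):

import * as fs from "fs";
import * as fastcsv from "fast-csv";
import knex from "knex";

// Placeholder knex instance and row type, mirroring the question's setup.
const db = knex({
  client: "sqlite3",
  connection: { filename: "./data.db" },
  useNullAsDefault: true,
});
type CsvRow = Record<string, string>;

async function insertData(filePath: string, tableName: string) {
  const BATCH_SIZE = 1000; // arbitrary; tune for your data

  await new Promise<void>((resolve, reject) => {
    let batch: CsvRow[] = [];

    const stream = fs
      .createReadStream(filePath)
      .pipe(fastcsv.parse({ headers: true }));

    stream
      .on("data", (row: CsvRow) => {
        batch.push(row);
        if (batch.length >= BATCH_SIZE) {
          const rows = batch;
          batch = [];
          // Stop the parser so only one insert runs at a time
          // against the single sqlite3 connection.
          stream.pause();
          db.batchInsert(tableName, rows, 100)
            .then(() => stream.resume())
            .catch(reject);
        }
      })
      .on("end", async () => {
        try {
          // Flush the final partial batch, if any, then resolve.
          if (batch.length > 0) {
            await db.batchInsert(tableName, batch, 100);
          }
          resolve();
        } catch (error) {
          reject(error);
        }
      })
      .on("error", reject);
  });
}

Because each batchInsert runs while the stream is paused, memory stays bounded at roughly one batch of rows instead of the full 2,000,000-row file.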
Your problem is a logic problem with inserting data in the 'data' event. The stream emits rows sequentially, while the database inserts take longer and need more resources, so the stream of events interrupts the in-flight inserts and causes errors. The safe solution is to put your insert code inside the 'end' event. If you prefer the first approach (inserting inside 'data'), add a counter and compare it to data.length before resolving.
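If you do want per-row inserts inside the 'data' handler, the counter idea could look roughly like this (same placeholder db and CsvRow as above; insertPerRow is a hypothetical name):

async function insertPerRow(filePath: string, tableName: string) {
  await new Promise<void>((resolve, reject) => {
    let rowsSeen = 0;
    let rowsInserted = 0;
    let streamEnded = false;

    // Resolve only once the stream has ended AND every insert has finished.
    const finishIfDone = () => {
      if (streamEnded && rowsInserted === rowsSeen) {
        resolve();
      }
    };

    fs.createReadStream(filePath)
      .pipe(fastcsv.parse({ headers: true }))
      .on("data", (row: CsvRow) => {
        rowsSeen++;
        db(tableName)
          .insert(row)
          .then(() => {
            rowsInserted++;
            finishIfDone();
          })
          .catch(reject);
      })
      .on("end", () => {
        streamEnded = true;
        finishIfDone();
      })
      .on("error", reject);
  });
}

Note that this still fires one async insert per row, so as the first answer points out it can clog a single sqlite3 connection; the paused, batched version above is usually the safer option for ~2,000,000 rows.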