I have a .csv file that is about 5mb (~45,000 rows). What I need to do is run through each row of the file and check if the ID in each line is already in a table in my database. If it is, I can delete that row from the file.
I did a good amount of research on the most memory efficient way to do this, so I’ve been using a method of writing lines that don’t need to get deleted to a temporary file and then renaming that file as the original. Code below:
$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while (($row = fgetcsv($file)) !== FALSE) {
    // id is the 7th value in the row
    $id = $row[6];
    // check table to see if id exists
    $sql = "SELECT id FROM table WHERE id = $id";
    $result = mysqli_query($conn, $sql);
    // if id is in the database, skip to next row
    if (mysqli_num_rows($result) > 0) {
        continue;
    }
    // else write line to temp file
    fputcsv($temp, $row);
}
fclose($file);
fclose($temp);
// overwrite original file
rename($tempFilename, $filename);
Problem is, I’m running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?
2 Answers
Use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table. Your database user needs to have the FILE privilege on the database to use it.
The other option is to use your loop to insert your csv file into a separate table, and then proceed with step 2.
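A minimal sketch of what the LOAD DATA INFILE statement could look like; the staging-table name (csv_staging) and the field/line terminators are assumptions for illustration, not from the question:

```php
<?php
// Build the bulk-load statement. With LOCAL the file is read from the client
// side; without LOCAL the MySQL server reads the file directly and the
// database user needs the FILE privilege.
$filename = 'data.csv';

$sql = "LOAD DATA LOCAL INFILE '" . addslashes($filename) . "'"
     . " INTO TABLE csv_staging"
     . " FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'"
     . " LINES TERMINATED BY '\n'";

// In the real script you would run it against an open connection:
// mysqli_query($conn, $sql);
echo $sql, "\n";
```

Once the CSV is in the staging table, the filtering becomes a single set-based query instead of 45,000 round trips.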
Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do some bulk data manipulation with big queries afterwards; it's blazingly fast and I would recommend this route for files containing > 100k lines.

You fire a database query per line, i.e. 45,000 queries… that takes too much time.
Better to run one query before the loop and read the existing ids into a lookup array, then only check against this array inside the loop.
Pseudo code:
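A self-contained sketch of the lookup-array approach (the table and column names follow the question; the hardcoded ids stand in for the result of a single SELECT, as noted in the comments):

```php
<?php
// One query fetches every id up front; the CSV loop then checks a PHP array
// instead of hitting the database once per row.
//
// In the real script the ids would come from the database:
//   $result = mysqli_query($conn, "SELECT id FROM table");
//   while ($r = mysqli_fetch_row($result)) { $existingIds[] = (int)$r[0]; }
// Hardcoded here so the sketch runs on its own:
$existingIds = [3, 7, 42];

// Flip the array so the ids become keys: isset() is a hash lookup (O(1)),
// while in_array() would scan the whole array for every CSV row.
$lookup = array_flip($existingIds);

// Keep only rows whose id (7th column) is NOT already in the table.
$csvRows = [
    ['a', 'b', 'c', 'd', 'e', 'f', 7],   // id 7 exists in the db -> dropped
    ['a', 'b', 'c', 'd', 'e', 'f', 99],  // id 99 is new          -> kept
];
$kept = [];
foreach ($csvRows as $row) {
    if (isset($lookup[$row[6]])) {
        continue; // id already in the database, skip the row
    }
    $kept[] = $row; // in the real script: fputcsv($temp, $row);
}
```

This turns 45,000 queries into one query plus 45,000 in-memory array checks.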
edit:
Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?
Collect the ids from the CSV into an array. Run a single query to find all of those ids in the database and collect the matches (there can be at most as many as there are in the CSV). Now array_diff() the ids from the file with the ids from the database – the ids that remain exist in the CSV but not in the database.
Pseudo code: