
I have a .csv file that is about 5 MB (~45,000 rows). What I need to do is run through each row of the file and check whether the ID in each line is already in a table in my database. If it is, I can delete that row from the file.

I did a good amount of research on the most memory efficient way to do this, so I’ve been using a method of writing lines that don’t need to get deleted to a temporary file and then renaming that file as the original. Code below:

$file= fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while(($row = fgetcsv($file)) !== FALSE){
    // id is the 7th value in the row
    $id = $row[6];
    
    // check table to see if id exists
    $sql = "SELECT id FROM table WHERE id = $id";
    $result = mysqli_query($conn, $sql);
    
    // if id is in the database, skip to next row
    if(mysqli_num_rows($result) > 0){
        continue;
    }
    
    // else write line to temp file
    fputcsv($temp, $row);
}

fclose($file);
fclose($temp);

// overwrite original file
rename($tempFilename, $filename);

Problem is, I’m running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?

2 Answers


    1. I would use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table.
       Your database user needs the FILE privilege on the database to use it.
    2. Then you can run one query to delete the rows whose ids already exist (DELETE … JOIN …).
    3. And export the rows that were left intact (a rough sketch of all three steps follows below).

    Another option is to use your loop to insert your csv file into a separate table, and then proceed with step 2.

    Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do some bulk data manipulation with big queries; it's blazingly fast, and I would recommend this route for files containing > 100k lines.
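
    A rough sketch of that route, assuming a staging table named csv_import whose seventh column is id, a main table named main_table (both placeholder names – use your own schema), a mysqli connection $conn, and that LOCAL INFILE is allowed on both the client and the server:

    // enable LOCAL INFILE on the client side (the server must allow it as well)
    mysqli_options($conn, MYSQLI_OPT_LOCAL_INFILE, true);

    // 1. bulk-load the csv into the staging table
    mysqli_query($conn, "
        LOAD DATA LOCAL INFILE '/path/to/file.csv'
        INTO TABLE csv_import
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
    ");

    // 2. delete staged rows whose id already exists in the main table
    mysqli_query($conn, "
        DELETE csv_import FROM csv_import
        JOIN main_table ON main_table.id = csv_import.id
    ");

    // 3. export whatever is left (needs the FILE privilege; the output file
    //    is written on the database server, not the web server)
    mysqli_query($conn, "
        SELECT * FROM csv_import
        INTO OUTFILE '/tmp/filtered.csv'
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
    ");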

    You fire a database query per line, i.e. 45,000 queries… that takes far too much time.

    Better to run one query before the loop that reads the existing ids into a lookup array, then only check that array inside the loop.

    Pseudo code:

    $st = query('SELECT id FROM table');
    while ($row = $st->fetch()) {
        $lookup[ $row['id'] ] = $row['id'];
    }
    
    // now read CSV
    while($row = fgetcsv($h)) {
        $id = $row[6];
    
        if (isset($lookup[ $id ])) {
            // exist...
            continue;
        }
    
        // write the non-existing id to different file...
    }
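
    The same idea written out against the question's own code – a sketch that assumes the mysqli connection $conn, the $filename/$tempFilename variables and the table name from the question, and that ~45,000 ids fit comfortably in memory:

    // build a lookup set of every id that already exists in the database
    $lookup = [];
    $result = mysqli_query($conn, 'SELECT id FROM table');
    while ($dbRow = mysqli_fetch_assoc($result)) {
        $lookup[$dbRow['id']] = true; // the value doesn't matter, only the key
    }

    $file = fopen($filename, 'r');
    $temp = fopen($tempFilename, 'w');

    while (($row = fgetcsv($file)) !== FALSE) {
        // skip rows whose id (7th column) already exists in the database
        if (isset($lookup[$row[6]])) {
            continue;
        }
        fputcsv($temp, $row);
    }

    fclose($file);
    fclose($temp);
    rename($tempFilename, $filename);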
    

    Edit:
    Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?

    Collect the ids from the CSV into an array. Run a single query to find which of those ids exist in the database and collect them (the result can be at most as large as the CSV). Then array_diff() the ids from the file against the ids from the database – the remaining ids exist in the CSV but not in the database.

    Pseudo code:

    $ids_csv = [];
    while($row = fgetcsv($h)) {
        $id = $row[6];
        $ids_csv[] = intval($id);
    }
    
    $sql = sprintf('SELECT id FROM table WHERE id IN(%s)', implode(',', $ids_csv));
    
    $ids_db = [];
    $st = query($sql);
    while ($row = $st->fetch()) {
        $ids_db[] = $row['id'];
    }
    
    $missing_in_db = array_diff($ids_csv, $ids_db);
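
    To finish that approach a second pass over the csv is still needed to write out only the missing rows; a sketch assuming the same $conn, $filename, $tempFilename and table name as in the question, with the lookup chunked so the IN() list never grows too large:

    // first pass: collect all ids from the csv
    $ids_csv = [];
    $file = fopen($filename, 'r');
    while (($row = fgetcsv($file)) !== FALSE) {
        $ids_csv[] = intval($row[6]);
    }
    fclose($file);

    // look the ids up in chunks so the IN() list stays a manageable size
    $ids_db = [];
    foreach (array_chunk($ids_csv, 1000) as $chunk) {
        $sql = sprintf('SELECT id FROM table WHERE id IN(%s)', implode(',', $chunk));
        $result = mysqli_query($conn, $sql);
        while ($dbRow = mysqli_fetch_assoc($result)) {
            $ids_db[] = intval($dbRow['id']);
        }
    }

    // ids that are in the csv but not in the database – these rows are kept
    $keep = array_flip(array_diff($ids_csv, $ids_db));

    // second pass: rewrite the csv with only the kept rows
    $file = fopen($filename, 'r');
    $temp = fopen($tempFilename, 'w');
    while (($row = fgetcsv($file)) !== FALSE) {
        if (isset($keep[intval($row[6])])) {
            fputcsv($temp, $row);
        }
    }
    fclose($file);
    fclose($temp);
    rename($tempFilename, $filename);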
    