
I’m working on a PHP script that processes large CSV files (up to 1GB in size) by reading the file, performing some transformations on each row, and then saving the results to a database. My current approach involves reading the entire file into memory using file_get_contents() and then splitting it by line, but this often results in memory exhaustion errors.

I’ve considered using fgetcsv() within a loop to process the file line by line, but I’m concerned about the performance and want to ensure that the script runs efficiently even with very large files. Additionally, I need to handle potential issues like malformed lines or different CSV formats (e.g., varying delimiters).

What are some best practices or patterns for handling large CSV files in PHP while minimizing memory usage and maintaining good performance? Are there any built-in functions, libraries, or techniques that can help optimize this process?

What I Tried:

Reading the Entire File with file_get_contents():
I initially tried using file_get_contents() to load the entire CSV file into memory and then split it into rows using explode("\n", $data). This approach worked for smaller files but quickly led to memory exhaustion errors when dealing with larger files (over 100MB).

Using fgetcsv() in a Loop:
I switched to using fgetcsv() within a while loop to process the file line by line. This method significantly reduced memory usage and allowed me to handle larger files without crashing. However, I noticed a performance drop, especially when processing files with a large number of rows. Additionally, handling different CSV formats (like varying delimiters and escape characters) became cumbersome.

Adjusting PHP’s Memory Limit:
I tried increasing the PHP memory limit using ini_set('memory_limit', '512M'), which helped to an extent, but it felt more like a temporary fix rather than an efficient solution. I also realized that this approach wouldn’t be scalable for even larger files.

What I Expected:

Efficient Memory Usage:
I was hoping to find a method that would allow me to process very large CSV files without exhausting the server’s memory, while still maintaining reasonable processing speed.

Improved Performance:
I expected that processing the file line by line with fgetcsv() would be faster than it turned out to be, especially for files with a high number of rows.

Scalability:
I’m looking for a solution that can scale with file size, allowing the script to handle files in the GB range without requiring excessive memory or processing time.

I’m open to suggestions on how to improve the current approach or if there are alternative methods or libraries that can help me achieve better performance and memory efficiency.

2 Answers


  1. Best Approach to Handling Large CSV Files

    1. Split the CSV File

    • Divide the 1GB CSV file into smaller, more manageable files, e.g. chunks of roughly 100,000 records each.

    2. Batch Insert

    • Improve database performance by inserting records in batches; 5,000 to 10,000 records per batch is a reasonable starting point, tuned to your system (a minimal sketch follows this list).

    3. Background Process

    • Keep the main application from blocking by running the import asynchronously, via background jobs or a task queue.
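
    To make the batch-insert idea in point 2 concrete, here is a minimal sketch, assuming a PDO/MySQL connection, a well-formed three-column CSV, and a hypothetical import_rows(col_a, col_b, col_c) table; the file is streamed with fgetcsv() and flushed to the database every 5,000 rows so memory stays flat:

    // Hypothetical helper: one multi-row INSERT per batch.
    // Note: very large batches can hit the database's packet/placeholder limits.
    function insertBatch(PDO $pdo, array $rows): void
    {
        $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, ?)'));
        $stmt = $pdo->prepare("INSERT INTO import_rows (col_a, col_b, col_c) VALUES $placeholders");
        $stmt->execute(array_merge(...$rows));
    }

    $pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass'); // placeholder credentials
    $batchSize = 5000;
    $batch = [];

    $fp = fopen('/tmp/foo.csv', 'r');
    while (($row = fgetcsv($fp)) !== false) {
        if (count($row) !== 3) {
            continue; // skip blank or malformed lines in this sketch
        }
        $batch[] = $row;
        if (count($batch) >= $batchSize) {
            insertBatch($pdo, $batch); // flush a full batch
            $batch = [];
        }
    }
    if ($batch) {
        insertBatch($pdo, $batch); // flush the remainder
    }
    fclose($fp);
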
  2. fgetcsv() in a loop is the superior mechanism, and will yield the most resilient and scalable process. If your code is OO, then I would also recommend implementing it as a Generator.
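
    A minimal sketch of that Generator approach, using a hypothetical readCsv() helper; the delimiter parameter also covers the varying-delimiters concern from the question:

    // Yields one parsed row at a time, so only a single row is ever held in memory.
    function readCsv(string $path, string $delimiter = ','): Generator
    {
        $fp = fopen($path, 'r');
        if ($fp === false) {
            throw new RuntimeException("Unable to open $path");
        }
        try {
            while (($row = fgetcsv($fp, 0, $delimiter)) !== false) {
                yield $row;
            }
        } finally {
            fclose($fp); // runs even if the caller stops iterating early
        }
    }

    foreach (readCsv('/tmp/foo.csv') as $row) {
        // transform and persist $row
    }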

    That said, your performance issues are not coming from processing one line at a time. Your performance issue is a result of now doing proper CSV parsing, which fgetcsv() does and explode() does not.

    For example, on my dev box:

    Read the whole file into memory

    $file = file_get_contents('/tmp/foo.csv');
    
    Time: 0.495s
    Memory: 1007.6Mb
    

    Read the whole file into memory, explode() to lines

    $lines = explode("\n", file_get_contents('/tmp/foo.csv'));
    
    Time: 0.870s
    Memory: 2292.3Mb
    

    This roughly doubles both the time and the memory requirement, which is not very scalable. Let’s read the file one line at a time:

    Read one line at a time

    $fp = fopen('/tmp/foo.csv', 'r');
    while ($line = fgets($fp)) {
    }
    fclose($fp);
    
    Time: 0.764s
    Memory: 0.5Mb
    

    Huge difference in memory consumption, and it’s slightly faster than splitting lines with explode(). Still, that speed difference is marginal and not worth your time to address.

    However, let’s now consider the additional overhead of CSV processing. You’re currently doing this (ostensibly) via explode() on a comma. This leaves you vulnerable to breakages as a result of escaping and quoting, but it’s pretty fast:

    Read the whole file, explode() to lines, explode() to fields

    foreach (explode("\n", file_get_contents('/tmp/foo.csv')) as $line) {
        $row = explode(",", $line);
    }
    
    Time: 2.128s
    Memory: 2292.3Mb
    

    Let’s now do the same thing line by line; it should be faster:

    Read one line at a time, explode() to fields

    $fp = fopen('/tmp/foo.csv', 'r');
    while ($line = fgets($fp)) {
        $row = explode(",", $line);
    }
    fclose($fp);
    
    Time: 2.036s
    Memory: 0.5Mb
    

    This is fantastic, and a huge improvement over what you currently have. However, it’s not resilient to quoted fields that contain delimiters or newlines. For example, this code will break if your CSV file contains:

    foo,bar,"this is anmulti-linencomment, which also contains a comma"
    

    So let’s add some proper CSV parsing:

    Read one row at a time, parsed natively by fgetcsv()

    $fp = fopen('/tmp/foo.csv', 'r');
    while ($row = fgetcsv($fp)) {
    }
    fclose($fp);
    
    Time: 44.390s
    Memory: 0.5Mb
    

    And there’s your performance hit. It’ll be much the same if we read each line with fgets() and then parse it explicitly with str_getcsv():

    Read one line at a time, parsed explicitly with str_getcsv()

    $fp = fopen('/tmp/foo.csv', 'r');
    while ($line = fgets($fp)) {
        $row = str_getcsv($line);
    }
    fclose($fp);
    
    Time: 44.695s
    Memory: 0.5Mb
    

    So, your slow timing is coming from the CSV parsing, and not the line-by-line read. If you are absolutely certain that your files will never contain escaped or quoted characters, then you can switch to a line-by-line read and continue to use explode() to get fields. You’ll improve speed slightly and dramatically improve memory usage. At the very least, however, you should count the number of elements in each row to ensure it matches what you’re expecting.
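
    If you do take the explode() route, here is a minimal sketch of that column-count check, assuming a hypothetical five-column format:

    $expectedColumns = 5; // whatever your format defines

    $fp = fopen('/tmp/foo.csv', 'r');
    while (($line = fgets($fp)) !== false) {
        $row = explode(',', rtrim($line, "\r\n"));
        if (count($row) !== $expectedColumns) {
            // Malformed line: log it and skip rather than inserting bad data.
            error_log('Skipping malformed CSV line: ' . $line);
            continue;
        }
        // ... transform and persist $row
    }
    fclose($fp);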
