I have created this PHP script, which takes a very long time to filter the CSV file into 19 smaller ones. The link to the CSVs is in this Google Drive folder:
https://drive.google.com/drive/folders/1Bju4kkVgo21xu3IFeyKWNJ1uClTn534J?usp=share_link
I have the script processing the file line by line to save memory, but it still exceeds the maximum execution time allowed for the PHP script. Is there a better way to achieve this file breakdown?
<?php
@date_default_timezone_set("GMT");
ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// Create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$handle = fopen($inputFile, 'r');
if ($handle) {
    $headerRow = null;
    $siteIDs = array();
    while (($data = fgets($handle)) !== false) {
        $row = str_getcsv($data, ";");
        if ($headerRow === null) {
            $headerRow = $row;
            continue; // Skip processing the header row
        }
        $siteID = $row[array_search('SiteID', $headerRow)];
        // First time this SiteID appears: create its file and write the header
        if (!in_array($siteID, $siteIDs)) {
            $siteIDs[] = $siteID;
            $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
            $f = fopen($newCSVFilename, 'w');
            fputs($f, implode(';', $headerRow) . PHP_EOL);
            fclose($f);
        }
        // Re-open the file in append mode for every single row
        $currentCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $f = fopen($currentCSVFilename, 'a');
        fputs($f, implode(';', $row) . PHP_EOL);
        fclose($f);
    }
    fclose($handle);
}

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';
?>
It took a fair amount of time just to get the file processing working at all. I was going to switch to fgetcsv(), but my lecturer told me that method is actually slower?
In reply to Sammath: would something like this be more in line with what's necessary?
@date_default_timezone_set("GMT");
ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// Create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$source = fopen($inputFile, 'r');
if (!$source) {
    exit('Unable to open input file.');
}

// Read the header once and work out which column holds the SiteID
$headerRow = fgetcsv($source, 0, ';');
if (!$headerRow) {
    exit('Unable to read header row.');
}
$columnIndexes = array_flip($headerRow);
$siteIDColumn = $columnIndexes['SiteID'];

// One output handle per SiteID, kept open for the whole run
$handles = [];
while (($row = fgetcsv($source, 0, ';')) !== false) {
    $siteID = $row[$siteIDColumn];
    if (!isset($handles[$siteID])) {
        $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $handles[$siteID] = fopen($newCSVFilename, 'w');
        if (!$handles[$siteID]) {
            exit('Unable to open output file for SiteID: ' . $siteID);
        }
        fputcsv($handles[$siteID], $headerRow, ';');
    }
    fputcsv($handles[$siteID], $row, ';');
}

foreach ($handles as $handle) {
    fclose($handle);
}
fclose($source);

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';
2 Answers
Touching the filesystem involves some of the slowest IO a computer generally does, and PHP abstracts a lot of optimizations away from you. When you repeatedly open a file, write a very small amount of data, and then close it again, you not only make those optimizations meaningless, you also do just about the worst thing you can do: constantly flush tiny writes to disk.
For something like this you should be opening those handles once, which might look roughly like:
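(The column name, the ';' delimiter, and the output layout below are assumed from the question; treat this as a sketch rather than a drop-in script.)

<?php
// Open the source once, read the header, and find the SiteID column.
$source = fopen('air-quality-data-2003-2022.csv', 'r');
$header = fgetcsv($source, 0, ';');
$siteIdCol = array_search('SiteID', $header);

// One output handle per SiteID, opened the first time that site appears
// and kept open until the end of the run.
$handles = [];
while (($row = fgetcsv($source, 0, ';')) !== false) {
    $siteID = $row[$siteIdCol];
    if (!isset($handles[$siteID])) {
        $handles[$siteID] = fopen("output/data-$siteID.csv", 'w');
        fputcsv($handles[$siteID], $header, ';');
    }
    fputcsv($handles[$siteID], $row, ';');
}

foreach ($handles as $handle) {
    fclose($handle);
}
fclose($source);

Each output file is now written sequentially through a single handle, so PHP's stream buffering and the OS can batch the writes instead of flushing every row individually.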
Additionally, there is functionally little difference between fgetcsv() and str_getcsv(fgets()), but implode(';', $row) is not an appropriate way to encode CSV: it performs no quoting or escaping, so any field that contains the delimiter or a quote character will produce a broken row.
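For instance, a field that happens to contain the delimiter shows the difference (the values below are made up purely for illustration):

<?php
// One row whose second field contains the ';' delimiter.
$row = ['501', 'Temple Way; Bristol', '42.1'];

// implode() just glues the values together, so the row gains an extra column:
echo implode(';', $row) . PHP_EOL;   // 501;Temple Way; Bristol;42.1

// fputcsv() quotes the problematic field and keeps it as one column:
$out = fopen('php://stdout', 'w');
fputcsv($out, $row, ';');            // 501;"Temple Way; Bristol";42.1
fclose($out);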
Your execution times stem from three sources:

So for example:
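The idea behind the buffering is to accumulate rows per SiteID in a string and only hit the disk once the string passes some threshold, so the filesystem sees a few large writes instead of one tiny write per row. A rough sketch; the BUFFERSIZE value, the file names, and the assumption that no field contains an embedded newline are mine, not from the original timing runs:

<?php
const BUFFERSIZE = 1048576; // 1 MiB per site; an arbitrary starting point

$source = fopen('air-quality-data-2003-2022.csv', 'r');

// Keep the raw header line so it can be written out verbatim,
// and parse it only to locate the SiteID column.
$headerLine = fgets($source);
$header = str_getcsv($headerLine, ';');
$siteIdCol = array_search('SiteID', $header);

$handles = [];
$buffers = [];

while (($line = fgets($source)) !== false) {
    $row = str_getcsv($line, ';');
    $siteID = $row[$siteIdCol];
    if (!isset($handles[$siteID])) {
        $handles[$siteID] = fopen("output/data-$siteID.csv", 'w');
        $buffers[$siteID] = $headerLine;
    }
    // Append the raw line; no re-encoding is needed because the output
    // uses the same delimiter and quoting as the input.
    $buffers[$siteID] .= $line;
    // Flush this site's buffer once it grows past BUFFERSIZE.
    if (strlen($buffers[$siteID]) >= BUFFERSIZE) {
        fwrite($handles[$siteID], $buffers[$siteID]);
        $buffers[$siteID] = '';
    }
}

// Write out whatever is left in each buffer, then close everything.
foreach ($handles as $siteID => $handle) {
    fwrite($handle, $buffers[$siteID]);
    fclose($handle);
}
fclose($source);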
yields (on my system) a clear improvement. I've run some experiments with different BUFFERSIZEs (the reported memory is memory beyond what the script had already allocated).
Note the variability (I should probably have rebooted, or at least run sync and flushed the cache, between tests), and the fact that it doesn't take much buffer to improve speed (past a certain point, efficiency starts decreasing again as PHP struggles to handle very large string concatenation). The buffer size that works best will depend on the underlying file system: if it is cache-backed, as mine is, chances are that large BUFFERSIZEs will not change things very much.