I have created this PHP script, which takes a very long time to filter the CSV file into 19 smaller ones. The link to the CSVs is in this Google Drive folder:
https://drive.google.com/drive/folders/1Bju4kkVgo21xu3IFeyKWNJ1uClTn534J?usp=share_link
I have the script processing the file line by line to save memory, but it still exceeds the maximum execution time allowed for the PHP script. Is there a better way to achieve this file breakdown?
<?php
@date_default_timezone_set("GMT");
ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// Create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$handle = fopen($inputFile, 'r');
if ($handle) {
    $headerRow = null;
    $siteIDs = array();
    while (($data = fgets($handle)) !== false) {
        $row = str_getcsv($data, ";");
        if ($headerRow === null) {
            $headerRow = $row;
            continue; // Skip processing the header row
        }
        $siteID = $row[array_search('SiteID', $headerRow)];
        // First time this SiteID appears: create its file and write the header
        if (!in_array($siteID, $siteIDs)) {
            $siteIDs[] = $siteID;
            $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
            $f = fopen($newCSVFilename, 'w');
            fputs($f, implode(';', $headerRow) . PHP_EOL);
            fclose($f);
        }
        // Re-open the file in append mode for every single row
        $currentCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $f = fopen($currentCSVFilename, 'a');
        fputs($f, implode(';', $row) . PHP_EOL);
        fclose($f);
    }
    fclose($handle);
}

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';
?>
It took a fair amount of time just to get the file processing working at all. I was going to switch to fgetcsv(), but my lecturer told me that method is actually slower?
In reply to Sammath: would something like this be more in line with what's necessary?
@date_default_timezone_set("GMT");
ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// Create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$source = fopen($inputFile, 'r');
if (!$source) {
    exit('Unable to open input file.');
}

// Read the header once and work out which column holds the SiteID
$headerRow = fgetcsv($source, 0, ';');
if (!$headerRow) {
    exit('Unable to read header row.');
}
$columnIndexes = array_flip($headerRow);
$siteIDColumn = $columnIndexes['SiteID'];

// One output handle per SiteID, kept open for the whole run
$handles = [];
while (($row = fgetcsv($source, 0, ';')) !== false) {
    $siteID = $row[$siteIDColumn];
    if (!isset($handles[$siteID])) {
        $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $handles[$siteID] = fopen($newCSVFilename, 'w');
        if (!$handles[$siteID]) {
            exit('Unable to open output file for SiteID: ' . $siteID);
        }
        fputcsv($handles[$siteID], $headerRow, ';');
    }
    fputcsv($handles[$siteID], $row, ';');
}

foreach ($handles as $handle) {
    fclose($handle);
}
fclose($source);

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';
2 Answers
Touching the filesystem involves some of the slowest IO a computer generally does, and PHP abstracts a lot of optimizations away from you. When you repeatedly open a file, write a very small amount of data, and then close it again, you not only make those optimizations meaningless, you also do just about the worst thing you can do: constantly flush tiny writes to disk.
For something like this you should be opening those handles once, which might look roughly like:
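(The column name, the ';' delimiter, and the output layout below are assumed from the question; treat this as a sketch rather than a drop-in script.)

<?php
// Open the source once, read the header, and find the SiteID column.
$source = fopen('air-quality-data-2003-2022.csv', 'r');
$header = fgetcsv($source, 0, ';');
$siteIdCol = array_search('SiteID', $header);

// One output handle per SiteID, opened the first time that site appears
// and kept open until the end of the run.
$handles = [];
while (($row = fgetcsv($source, 0, ';')) !== false) {
    $siteID = $row[$siteIdCol];
    if (!isset($handles[$siteID])) {
        $handles[$siteID] = fopen("output/data-$siteID.csv", 'w');
        fputcsv($handles[$siteID], $header, ';');
    }
    fputcsv($handles[$siteID], $row, ';');
}

foreach ($handles as $handle) {
    fclose($handle);
}
fclose($source);

Each output file is now written sequentially through a single handle, so PHP's stream buffering and the OS can batch the writes instead of flushing every row individually.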
Additionally, there is functionally little difference between fgetcsv() and str_getcsv(fgets()), but implode(';', $row) is not an appropriate way to encode CSV: it performs no quoting or escaping, so any field that contains the delimiter or a quote character will produce a broken row.
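For instance, a field that happens to contain the delimiter shows the difference (the values below are made up purely for illustration):

<?php
// One row whose second field contains the ';' delimiter.
$row = ['501', 'Temple Way; Bristol', '42.1'];

// implode() just glues the values together, so the row gains an extra column:
echo implode(';', $row) . PHP_EOL;   // 501;Temple Way; Bristol;42.1

// fputcsv() quotes the problematic field and keeps it as one column:
$out = fopen('php://stdout', 'w');
fputcsv($out, $row, ';');            // 501;"Temple Way; Bristol";42.1
fclose($out);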
Your execution times stem from three sources:

So for example:
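The idea behind the buffering is to accumulate rows per SiteID in a string and only hit the disk once the string passes some threshold, so the filesystem sees a few large writes instead of one tiny write per row. A rough sketch; the BUFFERSIZE value, the file names, and the assumption that no field contains an embedded newline are mine, not from the original timing runs:

<?php
const BUFFERSIZE = 1048576; // 1 MiB per site; an arbitrary starting point

$source = fopen('air-quality-data-2003-2022.csv', 'r');

// Keep the raw header line so it can be written out verbatim,
// and parse it only to locate the SiteID column.
$headerLine = fgets($source);
$header = str_getcsv($headerLine, ';');
$siteIdCol = array_search('SiteID', $header);

$handles = [];
$buffers = [];

while (($line = fgets($source)) !== false) {
    $row = str_getcsv($line, ';');
    $siteID = $row[$siteIdCol];
    if (!isset($handles[$siteID])) {
        $handles[$siteID] = fopen("output/data-$siteID.csv", 'w');
        $buffers[$siteID] = $headerLine;
    }
    // Append the raw line; no re-encoding is needed because the output
    // uses the same delimiter and quoting as the input.
    $buffers[$siteID] .= $line;
    // Flush this site's buffer once it grows past BUFFERSIZE.
    if (strlen($buffers[$siteID]) >= BUFFERSIZE) {
        fwrite($handles[$siteID], $buffers[$siteID]);
        $buffers[$siteID] = '';
    }
}

// Write out whatever is left in each buffer, then close everything.
foreach ($handles as $siteID => $handle) {
    fwrite($handle, $buffers[$siteID]);
    fclose($handle);
}
fclose($source);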
yields (on my system) a clear improvement. I've run some experiments with different BUFFERSIZEs (the reported memory is memory beyond what the script had already allocated).
Note the variability (I should probably have rebooted, or at least run sync and flushed the cache, between tests), and the fact that it doesn't take much buffer to improve speed (past a certain point, efficiency starts decreasing again as PHP struggles to handle very large string concatenation). The buffer size that works best will depend on the underlying file system: if it is cache-backed, as mine is, chances are that large BUFFERSIZEs will not change things very much.