I am trying to split large files into individual pages, using PHP’s FPDI library.
For some reason, splitting the file does little to reduce the file size. For example, the following script applied to a 30-page, 1MB file produces 30 files of around 0.9MB each, i.e. a total of around 26MB!
This suggests to me that a large portion of the original file is retained in every output file, even though it is not required.
Questions:
- Is this avoidable?
- Is this a bug in FPDI?
- Is there an alternative PHP library that is more efficient at splitting?
More detail
I’ve reproduced this issue in a variety of configurations:
- FPDI version 1 (no longer supported) and FPDI version 2
- Using FPDF and TCPDF
- PHP 5.4 and PHP 5.6
- Various PDF files, including files generated using FPDF and TCPDF
Here is some PHP code to illustrate the issue:
<?php
testPdfSplit();

function testPdfSplit()
{
    echo phpversion();

    // Load a file
    $contentPath = "/path/to/local/files/original_file.pdf";
    copy("https://file-examples.com/wp-content/uploads/2017/10/file-example_PDF_1MB.pdf", $contentPath);
    $numpages = 30;

    // Get the original file size
    $fileSize = round(filesize($contentPath) / (1024 * 1024), 3);
    echo "<p>Original file is $fileSize MB</p>";

    for ($i = 1; $i <= $numpages; $i++)
    {
        echo "<p>Creating file for page $i</p>";
        $filePath = "/path/to/local/files/test.$i.pdf";
        try
        {
            // Pass the source path (the original passed an undefined $content)
            selectOnePage($contentPath, $i, $filePath);
        }
        catch (Exception $e)
        {
            die ("<pre>ERROR: $e</pre>");
        }
        $fileSize = round(filesize($filePath) / (1024 * 1024), 3);
        echo "<p>$filePath is $fileSize MB</p>";
    }
}

function selectOnePage($filePathIn, $pageNo, $filePathOut)
{
    require_once('fpdf/fpdf.php');
    require_once('fpdi/src/autoload.php');

    // initiate FPDI (FPDI 2 namespaced class)
    $pdf = new \setasign\Fpdi\Fpdi();

    // get the page count
    $pageCount = $pdf->setSourceFile($filePathIn);
    echo "<p>Selecting page $pageNo / $pageCount</p>";

    // import a page
    $pdf->AddPage();
    $templateId = $pdf->importPage($pageNo);
    $pdf->useImportedPage($templateId);

    // output the file
    $pdf->Output($filePathOut, 'F');
}
2 Answers
This appears to be a general problem with most PDF tools. It is also a problem with pdftk and cpdf, as described in "pdftk split pdf with multiple pages". Most PDFs I have come across have a single resource dictionary, so the unused resources cannot easily be dropped when splitting (thanks to @Jan Slabon for the explanation).
FPDI does not analyze the used resources of an imported page and copies all referenced resources.
If a document e.g. has only a single resource dictionary (a common structure), all resources are copied.
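To make the explanation above concrete, here is a toy model of the effect. It is only a sketch and not how FPDI represents PDFs internally: the resource names (`Im1`, `F1`, ...), their sizes, and the helper functions `naiveSplitSize()` and `prunedSplitSize()` are all hypothetical. It assumes a three-page document whose pages share one resource dictionary, and compares copying every referenced resource with keeping only the resources each page actually uses.

```php
<?php
// Toy model only: NOT FPDI's internal representation.
// All names and sizes below are made up for illustration.

// One resource dictionary shared by every page (name => size in KB).
$sharedResources = ['Im1' => 300, 'Im2' => 300, 'Im3' => 300, 'F1' => 20];

// Each page's content stream only uses some of those names.
$pages = [
    1 => ['F1', 'Im1'],
    2 => ['F1', 'Im2'],
    3 => ['F1', 'Im3'],
];

/** Naive split: copy everything the page's resource dictionary references. */
function naiveSplitSize(array $resources): int
{
    return array_sum($resources);
}

/** Pruned split: keep only the resources the page actually uses. */
function prunedSplitSize(array $resources, array $usedNames): int
{
    return array_sum(array_intersect_key($resources, array_flip($usedNames)));
}

foreach ($pages as $pageNo => $usedNames) {
    printf(
        "page %d: naive %d KB, pruned %d KB\n",
        $pageNo,
        naiveSplitSize($sharedResources),
        prunedSplitSize($sharedResources, $usedNames)
    );
}
```

In this model every naively split page carries all 920 KB of resources, so three pages add up to roughly three times the original, while a pruned page would carry only 320 KB. Working out which names a real page uses requires parsing its content stream, which is the analysis step that FPDI does not perform.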
We also offer a commercial (non-free) tool for merging and splitting PDF documents: the SetaPDF-Merger component. By default this tool has the same problem, but we have prepared a demo with some code that removes unused resources after the split process. You can find the demo and code here.