I need to use PHP to extract the headings in my PDF file that start with a #
symbol, but I don’t know how to do it. Here is the link to my PDF file:
https://afxwebdesign.com/order.pdf
I have tried this script:
<?php
// Load the PDF file
$pdfFile = 'order.pdf';
// Use a PDF parsing library like TCPDF or FPDI to extract text
// Code snippet using TCPDF
require_once('tcpdf.php');
require_once('vendor/setasign/fpdi/src/autoload.php');
use setasign\Fpdi\Tcpdf\Fpdi;
$pdf = new Fpdi();
$pageCount = $pdf->setSourceFile($pdfFile);
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $templateId = $pdf->importPage($pageNo);
    $text = $pdf->getPageContent($pageNo);
    preg_match_all('/^#[^#].*$/m', $text, $headings);
    foreach ($headings[0] as $heading) {
        echo $heading . "\n";
    }
}
$pdf->close();
?>
But it’s not working – it throws this error:
Fatal error: Uncaught Error: Call to undefined method
setasign\Fpdi\Tcpdf\Fpdi::getPageContent() in
C:\xampp\htdocs\pdfextract\index.php:17 Stack trace: #0 {main} thrown
in C:\xampp\htdocs\pdfextract\index.php on line 17
2 Answers
I skipped PHP for extracting the text where the '#' symbol is located and used the pdf.js JavaScript library instead, which works fine. Here is the complete JavaScript code.
If you can shell out to the PDF utilities from xpdf or poppler,
it is as simple as
or any equivalent grep command if you are not on Windows, with the output redirected to a file or other target.
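The exact command is not reproduced here. As a rough sketch of the non-Windows variant, assuming poppler's (or xpdf's) `pdftotext` is on the PATH and `order.pdf` is in the current directory, something like the following might work. The `filter_headings` helper and the `headings.txt` output name are made up for illustration, and the pattern treats any line whose first non-blank character is `#` as a heading, which may need tightening for a real document.

```shell
#!/bin/sh
# Hypothetical helper (not part of xpdf or poppler): reads text on stdin and
# keeps only lines whose first non-blank character is '#'.
filter_headings() {
    grep -E '^[[:space:]]*#'
}

# Assumption: pdftotext is installed and order.pdf exists in this directory.
# "-layout" keeps the physical line layout; "-" writes the text to stdout.
if command -v pdftotext >/dev/null 2>&1 && [ -f order.pdf ]; then
    pdftotext -layout order.pdf - | filter_headings > headings.txt
fi
```

From PHP, the same pipeline could be run through `shell_exec()` if shelling out is allowed on the server.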
Result