PHP - need to extract all headings that have # symbols from a PDF file

KamranShah
June 5, 2024
112 views
1 vote
2 Answers

I need to extract the headings in my PDF file that start with a # symbol, using PHP, but I don’t know how to do it. Here is a link to my PDF file:

https://afxwebdesign.com/order.pdf

I have tried this script:

<?php
// Load the PDF file
$pdfFile = 'order.pdf';

// Use a PDF parsing library like TCPDF or FPDI to extract text
// Code snippet using TCPDF

require_once('tcpdf.php');
require_once('vendor/setasign/fpdi/src/autoload.php');

use setasign\Fpdi\Tcpdf\Fpdi;
$pdf = new Fpdi();
$pageCount = $pdf->setSourceFile($pdfFile);

for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $templateId = $pdf->importPage($pageNo);
    $text = $pdf->getPageContent($pageNo);

    preg_match_all('/^#[^#].*$/m', $text, $headings);

    foreach ($headings[0] as $heading) {
        echo $heading . "\n";
    }
}

$pdf->close();
?>

But it’s not working – it throws this error:

Fatal error: Uncaught Error: Call to undefined method
setasign\Fpdi\Tcpdf\Fpdi::getPageContent() in
C:\xampp\htdocs\pdfextract\index.php:17 Stack trace: #0 {main} thrown
in C:\xampp\htdocs\pdfextract\index.php on line 17

Answers

Chosen as BEST ANSWER

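I gave up on extracting the text around the '#' symbol with PHP and used the pdf.js JavaScript library instead, which works fine. Besides listing every line that contains '#', the page renders each matching PDF page to a canvas and posts a screenshot to save_screenshot.php (not shown here). Here is the complete code: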

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>PDF Line Extractor with Screenshot</title>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.4.1/html2canvas.min.js"></script>
</head>
<body>
    <input type="file" id="file-input" />
    <pre id="output"></pre>
    <canvas id="pdf-canvas"></canvas>
    <script>
        document.getElementById('file-input').addEventListener('change', function(event) {
            const file = event.target.files[0];
            if (file) {
                const reader = new FileReader();
                reader.onload = function(e) {
                    const typedarray = new Uint8Array(e.target.result);
                    extractLinesAndScreenshotsFromPDF(typedarray).then(lines => {
                        document.getElementById('output').textContent = lines.join('\n');
                    }).catch(error => {
                        console.error('Error extracting lines and screenshots:', error);
                    });
                };
                reader.readAsArrayBuffer(file);
            }
        });

        async function extractLinesAndScreenshotsFromPDF(data) {
            const pdf = await pdfjsLib.getDocument({ data }).promise;

            let extractedLines = [];

            for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
                const page = await pdf.getPage(pageNum);
                const textContent = await page.getTextContent();

                // Group text items by their y-coordinate
                const groupedText = {};
                textContent.items.forEach(item => {
                    const y = Math.floor(item.transform[5]);  // Use y-coordinate for grouping
                    if (!groupedText[y]) {
                        groupedText[y] = [];
                    }
                    groupedText[y].push(item.str);
                });

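                // Concatenate items to form complete lines, top of the page first
                // (PDF y-coordinates grow upward, so sort the keys in descending order)
                const pageTextLines = Object.keys(groupedText)
                    .sort((a, b) => b - a)
                    .map(y => groupedText[y].join(' '));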
                const filteredLines = pageTextLines.filter(line => line.includes('#'));

                if (filteredLines.length > 0) {
                    extractedLines = extractedLines.concat(filteredLines);

                    // Render the page on canvas
                    const viewport = page.getViewport({ scale: 1.5 });
                    const canvas = document.getElementById('pdf-canvas');
                    const context = canvas.getContext('2d');
                    canvas.width = viewport.width;
                    canvas.height = viewport.height;

                    await page.render({ canvasContext: context, viewport }).promise;

                    // Take screenshot and send to server
                    html2canvas(canvas).then(canvas => {
                        const imgData = canvas.toDataURL('image/png');
                        const blob = dataURLToBlob(imgData);
                        const formData = new FormData();
                        const dt= "<?= date('MdyHis') ?>";
                        const randd = Math.floor(Math.random() * 9999990);
                        formData.append('screenshot', blob, `screenshot-${pageNum+dt+randd}.png`);
                        fetch('save_screenshot.php', {
                            method: 'POST',
                            body: formData
                        }).then(response => {
                            if (!response.ok) {
                                throw new Error('Network response was not ok');
                            }
                            return response.text();
                        }).then(data => {
                            console.log('Screenshot saved:', data);
                        }).catch(error => {
                            console.error('Error saving screenshot:', error);
                        });
                    });
                }
            }

            return extractedLines;
        }

        function dataURLToBlob(dataURL) {
            const byteString = atob(dataURL.split(',')[1]);
            const mimeString = dataURL.split(',')[0].split(':')[1].split(';')[0];
            const ab = new ArrayBuffer(byteString.length);
            const ia = new Uint8Array(ab);
            for (let i = 0; i < byteString.length; i++) {
                ia[i] = byteString.charCodeAt(i);
            }
            return new Blob([ab], { type: mimeString });
        }
    </script>
</body>
</html>
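
The grouping step above relies on text items of one line sharing exactly the same floored y-coordinate. A slightly more tolerant variant can be sketched as a pure function. The names `groupItemsIntoLines`, `extractHashHeadings` and the `yTolerance` parameter are hypothetical, and the only thing assumed about pdf.js is that each text item carries `str` and a `transform` array with x at index 4 and y at index 5, as the code above already uses.

```javascript
// Group pdf.js text items into visual lines.
// `items` has the shape of (await page.getTextContent()).items.
// `yTolerance` (an assumed default of 2 units) decides how far apart two
// baselines may be while still counting as the same line.
function groupItemsIntoLines(items, yTolerance = 2) {
    // PDF y-coordinates grow upward, so descending y means top of page first.
    const sorted = items
        .filter(item => item.str.trim() !== '')
        .sort((a, b) => b.transform[5] - a.transform[5]);

    const lines = [];
    for (const item of sorted) {
        const y = item.transform[5];
        const last = lines[lines.length - 1];
        if (last && Math.abs(last.y - y) <= yTolerance) {
            last.items.push(item);            // same visual line
        } else {
            lines.push({ y, items: [item] }); // start a new line
        }
    }

    // Within each line, order the fragments left to right before joining.
    return lines.map(line =>
        line.items
            .sort((a, b) => a.transform[4] - b.transform[4])
            .map(item => item.str.trim())
            .join(' ')
    );
}

// Keep only the lines that start with '#', as the question asks.
function extractHashHeadings(items, yTolerance = 2) {
    return groupItemsIntoLines(items, yTolerance)
        .filter(line => line.startsWith('#'));
}
```

Inside the page loop this could be called as `extractHashHeadings(textContent.items)`. Unlike `line.includes('#')`, it skips lines where '#' only appears in the middle of the text.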

KJ
June 5, 2024 at 4:52 pm
0 votes

If you can shell out to the PDF utilities from Xpdf or Poppler, it is as simple as:
```
pdftotext -nopgbrk order.pdf - | find "#"
```
or any equivalent grep command if you are not on Windows. Redirect the output to a file or other target as needed.

Result
```
#01 Custom Made Oversized Hoodies
#02 Custom Made Boxy Fit Long Sleeves
#03 Custom Made Boxy Fit Sweatshirts
```
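
If the pdftotext output is captured in a script rather than piped through find, the filter can be stricter and keep only lines that begin with a single '#', as the regular expression in the question intended. The following is a minimal sketch in JavaScript (Node.js). `filterHashHeadings` and `headingsFromPdf` are hypothetical names, and the wrapper assumes `pdftotext` is installed and on the PATH.

```javascript
// Keep only lines that start with a single '#', mirroring the
// /^#[^#].*$/m pattern from the question, applied line by line.
function filterHashHeadings(text) {
    return text
        .split(/\r?\n/)                 // handle both Windows and Unix line endings
        .map(line => line.trim())
        .filter(line => /^#[^#]/.test(line));
}

// Hypothetical wrapper: runs `pdftotext -nopgbrk <file> -` and filters its output.
// Assumes the pdftotext binary from Xpdf or Poppler is available on the PATH.
function headingsFromPdf(pdfPath) {
    const { execFileSync } = require('child_process');
    const text = execFileSync('pdftotext', ['-nopgbrk', pdfPath, '-'], { encoding: 'utf8' });
    return filterHashHeadings(text);
}
```

Calling `headingsFromPdf('order.pdf')` would return the matching lines as an array. Note that `find "#"` matches any line containing '#', whereas this sketch only matches lines that start with it.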