I need to extract the headings in my PDF file that start with a # symbol using PHP, but I don't know how to do it. Here is the link to my PDF file:

https://afxwebdesign.com/order.pdf

I have tried this script:

<?php
// Load the PDF file
$pdfFile = 'order.pdf';

// Use a PDF parsing library like TCPDF or FPDI to extract text
// Code snippet using TCPDF

require_once('tcpdf.php');
require_once('vendor/setasign/fpdi/src/autoload.php');

use setasign\Fpdi\Tcpdf\Fpdi;
$pdf = new Fpdi();
$pageCount = $pdf->setSourceFile($pdfFile);

for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $templateId = $pdf->importPage($pageNo);
    $text = $pdf->getPageContent($pageNo);

    preg_match_all('/^#[^#].*$/m', $text, $headings);

    foreach ($headings[0] as $heading) {
        echo $heading . "\n";
    }
}

$pdf->close();
?>

But it’s not working – it throws this error:

Fatal error: Uncaught Error: Call to undefined method
setasign\Fpdi\Tcpdf\Fpdi::getPageContent() in
C:\xampp\htdocs\pdfextract\index.php:17 Stack trace: #0 {main} thrown
in C:\xampp\htdocs\pdfextract\index.php on line 17

2 Answers


  1. Chosen as BEST ANSWER

    I gave up on using PHP to extract the lines containing the '#' symbol and used the pdf.js JavaScript library instead, which works fine. Here is the complete code. It also renders each matching page to a canvas and uploads a screenshot of it to the server.

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>PDF Line Extractor with Screenshot</title>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js"></script>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.4.1/html2canvas.min.js"></script>
    </head>
    <body>
        <input type="file" id="file-input" />
        <pre id="output"></pre>
        <canvas id="pdf-canvas"></canvas>
        <script>
            document.getElementById('file-input').addEventListener('change', function(event) {
                const file = event.target.files[0];
                if (file) {
                    const reader = new FileReader();
                    reader.onload = function(e) {
                        const typedarray = new Uint8Array(e.target.result);
                        extractLinesAndScreenshotsFromPDF(typedarray).then(lines => {
                            document.getElementById('output').textContent = lines.join('\n');
                        }).catch(error => {
                            console.error('Error extracting lines and screenshots:', error);
                        });
                    };
                    reader.readAsArrayBuffer(file);
                }
            });
    
            async function extractLinesAndScreenshotsFromPDF(data) {
                const pdf = await pdfjsLib.getDocument({ data }).promise;
    
                let extractedLines = [];
    
                for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
                    const page = await pdf.getPage(pageNum);
                    const textContent = await page.getTextContent();
    
                    // Group text items by their y-coordinate
                    const groupedText = {};
                    textContent.items.forEach(item => {
                        const y = Math.floor(item.transform[5]);  // Use y-coordinate for grouping
                        if (!groupedText[y]) {
                            groupedText[y] = [];
                        }
                        groupedText[y].push(item.str);
                    });
    
                    // Concatenate items to form complete lines
                    const pageTextLines = Object.values(groupedText).map(items => items.join(' '));
                    const filteredLines = pageTextLines.filter(line => line.includes('#'));
    
                    if (filteredLines.length > 0) {
                        extractedLines = extractedLines.concat(filteredLines);
    
                        // Render the page on canvas
                        const viewport = page.getViewport({ scale: 1.5 });
                        const canvas = document.getElementById('pdf-canvas');
                        const context = canvas.getContext('2d');
                        canvas.width = viewport.width;
                        canvas.height = viewport.height;
    
                        await page.render({ canvasContext: context, viewport }).promise;
    
                        // Take screenshot and send to server
                        html2canvas(canvas).then(canvas => {
                            const imgData = canvas.toDataURL('image/png');
                            const blob = dataURLToBlob(imgData);
                            const formData = new FormData();
                            const dt= "<?= date('MdyHis') ?>";
                            const randd = Math.floor(Math.random() * 9999990);
                            formData.append('screenshot', blob, `screenshot-${pageNum+dt+randd}.png`);
                            fetch('save_screenshot.php', {
                                method: 'POST',
                                body: formData
                            }).then(response => {
                                if (!response.ok) {
                                    throw new Error('Network response was not ok');
                                }
                                return response.text();
                            }).then(data => {
                                console.log('Screenshot saved:', data);
                            }).catch(error => {
                                console.error('Error saving screenshot:', error);
                            });
                        });
                    }
                }
    
                return extractedLines;
            }
    
            function dataURLToBlob(dataURL) {
                const byteString = atob(dataURL.split(',')[1]);
                const mimeString = dataURL.split(',')[0].split(':')[1].split(';')[0];
                const ab = new ArrayBuffer(byteString.length);
                const ia = new Uint8Array(ab);
                for (let i = 0; i < byteString.length; i++) {
                    ia[i] = byteString.charCodeAt(i);
                }
                return new Blob([ab], { type: mimeString });
            }
        </script>
    </body>
    </html>
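
The line-grouping step in the code above can be isolated and tested without a browser. Two details are worth noting: `Object.values` on an object with integer-like keys returns them in ascending numeric order, and PDF y-coordinates normally increase upward, so the grouped lines typically come out bottom-to-top on each page. `Math.floor` can also split one visual line in two when its items sit at, say, y = 499.8 and y = 500.2. The sketch below is one possible way to handle both; `groupItemsIntoLines`, `headingLines`, and the `tolerance` parameter are hypothetical names, and the input is assumed to have the same shape as the `items` array returned by `getTextContent()`.

```javascript
// Hypothetical helper: groups pdf.js-style text items into visual lines.
// Each item is assumed to look like { str, transform: [a, b, c, d, x, y] }.
function groupItemsIntoLines(items, tolerance = 2) {
  const rows = []; // each row: { y, items: [{ x, str }] }
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    // Reuse an existing row if its y is within the tolerance.
    let row = rows.find(r => Math.abs(r.y - y) <= tolerance);
    if (!row) {
      row = { y, items: [] };
      rows.push(row);
    }
    row.items.push({ x, str: item.str });
  }
  // Larger y is assumed to be higher on the page, so sort descending
  // to get top-to-bottom reading order.
  rows.sort((a, b) => b.y - a.y);
  return rows.map(r =>
    r.items
      .sort((a, b) => a.x - b.x) // left-to-right within a line
      .map(i => i.str)
      .join(' ')
      .replace(/\s+/g, ' ')
      .trim()
  );
}

// Keeps only lines that start with a single '#'.
function headingLines(items) {
  return groupItemsIntoLines(items).filter(line => /^#[^#]/.test(line));
}
```

Unlike `line.includes('#')`, the filter here only keeps lines that begin with `#`, which is closer to the pattern used in the question. Use whichever matches the headings in your file.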
    

  2. If you can shell out to the PDF utilities from xpdf or poppler, it is as simple as:

    pdftotext -nopgbrk order.pdf - | find "#"
    

    or any equivalent grep command if you are not on Windows. Redirect the output to a file or another target as needed.

    Result

    #01 Custom Made Oversized Hoodies
    #02 Custom Made Boxy Fit Long Sleeves
    #03 Custom Made Boxy Fit Sweatshirts
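
The same pipeline can be driven from a script instead of a shell prompt. The following is a minimal Node.js sketch, assuming `pdftotext` is installed and on the PATH; `filterHeadings` and `extractHeadings` are hypothetical names, and the filter uses the start-of-line pattern from the question rather than matching `#` anywhere in the line.

```javascript
const { execFile } = require('child_process');

// Keeps lines that start with a single '#'.
function filterHeadings(text) {
  return text.split(/\r?\n/).filter(line => /^#[^#]/.test(line));
}

// Runs `pdftotext -nopgbrk <pdfPath> -` (text goes to stdout) and
// resolves with the heading lines found in the output.
function extractHeadings(pdfPath) {
  return new Promise((resolve, reject) => {
    execFile('pdftotext', ['-nopgbrk', pdfPath, '-'], (err, stdout) => {
      if (err) return reject(err);
      resolve(filterHeadings(stdout));
    });
  });
}

// Usage (not executed here):
// extractHeadings('order.pdf').then(console.log, console.error);
```

Passing the arguments as an array to `execFile` avoids shell quoting problems with file names that contain spaces.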
    
