
I have a series of PDF files on my shared hosting webserver, and I’m writing a PHP script to catalogue them on screen. I’ve added metadata to the PDF files: Document Title, Author and Subject. The filename is composed of the Author and Title, so I can construct the catalogue text from that. However, I want to display the contents of the ‘Subject’ metadata field as well.

Because I’m using shared hosting, I cannot install any extra PHP extensions. The host has the free version of PDFLib, but this doesn’t include any functions to load a PDF file or to extract its metadata.

This is the script so far, which just displays a list of the filenames…

function catalogue($folder){
  // List the folder contents, skipping dotfiles such as '.' and '..'
  $files = preg_grep('/^([^.])/', scandir($folder));
  foreach($files as $file){
    echo($file.'<br/>');
  }
}

So, I’ve not made much progress 🙁

I’ve tried PDF_open_pdi_document() but this is not part of the installed PDFLib extension. I’ve tried PDF_pcos_get_string() but all I get with…

PDF_pcos_get_string($file,0,'author');

…is…

pdf_pcos_get_string(): supplied resource is not a valid pdf object resource

…and I can find literally ZERO help on the web for this function. Literally nothing!
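
As far as I can tell, the error means the first argument has to be a PDFLib object created with PDF_new() rather than a filename, and the pCOS functions need a document handle from PDF_open_pdi_document(), which brings me back to the missing PDI functionality. For reference, on a host with the full PDFLib+PDI product I assume the calls would look something like this (untested sketch; ‘example.pdf’ is just a placeholder):

// For reference only: this needs the full PDFLib+PDI product, not the free build installed here
$p = PDF_new();                                      // a PDFLib object, i.e. the resource the error message is complaining about
$doc = PDF_open_pdi_document($p, 'example.pdf', ''); // open the PDF for import and get a document handle
// pCOS paths address entries in the document's Info dictionary
$author  = PDF_pcos_get_string($p, $doc, '/Info/Author');
$subject = PDF_pcos_get_string($p, $doc, '/Info/Subject');
$title   = PDF_pcos_get_string($p, $doc, '/Info/Title');
PDF_close_pdi_document($p, $doc);
PDF_delete($p);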

I am running PHP 7.4 on the shared hosting.

2 Answers


  1. Chosen as BEST ANSWER

    Thank you @drdlp. I've used file_get_contents() to load in the PDF and extract and display the metadata.

    function catalogue($folder){
      $files = preg_grep('/^([^.])/', scandir($folder));
      foreach($files as $file){
        // Read the raw PDF and pull out every /Key (Value) string in it
        $page = file_get_contents($folder.'/'.$file);
        preg_match_all('/\/[^(]*\(([^\/)]*)\)/', $page, $matches);
        // These positions depend on the order the keys happen to appear in my PDFs
        $author = $matches[1][0];
        $subject = $matches[1][4];
        $title = $matches[1][5];
        echo($title.'/'.$subject.'/'.$author.'<br>');
      }
    }
    

    However, this is very slow for 40-odd PDF articles in a folder.

    How can I speed this up?
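
    One idea might be to cache the extracted strings so each PDF is only re-read when it changes, something along these lines (an untested sketch, and 'pdf_cache.json' is just a made-up file name), but I don't know whether that's the right approach:

    function catalogue_cached($folder){
      $cachefile = $folder.'/pdf_cache.json';
      // Load any metadata extracted on a previous run
      $cache = array();
      if(is_file($cachefile)){
        $cache = json_decode(file_get_contents($cachefile), true) ?: array();
      }
      $files = preg_grep('/^([^.])/', scandir($folder));
      foreach($files as $file){
        $path = $folder.'/'.$file;
        $mtime = filemtime($path);
        // Only re-read a PDF if it is new or has changed since it was cached
        if(!isset($cache[$file]) || $cache[$file]['mtime'] !== $mtime){
          $page = file_get_contents($path);
          preg_match_all('/\/[^(]*\(([^\/)]*)\)/', $page, $matches);
          $cache[$file] = array(
            'mtime'   => $mtime,
            'author'  => $matches[1][0],
            'subject' => $matches[1][4],
            'title'   => $matches[1][5],
          );
        }
        echo($cache[$file]['title'].'/'.$cache[$file]['subject'].'/'.$cache[$file]['author'].'<br>');
      }
      file_put_contents($cachefile, json_encode($cache));
    }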

    I've begun experimenting with pdf.js, with the idea that I can load the basic details for all the files first (filename etc.) and then update them with JavaScript after the page has loaded.

    However, I clearly don't know enough about JavaScript to make this work. This is what I have so far, and I am very stuck. I've imported pdf.js from mozilla.github.io/pdf.js/build/pdf.js...

    function pdf_metadata(file_url,id){
      // The UMD build of pdf.js registers itself on window under this key
      var pdfjsLib = window['pdfjs-dist/build/pdf'];
      pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
      var loadingTask = pdfjsLib.getDocument(file_url);
      loadingTask.promise.then(function(pdf) {
        pdf.getMetadata().then(function(details) {
          console.log(details);
          // details is an object, not a string, so this doesn't render anything useful
          document.getElementById(id).innerHTML=details;
        }).catch(function(err) {
          console.log('Error getting meta data');
          console.log(err);
        });
      });
    }
    

    The line console.log(details); outputs an object to the console. From there I have no idea how to extract any data at all. Therefore document.getElementById(id).innerHTML=details; displays nothing.

    This is the object which is output to the console:

    [screenshot of the logged metadata object]


  2. Metadata aren’t encrypted like the rest of the PDF, so you can use file_get_contents(), find the pattern for the subject (<</Subject) and extract it using either a regex or a simple combination of strpos()/substr().
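
    For example, something roughly like this (untested; it assumes the /Subject value is stored as a plain text string in parentheses and not hex encoded):

    function pdf_subject($path){
      $raw = file_get_contents($path);
      // Find the /Subject key and take the text between the parentheses that follow it
      $pos = strpos($raw, '/Subject');
      if($pos === false){
        return '';
      }
      $start = strpos($raw, '(', $pos) + 1;
      $end = strpos($raw, ')', $start);
      return substr($raw, $start, $end - $start);
    }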
