
I’m converting a buffer of a PDF into a string to find the page count. Here is an example of the buffer string:

<<
/Type /Pages
/Count 2
/Kids[4 0 R 8 0 R]
>>

The page count is the number that follows /Count, which here comes right after /Type /Pages. Here is a different example:

<<
/Type /Pages
/MediaBox [ 0 0 612 792 ]
/Count 22
>>

Here /MediaBox sits between /Type /Pages and /Count. I am trying to come up with a regular expression that matches the page count in both scenarios, but I’m not sure where to start.
Would it make sense to match the three words in order and then extract the number, that is, match Type first, then Pages, then Count, and capture the number that follows?

2 Answers


  1. /\/Type \/Pages\b.*?^\/Count (\d+)/ms
    

    should do it, with capture group 1 holding the value of the count.

    Demo: https://regex101.com/r/5dk3eP/1 (hover the cursor over each element of the regular expression to see an explanation of what it does).
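
As a rough illustration, the pattern might be used like this in Python. This is only a sketch: the question does not name a language, so Python, the `page_counts` helper, and the sample string are all assumptions. The `m` and `s` flags become `re.M` and `re.S`, and the forward slashes need no escaping because Python patterns have no `/` delimiters.

```python
import re

# re.S lets "." cross newlines; re.M lets "^" match at the start of each line.
PAGES_COUNT = re.compile(r"/Type /Pages\b.*?^/Count (\d+)", re.S | re.M)


def page_counts(pdf_text: str) -> list[str]:
    """Return the digits captured in group 1 for every match (hypothetical helper)."""
    return PAGES_COUNT.findall(pdf_text)


# Sample taken from the question: /MediaBox sits between /Type /Pages and /Count.
sample = """<<
/Type /Pages
/MediaBox [ 0 0 612 792 ]
/Count 22
>>"""

print(page_counts(sample))  # ['22']
```

Note that this pattern expects /Type /Pages to appear before /Count, and /Count to start a line.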

  2. I would use a regex which looks for /Type and /Pages between << and >> and then captures the Count when they are both found. This will then be completely independent of the order of the tags. You can do this with a tempered greedy token which prevents matching beyond >>:

    <<(?=(?:(?!>>).)*\/Type)(?=(?:(?!>>).)*\/Pages)(?:(?!>>).)*\/Count\s+(\d+)
    

    For sample data of:

    <<
    /Type /Pages
    /Count 2
    /Kids[4 0 R 8 0 R]
    >>
    <<
    /Type /Pages
    /MediaBox [ 0 0 612 792 ]
    /Count 22
    >>
    <<
    /Type /Image
    /MediaBox [ 0 0 612 792 ]
    /Count 1
    >>
    1 0 obj <</Type /Pages /Kids [3 0 R ] /Count 1 >>
    2 0 obj <<   /Count 1   /Kids [ 5 0 R ]   /Type /Pages >>
    

    Across the four matches, group 1 will contain [2, 22, 1, 1]; the /Type /Image dictionary is skipped because it has no /Pages.

    Demo on regex101. Note you need the s regex flag to make . match newlines as well.
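
A hedged Python sketch of the same tempered greedy token approach follows. Python is an assumption, since the question does not name a language, and the sample string is a shortened version of the data above. `re.S` plays the role of the s flag.

```python
import re

PAGES_COUNT = re.compile(
    r"<<"
    r"(?=(?:(?!>>).)*/Type)"        # /Type occurs before the closing >>
    r"(?=(?:(?!>>).)*/Pages)"       # /Pages occurs before the closing >>
    r"(?:(?!>>).)*/Count\s+(\d+)",  # reach /Count without crossing >>
    re.S,                           # "." also matches newlines
)

# Shortened sample: two /Pages dictionaries and one /Image dictionary.
sample = """<<
/Type /Pages
/Count 2
/Kids[4 0 R 8 0 R]
>>
<<
/Type /Image
/MediaBox [ 0 0 612 792 ]
/Count 1
>>
2 0 obj <<   /Count 1   /Kids [ 5 0 R ]   /Type /Pages >>"""

counts = [int(n) for n in PAGES_COUNT.findall(sample)]
print(counts)  # [2, 1]
```

The /Image dictionary contributes nothing because the second lookahead cannot find /Pages before its closing >>, while the last dictionary matches even though /Count comes before /Type.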
