
I’m trying to extract the values Navn, Telefon, E-postadresse and Adresse from a text file. The text file was converted from a PDF, hence the blank lines between the lines of text. I’m using PHP’s preg_match_all to extract the values, but the loop only extracts the first value, Navn. The remaining fields are never matched. Can anyone point me in the right direction?
Here is the text file content:

    Kontaktperson:

    Navn: Johan Wathne

    Telefon: 99566530

    99566530

    E-postadresse: johancqöhwheno

My PHP code looks like this, where the variable $fileContent contains the text above:

if (preg_match('/Kontaktperson:\s*(.*?)\n\n/ms', $fileContent, $matches)) {
    $kontaktpersonSection = trim($matches[1]);

    // Extract Tiltakshaver fields
    if (preg_match_all('/(Navn|Telefon|E-postadresse|Adresse):\s*([^\n]+)\s*(?:\n\n|$)/', $kontaktpersonSection, $kontaktMatches, PREG_SET_ORDER)) {
        $kontaktpersonInfo = "Kontaktperson\n";
        $currentkontakt = "";
        foreach ($kontaktMatches as $kontaktMatch) {
            if ($kontaktMatch[1] === "Navn") {
                $currentkontakt = "Navn";
                $kontaktValue = $kontaktMatch[2];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            } elseif ($kontaktMatch[1] === "Telefon") {
                $currentkontakt = "Telefon";
                $kontaktValue = explode("\n", $kontaktMatch[2])[0];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            } elseif ($kontaktMatch[1] === "E-postadresse") {
                $currentkontakt = "E-postadresse";
                $kontaktValue = $kontaktMatch[2];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            } elseif ($kontaktMatch[1] === "Adresse") {
                $currentkontakt = "Adresse";
                $kontaktValue = $kontaktMatch[2];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            }
        }
        $extractedInfo .= $kontaktpersonInfo;
    }
}

2 Answers


  1. As I mentioned in a comment, your first regular expression isn’t
    capturing everything you want: the lazy (.*?) stops at the first blank
    line, so only the Navn line reaches the second pattern.

    I would change your regular expression to this:

    /
    # Opening of contact person:
    Kontaktperson:
    \s*   # Ignore spaces
    (.*?) # Capture all the fields.
    # It should end with one of these:
    (?: # Non-capturing group
      $ # End of file
      | # or
      \n{3} # 3 new lines
      | # or
      \n(?=Kontaktperson:) # A new line followed by Kontaktperson:
    )
    /gsx
    

    See it in action, with an explanation: https://regex101.com/r/YJMYhJ/1

    I don’t know what marks the end of the contact person section,
    but I assume it could be:

    • The end of the file: $ (don’t use the m flag, otherwise $ also
      matches at the end of every line)

    • At least 3 new lines (as you already had 2 lines between fields).

    • A new line followed by a new opening contact person section. Here I
      use a positive lookahead to avoid "eating" the Kontaktperson: text
      so that we can match the next occurrences.

    I used the x flag so that you can put comments in your pattern.
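
    For illustration, here is a minimal PHP sketch of how the pattern above
    could be wired in. It rests on a few assumptions: the file uses "\n" line
    endings, the function name extractContacts() and the line-anchored field
    pattern are my own and not from the question, and johan@example.com is a
    placeholder because the address in the question is garbled. Note that PHP
    has no g modifier (that flag comes from regex101); preg_match_all() does
    that job, so only s and x are kept.

    ```php
    <?php
    // Hypothetical helper: returns one array of fields per Kontaktperson section.
    function extractContacts(string $fileContent): array
    {
        // Section pattern from the answer, with PHP-compatible modifiers (s, x).
        $sectionPattern = '/
            Kontaktperson:
            \s*                       # skip whitespace after the heading
            (.*?)                     # lazily capture the fields
            (?:
                $                     # end of the input
              | \n{3}                 # or three consecutive newlines
              | \n(?=Kontaktperson:)  # or a newline right before the next section
            )
        /sx';

        // Assumed field pattern: a known label at the start of a line,
        // then everything up to the end of that line.
        $fieldPattern = '/^(Navn|Telefon|E-postadresse|Adresse):[ \t]*(.*)$/m';

        $contacts = [];
        preg_match_all($sectionPattern, $fileContent, $sections, PREG_SET_ORDER);
        foreach ($sections as $section) {
            $fields = [];
            preg_match_all($fieldPattern, $section[1], $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $fields[$match[1]] = trim($match[2]);
            }
            $contacts[] = $fields;
        }
        return $contacts;
    }

    // Sample input modelled on the question (placeholder e-mail address).
    $fileContent = "Kontaktperson:\n\nNavn: Johan Wathne\n\nTelefon: 99566530\n\n"
        . "99566530\n\nE-postadresse: johan@example.com\n";

    print_r(extractContacts($fileContent));
    ```

    The stray 99566530 line carries no label, so the field pattern skips it.
    If the converted file turns out to use "\r\n" line endings, normalising
    them first with str_replace("\r\n", "\n", $fileContent) is probably the
    simplest fix.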

  2. You don’t need to use a regex if your input is well structured.

    You can use

    $blob = explode("\n", $fileContent);

    to break $fileContent into an array, then use strpos() and substr() to extract everything to the right of the colon for the appropriate lines.
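
    A rough sketch of that idea, assuming "\n" line endings and that each
    label sits at the start of its own line. The function name parseFields()
    and the $labels parameter are hypothetical, and the e-mail address is a
    placeholder because the one in the question is garbled.

    ```php
    <?php
    // Hypothetical helper: collects "Label: value" lines for the wanted labels.
    function parseFields(string $fileContent, array $labels): array
    {
        $fields = [];
        foreach (explode("\n", $fileContent) as $line) {
            $line = trim($line);
            $colon = strpos($line, ':');
            if ($colon === false) {
                continue; // blank line, or a value line without a label
            }
            $label = substr($line, 0, $colon);
            if (in_array($label, $labels, true)) {
                // Everything to the right of the first colon is the value.
                $fields[$label] = trim(substr($line, $colon + 1));
            }
        }
        return $fields;
    }

    // Sample input modelled on the question (placeholder e-mail address).
    $fileContent = "    Kontaktperson:\n\n    Navn: Johan Wathne\n\n"
        . "    Telefon: 99566530\n\n    99566530\n\n"
        . "    E-postadresse: johan@example.com\n";

    print_r(parseFields($fileContent, ['Navn', 'Telefon', 'E-postadresse', 'Adresse']));
    ```

    Splitting on the first colon only means a value that itself contains a
    colon is kept whole. If the file holds several contact sections, a later
    label overwrites an earlier one in this sketch, so you would need to
    start a new array each time a Kontaktperson: line appears.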
