I’m trying to extract the values Navn, Telefon, E-postadresse and Adresse from a text file. The text file was converted from a PDF, hence the blank lines between the lines of text. I’m using PHP’s preg_match_all to extract the values, but the loop only extracts the first value, Navn; the other fields never match. Can anyone point me in the right direction?
Here is the text file content:
Kontaktperson:
Navn: Johan Wathne
Telefon: 99566530
99566530
E-postadresse: johancqöhwheno
My PHP code looks like this where the variable $fileContent contains the text above:
if (preg_match('/Kontaktperson:\s*(.*?)\n\n/ms', $fileContent, $matches)) {
    $kontaktpersonSection = trim($matches[1]);
    // Extract Tiltakshaver fields
    if (preg_match_all('/(Navn|Telefon|E-postadresse|Adresse):\s*([^\n]+)\s*(?:\n\n|$)/', $kontaktpersonSection, $kontaktMatches, PREG_SET_ORDER)) {
        $kontaktpersonInfo = "Kontaktperson\n";
        $currentkontakt = "";
        foreach ($kontaktMatches as $kontaktMatch) {
            if ($kontaktMatch[1] === "Navn") {
                $currentkontakt = "Navn";
                $kontaktValue = $kontaktMatch[2];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            } elseif ($kontaktMatch[1] === "Telefon") {
                $currentkontakt = "Telefon";
                $kontaktValue = explode("\n", $kontaktMatch[2])[0];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            } elseif ($kontaktMatch[1] === "E-postadresse") {
                $currentkontakt = "E-postadresse";
                $kontaktValue = $kontaktMatch[2];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            } elseif ($kontaktMatch[1] === "Adresse") {
                $currentkontakt = "Adresse";
                $kontaktValue = $kontaktMatch[2];
                $kontaktpersonInfo .= "$currentkontakt: $kontaktValue\n";
            }
        }
        $extractedInfo .= $kontaktpersonInfo;
    }
}
2 Answers
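As I mentioned in a comment, your first regular expression isn’t capturing everything you want, which is why the later fields are missing. The lazy (.*?) stops at the first blank line, so the captured section contains only the Navn line.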
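To make the problem visible, here is a small sketch that runs the original section pattern on its own. The sample string is an assumption: it mimics the converted PDF by placing a blank line after every line of text.

```php
<?php
// Assumed layout: every text line is followed by a blank line.
$fileContent = "Kontaktperson:\n\nNavn: Johan Wathne\n\nTelefon: 99566530\n\n99566530\n\nE-postadresse: johancqöhwheno\n";

// The original section pattern. After \s* has consumed the whitespace
// following the header, the lazy (.*?) stops at the FIRST "\n\n" it meets.
preg_match('/Kontaktperson:\s*(.*?)\n\n/ms', $fileContent, $matches);

// Only the Navn line ends up in the captured section.
echo $matches[1], PHP_EOL;
```

Since the section handed to preg_match_all contains only the Navn line, the second pattern never sees Telefon or E-postadresse.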
I would change your regular expression; you can see the replacement in action, with an explanation, here: https://regex101.com/r/YJMYhJ/1
I don’t know what marks the end of the contact person section, but I assume it could be:
The end of the file: $ (don’t use the m flag, otherwise $ matches the end of a line).
At least 3 new lines (as you already had 2 lines between fields).
A new line followed by a new opening contact person section. Here I use a positive lookahead to avoid "eating" the Kontaktperson: text, so that we can match the next occurrences.
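A rough sketch of a pattern built from those three end conditions follows. It is not necessarily the exact pattern behind the regex101 link; the variable name $sectionPattern and the sample string (blank line between fields, three newlines before unrelated text) are assumptions for illustration.

```php
<?php
// Assumed sample: two newlines between fields, three before unrelated text.
$fileContent = "Kontaktperson:\n\nNavn: Johan Wathne\n\nTelefon: 99566530\n\n99566530\n\nE-postadresse: johancqöhwheno\n\n\nAnnen tekst\n";

$sectionPattern = '/
    Kontaktperson:\s*           # section header and the whitespace after it
    (.*?)                       # section body, as short as possible
    (?:
        \n{3,}                  # at least three newlines
      | \n(?=Kontaktperson:)    # a newline before the next section, not consumed
      | \s*$                    # end of the file (no m flag)
    )
/xs';

if (preg_match($sectionPattern, $fileContent, $m)) {
    // The body now spans all fields instead of stopping after Navn.
    echo $m[1], PHP_EOL;
}
```

The captured body should then be usable as input to your existing preg_match_all step.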
I used the x flag so that you can put comments in your pattern.

You don’t need to use a regex if your input is well structured.
You can use $blob = explode("\n", $fileContent); to break $fileContent into an array, then use strpos() and substr() to extract everything to the right of the colon for the appropriate lines.
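A minimal sketch of that approach, assuming the same blank-line layout as in the question. The $wanted list and the $fields array are names introduced here for illustration.

```php
<?php
// Assumed sample input with a blank line between fields.
$fileContent = "Kontaktperson:\n\nNavn: Johan Wathne\n\nTelefon: 99566530\n\n99566530\n\nE-postadresse: johancqöhwheno\n";

$wanted = ['Navn', 'Telefon', 'E-postadresse', 'Adresse'];
$fields = [];

$blob = explode("\n", $fileContent);
foreach ($blob as $line) {
    $pos = strpos($line, ':');
    if ($pos === false) {
        continue; // blank line, or a line without a label
    }
    $label = trim(substr($line, 0, $pos));
    if (in_array($label, $wanted, true)) {
        // Everything to the right of the first colon is the value.
        $fields[$label] = trim(substr($line, $pos + 1));
    }
}

print_r($fields);
```

Lines without a colon, such as the repeated phone number, are skipped, and the Kontaktperson: header is ignored because its label is not in $wanted. If a file contains several contact sections, later values would overwrite earlier ones in this sketch.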