I am trying to work out a regular expression, using PHP, to fix a CSV file. The format of the CSV file should be like this:
id, text field1, text field2
1,some text,another text
2,some text,another text
3,some text,another text
The problem is that "text field1" sometimes contains a newline character, and the content of the CSV file then looks like this:
id, text field1, text field2
1,some text,another text
2,some
text,another text
3,some text,another text
As a result, when reading the file line by line, or exploding using new line to fill an array line by line, I get invalid records. I fixed the above problem using this code:
<?php
$c= file_get_contents($myFile);
$c= preg_replace( '/\n([^0-9])/is', "\n~~$1", $c );
$c= str_replace( "\n~~", " ", $c );
?>
The above regex checks if character after a newline is not numeric, then it adds a ~~ sign there, and I replace the newline and ~~ sign with a white space.
Problem is, sometimes, the "text field1" has numbers in it, and newline appears just before them, like this:
id, text field1, text field2
1,some text,another text
2,some
1999-06-21 text,another text
3,some text,another text
My regex is not putting a ~~ sign in this case and I get a broken csv file.
How can I modify this above regex to fix this problem? I was hoping for a regex where it can check for "non numeric value (any length) followed by comma sign". Please note that I got over 100000 records in csv file, so if using regex method, then regex must check for non-numeric value of any length.
Something like this:
$c= preg_replace( '/\n([^0-9]*,)/is', "\n~~$1", $c);
But this above regex does not work, and I don’t know that much regex to get this working.
If there is any other way of fixing this csv file, even that is ok. It need not be a regex solution.
Thanks
2 Answers
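You can use a negative lookahead for this:
\n(?!\d+,)
(with a single space as the replacement, so the intermediate ~~ substitution is no longer needed). It matches any newline that is not immediately followed by a number and a comma. In your code, this pattern replaces the two existing substitutions with a single preg_replace call.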
Note that this solution can still fail if a newline inside a cell is immediately followed by a number and a comma. Handling that case properly requires knowing the exact expected format of each line and validating it, rather than looking only at the newline.
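As a minimal sketch of the lookahead approach, assuming the sample data from the question and "\n" line endings (the variable names $fixed, $tricky and $stillBroken are made up for illustration):

```php
<?php
// Sample data copied from the question: record 2 is broken across two lines.
$c = "id, text field1, text field2\n"
   . "1,some text,another text\n"
   . "2,some\n"
   . "1999-06-21 text,another text\n"
   . "3,some text,another text";

// Replace every newline that is NOT followed by "<digits>," with a space.
// "1999-06-21" starts with digits, but they are followed by "-" and not ",",
// so that newline is treated as part of the text field and removed.
$fixed = preg_replace('/\n(?!\d+,)/', ' ', $c);

// The caveat: here the stray newline IS followed by "<digits>,",
// so it looks like the start of a new record and is left untouched.
$tricky = "2,some\n42,another text";
$stillBroken = preg_replace('/\n(?!\d+,)/', ' ', $tricky);
```

In the first case record 2 is joined back into "2,some 1999-06-21 text,another text", while the second string comes back unchanged. Note that a trailing newline at the end of the file would also be turned into a space by this pattern.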
I wouldn't read the file line by line, since it is perfectly valid for a string
to contain newlines. What is missing in this CSV file is double quotes around
the strings. What happens if text field1 or text field2 has a comma inside it?
It will break your CSV even more.
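To illustrate why the quotes matter, here is a small sketch, assuming the source could be changed to quote its text fields. The sample data is hypothetical (the question's rows, with quotes and an extra comma added), and it is read from an in-memory stream rather than a real file:

```php
<?php
// Hypothetical, correctly quoted version of the data from the question.
$csv = "id,text field1,text field2\n"
     . "1,\"some text\",\"another text\"\n"
     . "2,\"some\n1999-06-21 text\",\"another, text\"\n"
     . "3,\"some text\",\"another text\"\n";

// Put the data in a memory stream so fgetcsv() can read it like a file.
$fh = fopen('php://memory', 'r+');
fwrite($fh, $csv);
rewind($fh);

// fgetcsv() understands quoted fields, so the embedded newline and comma
// in record 2 stay inside their fields instead of splitting the record.
$rows = [];
while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($fh);
```

With quoting in place, the parser returns four rows (header plus three records) and no regex repair is needed.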
If you cannot fix the source so that it produces valid CSV, we could split the
content on the commas (assuming the text itself contains none). Since we know
there are only 3 fields, each record should ideally match a fixed pattern of an
id followed by two comma-separated text fields.
You can test the pattern here: https://regex101.com/r/YLnNOY/2
You can run the PHP demo here: https://onlinephp.io/c/8aa48
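A minimal sketch of this three-field idea follows. It is not necessarily the exact pattern behind the links above. The pattern is my own assumption: it presumes a numeric id, no commas inside the text, "\n" line endings, and that only the first text field can contain a newline. The names $pattern and $records are made up for illustration:

```php
<?php
// Sample data copied from the question: record 2 is broken across two lines.
$c = "id, text field1, text field2\n"
   . "1,some text,another text\n"
   . "2,some\n"
   . "1999-06-21 text,another text\n"
   . "3,some text,another text";

// Assumed record shape: <digits>,<field1 without commas>,<field2 on one line>.
// [^,]* may span newlines, so a field1 broken over several lines still matches.
$pattern = '/^(\d+),([^,]*),([^,\r\n]*)$/m';
preg_match_all($pattern, $c, $matches, PREG_SET_ORDER);

$records = [];
foreach ($matches as $m) {
    // Collapse any newline inside field1 into a single space.
    $records[] = [$m[1], preg_replace('/\R/', ' ', $m[2]), $m[3]];
}
```

The header line is skipped because it does not start with digits, and each remaining record is captured by field count rather than by line, so the "1999-06-21" case no longer matters.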