skip to Main Content

I am trying to figure out the regex expression and using PHP to fix a csv file. The format of the csv file should be like this:

id, text field1, text field2
1,some text,another text
2,some text,another text
3,some text,another text

Problem is, sometimes, the "text field1" has a new line character in it, and the content of the csv files becomes like this:

id, text field1, text field2
1,some text,another text
2,some 
text,another text
3,some text,another text

As a result, when reading the file line by line, or exploding using new line to fill an array line by line, I get invalid records. I fixed the above problem using this code:

<?php
$c= file_get_contents($myFile);
$c= preg_replace( '/n([^0-9])/is', "n~~$1", $c );
$c= str_replace( "n~~", " ", $c );
?>

The above regex checks if character after a newline is not numeric, then it adds a ~~ sign there, and I replace the newline and ~~ sign with a white space.

Problem is, sometimes, the "text field1" has numbers in it, and newline appears just before them, like this:

id, text field1, text field2
1,some text,another text
2,some 
1999-06-21 text,another text
3,some text,another text

My regex is not putting a ~~ sign in this case and I get a broken csv file.

How can I modify this above regex to fix this problem? I was hoping for a regex where it can check for "non numeric value (any length) followed by comma sign". Please note that I got over 100000 records in csv file, so if using regex method, then regex must check for non-numeric value of any length.

Something like this:

$c= preg_replace( ‘/n([^0-9]*,)/is’, "n~~$1", $c);

But this above regex does not work, and I don’t know that much regex to get this working.

If there is any other way of fixing this csv file, even that is ok. It need not be a regex solution.

Thanks

2

Answers


  1. You can use negative lookahead for this occasion: n(?!d+,) (with replacement , to skip to substitutions).

    It matches any newline, that is not immediately followed by number,.

    In result your code would look like this:

    <?php
    $c= file_get_contents($myFile);
    $c= preg_replace( '/n(?!d+,)/g', " ", $c );
    ?>
    

    Demo here.

    It is important to notice, that this solution is still susceptible to error, in case if newline in your cell is immediately followed by number and comma, but to correct this problem in depth, will need to know exact expected format of line and check it, instead of just newline.

    Login or Signup to reply.
  2. I wouldn’t read line by line, as it is totally right that a string can have
    some new lines in it. But what’s missing in this CSV file is some double
    quotes around the strings
    . What happens if text field1 or text field2
    has a comma inside it? It will break your CSV even more.

    As you cannot correct the CSV source to output correctly, we could split the
    content based on the commas (if we consider you haven’t got some in the text).
    As you know that we only have 3 fields, we know that it should ideally match
    with something like this:

    (d+),([^,]*),([^,]*?)(?:r?n|$)
    

    You can test it here: https://regex101.com/r/YLnNOY/2

    In PHP:

    <?php
    
    $regex = '/(d+),([^,]*),([^,]*?)(?:r?n|$)/';
    
    $wrong_csv = 'id, text field1, text field2
    1,some text,another text
    2,some text,another text
    3,some text,another text
    1,some text,another text
    2,some 
    text,another text
    3,some text,another text
    1,some text,another text
    2,some 
    1999-06-21 text,another text
    3,some text,another text';
    
    $replacement = '$1, "$2", "$3"' . "n";
    
    $corrected_csv = preg_replace(
        $regex,
        $replacement,
        $wrong_csv
    );
    
    print $corrected_csv;
    

    You can run the PHP demo here: https://onlinephp.io/c/8aa48

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search