$string = 'Audi MODEL 80 ENGINE 1.9 TDi';
list($make,$model,$engine) = preg_split('/( MODEL | ENGINE )/',$string);

Anything before "MODEL" would be considered "MAKE string".
Anything before "ENGINE" will be considered "MODEL string".
Anything after "ENGINE" is the "ENGINE string".

But we usually have more information in this string.

//  possible variations:
$string = 'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996';

$string = 'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 NOTE this engine needs custom stage GEAR auto';    

$string = 'Audi MODEL 80 ENGINE 1.9 TDi GEAR man YEAR 1996';

$string = 'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 DRIVE 2wd';

MODEL and ENGINE are always present, and always come at the start of the string.

The rest (POWER, TORQUE, GEAR, DRIVE, YEAR, NOTE) may vary, both in order and in whether they are present at all.

Since we can't know for sure how the ENGINE string ends, or which of the other keywords will come right after it, I thought it would be possible to create an array with the keywords,
then search the string for the first occurrence of a word that matches one of the keywords in the array.

I do need to keep the matched word.

Another way of putting this might be: "How to split the string on/before each occurrence of words in array"

3 Answers


  1. To keep the "bits" intact with the keyword included, you can use preg_split with a lookahead that will split on a space followed by any one of your keywords. For example:

    $string = 'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996';
    
    $bits = preg_split('~\s+(?=(MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE)\b)~', $string);
    

    Results in:

    array(8) {
        [0] => string(4) "Audi"
        [1] => string(8) "MODEL 80"
        [2] => string(14) "ENGINE 1.9 TDi"
        [3] => string(10) "POWER 90Hk"
        [4] => string(12) "TORQUE 202Nm"
        [5] => string(8) "GEAR man"
        [6] => string(9) "DRIVE 2wd"
        [7] => string(9) "YEAR 1996"
    }
    

    If you want to parse these into key/value pairs, it’s simple:

    // Initialize array; get the "unnamed" make:
    $data = [
        'MAKE' => array_shift($bits),
    ];
    
    // Iterate any other known keys found:
    foreach($bits as $bit) {
        $pair = explode(' ', $bit, 2);
        $data[$pair[0]] = $pair[1];
    }
    

    Results in:

    array(8) {
        ["MAKE"] => string(4) "Audi"
        ["MODEL"] => string(2) "80"
        ["ENGINE"] => string(7) "1.9 TDi"
        ["POWER"] => string(4) "90Hk"
        ["TORQUE"] => string(5) "202Nm"
        ["GEAR"] => string(3) "man"
        ["DRIVE"] => string(3) "2wd"
        ["YEAR"] => string(4) "1996"
    }
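
    Since the question asks about keeping the keywords in an array, the same lookahead split can be assembled from one. The following is only a sketch: the function name `parseCarString()` is hypothetical, and it assumes every keyword is a plain word that should match as a whole word.

    ```php
    <?php
    // Hypothetical helper: builds the lookahead pattern from a keyword array.
    function parseCarString(string $string, array $keywords): array
    {
        // Escape each keyword for use inside the pattern, then join with pipes.
        $alternation = implode('|', array_map(fn($k) => preg_quote($k, '~'), $keywords));

        // Split on whitespace that is followed by one of the keywords as a whole word.
        $bits = preg_split('~\s+(?=(?:' . $alternation . ')\b)~', trim($string));

        // The first chunk has no keyword in front of it; treat it as the make.
        $data = ['MAKE' => array_shift($bits)];

        foreach ($bits as $bit) {
            // Limit 2 keeps multi-word values intact; pad in case a keyword has no value.
            [$key, $value] = array_pad(preg_split('~\s+~', $bit, 2), 2, '');
            $data[$key] = $value;
        }

        return $data;
    }

    $keywords = ['MODEL', 'ENGINE', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];
    $result = parseCarString(
        'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 NOTE this engine needs custom stage GEAR auto',
        $keywords
    );
    var_export($result);
    ```

    Adding or removing a keyword then only means editing the array, not the pattern.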
    
  2. If you’d prefer a non-regex method, you could also break the string into individual tokens (words) and build an array. The code below assumes tokens are separated by single spaces; if that is a problem, the whitespace could be normalized first with a replace.

    // The first group to assign un-prefixed items to
    $firstGroup = 'MAKE';
    
    // Every possible word grouping
    $wordList = ['ENGINE', 'MODEL', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];
    
    // Test string
    $string = 'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996';
    
    // Key/value of group name and values
    $groups = [];
    
    // Default to the first group
    $currentWord = $firstGroup;
    foreach (explode(' ', $string) as $word) {
    
        // Found a special word, reset and continue the hunt
        if (in_array($word, $wordList)) {
            $currentWord = $word;
            continue;
        }
    
        // Assign. The subsequent for loop could be removed by just doing string concatenation here instead
        $groups[$currentWord][] = $word;
    }
    
    // Optional, join each back into a string
    foreach ($groups as $key => $values) {
        $groups[$key] = implode(' ', $values);
    }
    
    var_dump($groups);
    

    Outputs:

    array(8) {
      ["MAKE"]=>
      string(4) "Audi"
      ["MODEL"]=>
      string(2) "80"
      ["ENGINE"]=>
      string(7) "1.9 TDi"
      ["POWER"]=>
      string(4) "90Hk"
      ["TORQUE"]=>
      string(5) "202Nm"
      ["GEAR"]=>
      string(3) "man"
      ["DRIVE"]=>
      string(3) "2wd"
      ["YEAR"]=>
      string(4) "1996"
    }
    

    Demo: https://3v4l.org/D4pvl
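
    The whitespace assumption mentioned above could be relaxed by tokenizing on any run of whitespace instead of a single space. This is a rough sketch rather than a drop-in replacement: the messy input string is made up for illustration, and the values are concatenated directly so the second loop is not needed.

    ```php
    <?php
    $wordList = ['MODEL', 'ENGINE', 'POWER', 'TORQUE', 'GEAR', 'DRIVE', 'YEAR', 'NOTE'];

    // Made-up input with leading/trailing spaces, doubled spaces and a tab
    $string = "  Audi  MODEL 80\tENGINE 1.9  TDi   YEAR 1996 ";

    // Split on any run of whitespace and drop empty tokens
    $tokens = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

    $groups = [];
    $currentWord = 'MAKE';
    foreach ($tokens as $word) {
        // Found a keyword: later tokens belong to it
        if (in_array($word, $wordList, true)) {
            $currentWord = $word;
            continue;
        }

        // Append to the current group, separating words with a single space
        $groups[$currentWord] = isset($groups[$currentWord])
            ? $groups[$currentWord] . ' ' . $word
            : $word;
    }

    var_export($groups);
    ```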

  3. If you’d like to have a dynamic associative array:

    1. Prepend MAKE to the string
    2. Use preg_match_all() to capture pairs of labels and values in the formatted string
    3. Use array_column() to restructure the columns of matches into an associative array.

    Code: (Demo)

    $strings = [
        'Audi MODEL 80 ENGINE 1.9 TDi POWER 90Hk TORQUE 202Nm GEAR man DRIVE 2wd YEAR 1996',
        'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 NOTE this engine needs custom stage GEAR auto',
        'Audi MODEL 80 ENGINE 1.9 TDi GEAR man YEAR 1996',
        'Audi MODEL 80 ENGINE 1.9 TDi YEAR 1996 DRIVE 2wd'
    ];
    
    foreach ($strings as $string) {
        preg_match_all('/\b([A-Z]+)\s+(\S+(?:\s+\S+)*?)(?=$|\s+[A-Z]+\b)/', 'MAKE ' . $string, $m, PREG_SET_ORDER);
        var_export(array_column($m, 2, 1));
        echo "\n---\n";
    }
    

    Output:

    array (
      'MAKE' => 'Audi',
      'MODEL' => '80',
      'ENGINE' => '1.9 TDi',
      'POWER' => '90Hk',
      'TORQUE' => '202Nm',
      'GEAR' => 'man',
      'DRIVE' => '2wd',
      'YEAR' => '1996',
    )
    ---
    array (
      'MAKE' => 'Audi',
      'MODEL' => '80',
      'ENGINE' => '1.9 TDi',
      'YEAR' => '1996',
      'NOTE' => 'this engine needs custom stage',
      'GEAR' => 'auto',
    )
    ---
    array (
      'MAKE' => 'Audi',
      'MODEL' => '80',
      'ENGINE' => '1.9 TDi',
      'GEAR' => 'man',
      'YEAR' => '1996',
    )
    ---
    array (
      'MAKE' => 'Audi',
      'MODEL' => '80',
      'ENGINE' => '1.9 TDi',
      'YEAR' => '1996',
      'DRIVE' => '2wd',
    )
    ---
    

    This is not a new concept/technique. The only adjustment to make is how to identify the keys/labels in the original string. Instead of [A-Z]+ you may wish to explicitly name each label and separate them in the pattern with pipes; that way an all-caps word inside a value is not mistaken for a label.
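
    As a sketch of the explicit-label variant, the pattern below names each label in both the capture group and the lookahead. The `$labels` variable and the input string, which contains the all-caps words "TDI" and "ECU" inside values, are made up for illustration.

    ```php
    <?php
    // Assumed label list; MAKE is included because it is prepended to the input
    $labels = 'MAKE|MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE';

    // Made-up input with all-caps words inside the ENGINE and NOTE values
    $string = 'Audi MODEL 80 ENGINE 1.9 TDI YEAR 1996 NOTE needs ECU remap GEAR auto';

    // Capture a label, then lazily capture words until the end or the next known label
    preg_match_all(
        '/\b(' . $labels . ')\s+(\S+(?:\s+\S+)*?)(?=$|\s+(?:' . $labels . ')\b)/',
        'MAKE ' . $string,
        $m,
        PREG_SET_ORDER
    );

    // Use capture group 1 as keys and capture group 2 as values
    $data = array_column($m, 2, 1);
    var_export($data);
    ```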


    Alternatively, instead of using a regex to parse the string, you could manipulate the string into a standardized format that a native PHP function can parse. (Demo)

    foreach ($strings as $string) {
        var_export(
            parse_ini_string(
                preg_replace(
                    '~\s*\b(MAKE|MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE)\s+~',
                    "\n$1=",
                    'MAKE ' . $string
                )
            )
        );
        echo "\n---\n";
    }
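
    One caveat with the INI approach: in its default scanner mode, parse_ini_string() interprets certain bare values (for example yes, no, on, off, true, false, null) instead of returning them as written. If the values could ever contain such words, passing INI_SCANNER_RAW as the third argument may be safer. A minimal sketch, with a made-up input string:

    ```php
    <?php
    // Made-up input where the NOTE value is a word the default scanner would interpret
    $string = 'Audi MODEL 80 ENGINE 1.9 TDi GEAR auto NOTE none';

    // Turn "LABEL value" pairs into "LABEL=value" lines
    $ini = preg_replace(
        '~\s*\b(MAKE|MODEL|ENGINE|POWER|TORQUE|GEAR|DRIVE|YEAR|NOTE)\s+~',
        "\n$1=",
        'MAKE ' . $string
    );

    // INI_SCANNER_RAW asks the parser not to interpret option values
    $data = parse_ini_string($ini, false, INI_SCANNER_RAW);
    var_export($data);
    ```

    Values containing characters that are special in INI syntax (such as a semicolon) would still need separate handling.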
    