skip to Main Content

I have been trying for over a week to replace thousands of grammar errors in a data feed file of 103,927 products.

I’m stuck on how to put a space between a lower case letter or number, period, [NEED SPACE HERE], and capital letter.

My (bad) code only replaces the first instance in the product description but I need it to replace every instance in every record.

I can’t simply replace every period with period space because the descriptions have decimal values and http addresses.

I have some experience with MYSQL programming but I am a novice on regular expression (I just don’t get them).

I have tried the following code…tested and verified REGEXP string at regex101 and notepad++:

SELECT REPLACE(`desc`, REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.[A-Z]'), REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.\s[A-Z]')) FROM `build_datafeed` WHERE `desc` IS NOT NULL

It only modifies the first instance of a period without a space after it. I.E. use.This is changed to ushis

Flustered but not defeated, I tried a different approach:

SELECT REPLACE(`desc`, REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.[A-Z]'), CONCAT(REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.'), REGEXP_SUBSTR(`desc`, '\s[A-Z]'))) FROM `build_datafeed` WHERE `desc` IS NOT NULL

Same as the last…only the first instance got modified. I.E. use.This became use. uhis…some progress but not quite what I need.

I have tried various combinations of REPLACE, SUBSTR, REGEXP_REPLACE, and REGEXP_SUBSTR. I even attempted using PHP but that really slowed things down.

Any ideas? Thanks.

Environment:
Server Host: localhost (127.0.0.1)
Server OS: Windows NT DESKTOP-4B8U62U 10.0 build 19045 (Windows 10) AMD64
Server Date: 06/07/2024 11:39:28  
Server Up Time: Unsupported
HTTP Server: Apache/2.4.54 (Win64) OpenSSL/1.1.1p PHP/8.0.23
PHP Version: 8.0.23 (Zend: 4.0.23)
PHP File Uploads: On
Upload Max Size: 512M
PHP Memory Limit: -1
POST Max Size: 1024M
Database Engine: MySQL 10.4.25-MariaDB
Database Host: localhost (127.0.0.1)
Database Date: 06/07/2024 11:39:28
Database Data Size: 361,500 kB
Database Index Size: 50,015 kB
MySQL Slow Query Log Status: Off
MySQL Mode: NO_ZERO_IN_DATE, NO_ZERO_DATE, NO_ENGINE_SUBSTITUTION

3

Answers


  1. We can use REGEXP_REPLACE() as follows:

    REGEXP_REPLACE('Hello.World', '([a-z0-9])\.([A-Z])', '$1. $2');
    -- Hello. World
    

    The regex pattern here will match:

    • ([a-z0-9]) a single lowercase letter or number (and capture in $1)
    • . literal dot
    • ([A-Z]) an uppercase letter (and capture in $2)

    We then replace with $1. $2 to introduce a space after the period.

    Demo

    Note: The REGEXP_REPLACE() function follows the case sensitivity of the underlying collation of the table/database. For case insensitive collations, the above answer would also target substrings such as hello.world. For case sensitive and binary collations, the answer as given above would work.

    If you are using a case insensitive collation, you may still force a case sensitive regex search by using the (?-i) flag:

    REGEXP_REPLACE('Hello.World', '(?-i)([a-z0-9])\.([A-Z])', '$1. $2');
    
    Login or Signup to reply.
  2. If I understand what you want correctly, you want to use REGEXP_REPLACE with capture groups. The following will translate a string like "a.B c.D 3.F" to "a. B c. D 3. F".

    SELECT REGEXP_REPLACE(`desc`, '([a-z0-9]\.)([A-Z])', '$1 $2') FROM `build_datafeed` WHERE `desc` IS NOT NULL
    
    Login or Signup to reply.
  3. I understand MariaDB supports the PCRE regex engine. One therefore may replace zero-width matches of the following regular expression with a space.

    (?<=[a-z0-9].)(?=[A-Z])
    

    Demo


    (?<=[a-z0-9].) is a positive lookbehind ((?<=...)) that asserts the match must be preceded by a period that is in turn preceded by character in the character class, namely, a lowercase letter or digit.

    (?=[A-Z]) is a positive lookahead ((?=...)) that asserts the match must be followed by a character in the character class, namely, an uppercase letter.

    If, for example, the string were "a.A", the expression would match the zero-width string between "a." and "A", with "a. A" being returned after the space is inserted.


    It may be necessary to write \. rather than ., but regex101 does not permit that when testing regular expressions with the PCRE engine.


    Alternatively, one may replace the positive lookbehind with

    [a-z0-9].K
    

    so that the expression becomes

    [a-z0-9].K(?=[A-Z])
    

    (again, perhaps the backslashes may need to be doubled). The token K causes the start of the match to be reset to the current string location and for all previously-consumed characters to be discarded.

    Demo

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search