I have been trying for over a week to replace thousands of grammar errors in a data feed file of 103,927 products.
I’m stuck on how to put a space between a lower case letter or number, period, [NEED SPACE HERE], and capital letter.
My (bad) code only replaces the first instance in the product description but I need it to replace every instance in every record.
I can’t simply replace every period with period space because the descriptions have decimal values and http addresses.
I have some experience with MYSQL programming but I am a novice on regular expression (I just don’t get them).
I have tried the following code…tested and verified REGEXP string at regex101 and notepad++:
SELECT REPLACE(`desc`, REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.[A-Z]'), REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.\s[A-Z]')) FROM `build_datafeed` WHERE `desc` IS NOT NULL
It only modifies the first instance of a period without a space after it. I.E. use.This is changed to ushis
Flustered but not defeated, I tried a different approach:
SELECT REPLACE(`desc`, REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.[A-Z]'), CONCAT(REGEXP_SUBSTR(`desc`, '[A-Za-z0-9]\.'), REGEXP_SUBSTR(`desc`, '\s[A-Z]'))) FROM `build_datafeed` WHERE `desc` IS NOT NULL
Same as the last…only the first instance got modified. I.E. use.This became use. uhis…some progress but not quite what I need.
I have tried various combinations of REPLACE, SUBSTR, REGEXP_REPLACE, and REGEXP_SUBSTR. I even attempted using PHP but that really slowed things down.
Any ideas? Thanks.
Environment:
Server Host: localhost (127.0.0.1)
Server OS: Windows NT DESKTOP-4B8U62U 10.0 build 19045 (Windows 10) AMD64
Server Date: 06/07/2024 11:39:28
Server Up Time: Unsupported
HTTP Server: Apache/2.4.54 (Win64) OpenSSL/1.1.1p PHP/8.0.23
PHP Version: 8.0.23 (Zend: 4.0.23)
PHP File Uploads: On
Upload Max Size: 512M
PHP Memory Limit: -1
POST Max Size: 1024M
Database Engine: MySQL 10.4.25-MariaDB
Database Host: localhost (127.0.0.1)
Database Date: 06/07/2024 11:39:28
Database Data Size: 361,500 kB
Database Index Size: 50,015 kB
MySQL Slow Query Log Status: Off
MySQL Mode: NO_ZERO_IN_DATE, NO_ZERO_DATE, NO_ENGINE_SUBSTITUTION
3
Answers
We can use
REGEXP_REPLACE()
as follows:The regex pattern here will match:
([a-z0-9])
a single lowercase letter or number (and capture in$1
).
literal dot([A-Z])
an uppercase letter (and capture in$2
)We then replace with
$1. $2
to introduce a space after the period.Demo
Note: The
REGEXP_REPLACE()
function follows the case sensitivity of the underlying collation of the table/database. For case insensitive collations, the above answer would also target substrings such ashello.world
. For case sensitive and binary collations, the answer as given above would work.If you are using a case insensitive collation, you may still force a case sensitive regex search by using the
(?-i)
flag:If I understand what you want correctly, you want to use
REGEXP_REPLACE
with capture groups. The following will translate a string like"a.B c.D 3.F"
to"a. B c. D 3. F"
.I understand MariaDB supports the PCRE regex engine. One therefore may replace zero-width matches of the following regular expression with a space.
Demo
(?<=[a-z0-9].)
is a positive lookbehind ((?<=...)
) that asserts the match must be preceded by a period that is in turn preceded by character in the character class, namely, a lowercase letter or digit.(?=[A-Z])
is a positive lookahead ((?=...)
) that asserts the match must be followed by a character in the character class, namely, an uppercase letter.If, for example, the string were
"a.A"
, the expression would match the zero-width string between"a."
and"A"
, with"a. A"
being returned after the space is inserted.It may be necessary to write
\.
rather than.
, but regex101 does not permit that when testing regular expressions with the PCRE engine.Alternatively, one may replace the positive lookbehind with
so that the expression becomes
(again, perhaps the backslashes may need to be doubled). The token
K
causes the start of the match to be reset to the current string location and for all previously-consumed characters to be discarded.Demo