skip to Main Content

I want to run a regexp find and replace on a wordpress database, specifically the in wp_posts table, in the post_content column. I have several affiliate URLs scattered in the text of the post_content column, and I need to replace them with new affiliate URLs. The problem is that the current URLs have multiple params that may vary depending on the URL, and they are not always in the same order. The only param that is consistent across all URLs id the store_id param.

Example of a current URL:
https://www.affialite.com/redirect.php?tt=el&store_id=this_is_the_store_id&param1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com&c_id=id_of_the_campaign

Example of how it should look after the find/replace operation:
https://newaffiliate.net/c/1234/5678/9012?campaign=id_of_the_campaign&u=https://redirect_the_visitor_to_this_url.com

The query must find the affiliate links, replace the first part of the URL with the new affiliate url, capture the value of param url and match it to the new u param, and capture the value of the c_id param and match it to the new campaign param.

I have close do zero experience with mysql, other than the regular CRUD operations, so I was using ChatGPT to help me build a query for this. The closest I got is the one below, but the resulting URL has the params in all the wrong places.

UPDATE wp_posts
SET post_content =
    REGEXP_REPLACE(
        post_content,
        'https://www.affiliate.com/redirect.php?(?:.*?&|.*?&)(store_id=this_is_the_store_id)(?:&.*?|&)(.*?)',
        CONCAT(
            'https://newaffiliate.net/c/1234/5678/9012?',
            'campaign=',
            REGEXP_SUBSTR(
                REGEXP_SUBSTR(CONCAT('&', '\2', '&'), '&c_id=([^&]+)'),
                '[^=]+$'
            ),
            '&u=',
            REGEXP_SUBSTR(
                REGEXP_SUBSTR(CONCAT('&', '\2', '&'), '&url=([^&]+)'),
                '[^=]+$'
            )
        )
    )
WHERE post_content REGEXP 'https://www.affiliate.com/click.php?(?:.*?&|.*?&)(store_id=this_is_the_store_id)(?:&.*?|&)(.*?)';

Here’s the example of a wrong resulting URL:
https://newaffiliate.com?campaing=&u=param1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com&c_id=id_of_the_campaign

Any insight on what’s wrong with the query? Any help is very much appreciated.

Thank you all.

2

Answers


  1. The task is much simpler than ChatGPT made it.

    I notice one thing ChatGPT got wrong is that the MySQL syntax for a back-reference is like $1, $2, not 1, 2.

    I tested the following in MySQL 8.0.36. I do the substitution and also include the literal intended URL, so we can compare them side-by-side.

    CREATE TABLE wp_posts ( post_content text );
    INSERT INTO wp_posts VALUES ('https://www.affiliate.com/redirect.php?tt=el&store_id=this_is_the_store_id&param1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com&c_id=id_of_the_campaign');
    
    SELECT REGEXP_REPLACE(post_content,
       'https://www.affiliate.com/redirect.php\?.*url=([^&]*).*c_id=([^&]*)',
       'https://newaffiliate.net/c/1234/5678/9012?campaign=$2&u=$1') AS new_url,
     'https://newaffiliate.net/c/1234/5678/9012?campaign=id_of_the_campaign&u=https://redirect_the_visitor_to_this_url.com' AS intended
    FROM wp_postsG
    

    Output:

    *************************** 1. row ***************************
     new_url: https://newaffiliate.net/c/1234/5678/9012?campaign=id_of_the_campaign&u=https://redirect_the_visitor_to_this_url.com
    intended: https://newaffiliate.net/c/1234/5678/9012?campaign=id_of_the_campaign&u=https://redirect_the_visitor_to_this_url.com
    

    You said the parameters in the URL are not always in the same order. The solution above only supports one order, where the url parameter is before the c_id parameter. The solution is simple to modify to handle cases where the parameters are in the other order.

    REGEXP_REPLACE(post_content,
       'https://www.affiliate.com/redirect.php\?.*c_id=([^&]*).*url=([^&]*)',
       'https://newaffiliate.net/c/1234/5678/9012?campaign=$2&u=$1')
    

    This may require running your UPDATE in two passes, but given that it allows both statements to be much simpler, it’s still a win.

    Login or Signup to reply.
  2. @BillKarwin has explained the issues with, and how to simplify, your current query very well. However, if you are running MySQL >= 8.0.4, you can modify his suggested query using capturing groups inside lookaheads, so that you can deal with the parameters in any order. For example:

    CREATE TABLE wp_posts (post_content text);
    INSERT INTO wp_posts VALUES
      ('https://www.affiliate.com/redirect.php?tt=el&store_id=this_is_the_store_id&param1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com&c_id=id_of_the_campaign'),
      ('https://www.affiliate.com/redirect.php?tt=el&store_id=this_is_the_store_id&c_id=id_of_the_campaign&param1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com')
      ('https://www.affiliate.com/redirect.php?tt=el&store_id=this_is_another_store_id&c_id=id_of_the_campaign&param1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com')
    ;
    
    UPDATE wp_posts
    SET post_content = REGEXP_REPLACE(
      post_content,
      'https://www\.affiliate\.com/redirect\.php\?(?=.*\bstore_id=this_is_the_store_id\b)(?=.*\bc_id=([^&]*))(?=.*\burl=([^&]*)).*',
      'https://newaffiliate.net/c/1234/5678/9012?campaign=$1&u=$2'
    )
    
    SELECT *
    FROM wp_posts
    

    Output:

    post_content
    https://newaffiliate.net/c/1234/5678/9012?campaign=id_of_the_campaign&u=https://redirect_the_visitor_to_this_url.com
    https://newaffiliate.net/c/1234/5678/9012?campaign=id_of_the_campaign&u=https://redirect_the_visitor_to_this_url.com
    https://www.affiliate.com/redirect.php?tt=el&store_id=this_is_another_store_id&c_id=id_of_the_campaign¶m1=this_is_param_1&url=https://redirect_the_visitor_to_this_url.com
    

    Demo on dbfiddle.uk

    Notes:

    • you don’t need a WHERE clause on the UPDATE; if the regex does not match then no replacement will occur.
    • you don’t need to escape /; but you should escape . as that can match any character
    • it’s safer to use a b in front of c_id and url so they don’t match (for example) ac_id or curl
    • to escape a value you need \ in your string due to the first being absorbed by string parsing.
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search