
We are trying to clean up our sitemap because our Magento store has created duplicate pages. I want to use a regular expression to select, or invert-select, all of the pages that are linked directly under the top-level URL.

For example, we want to match only the first of these lines:

/site/product    <-- this one

/site/category/product/

/site/category/product

Is there any way to match only strings that contain exactly two forward slashes, where the slashes are not next to each other?

Thank you for your help in advance.

I’ve tried something like this:

(.*(?<!/)$)

2 Answers


  1. Chosen as BEST ANSWER

    I would like to provide a quick answer to this problem in case it helps anyone else in the future. Our sitemap had too many duplicate URLs due to an incorrect setup on our Magento store. Instead of submitting a sitemap with 20,000+ top-level URLs, we decided to remove the top-level items manually.

    Not ideal at all.

    We tweaked the sitemap's PHP generation code so that top-level URLs were output as site/category/id/###. Then we used Notepad++ to bookmark those lines and delete them.


  2. Your pattern (.*(?<!/)$) matches any character except a newline up to the end of the string, and then asserts that the character immediately to the left is not a forward slash. In other words, it matches every line that does not end in a slash, which gives you the first and the third lines.

    Instead, you could anchor at the start of the string with ^, then match exactly two segments, each one a forward slash followed by one or more characters that are neither a forward slash nor a newline, [^/\n]+, and then assert the end of the string with $:

    ^/[^/\n]+/[^/\n]+$
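
To see the difference between the two patterns, here is a minimal Python sketch that runs both against the three sample paths from the question. Python is only used for illustration (the question does not name a language), and the helper name `matching` is made up for this example.

```python
import re

# Sample paths taken from the question.
paths = [
    "/site/product",
    "/site/category/product/",
    "/site/category/product",
]

# Pattern from the question: only checks that the line does not end in "/".
ORIGINAL = re.compile(r"(.*(?<!/)$)")

# Pattern from the answer: exactly two segments, each "/" plus non-slash text.
TOP_LEVEL = re.compile(r"^/[^/\n]+/[^/\n]+$")


def matching(pattern, lines):
    """Return the lines that the pattern matches in full (hypothetical helper)."""
    return [line for line in lines if pattern.fullmatch(line)]


original_hits = matching(ORIGINAL, paths)
top_level_hits = matching(TOP_LEVEL, paths)

print(original_hits)   # lines that do not end in a slash
print(top_level_hits)  # lines with exactly two path segments
```

The question's pattern keeps both paths without a trailing slash, while the anchored pattern keeps only /site/product, because [^/\n]+ cannot cross a third slash.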
    

