I'm trying to use RegEx to find a website using this RegEx:
const RegEx = /h[\w]+[:\/\/]+\w+[.]+[\w.=-]+[.]+[\w]{2,3}[\w\/=-]+\\/g
Websites = document.children[0].innerHTML.match(RegEx)
console.log(Websites)
//["https://www.linuxfoundation.org/cookies\\","https://de.wikipedia.org\\",...]
It should match "https://de.wikipedia.org\", because of the \\ at the end, where the first backslash escapes the other.
But when I execute this, it only logs websites with "\\" at the end.
It works on regex101.com but not in the browser.
It should find domains with \ at the end!
I tried to use new RegExp and [\\], but both returned only \\ at the end.
3 Answers
I see it still runs OK
-> Console result
In your regular expression pattern, you’re using \\ to match a single backslash. However, when you’re inspecting the matched results, the console escapes the backslash again when it displays the string, so you see \\ in the output.

If you want to match URLs with a single backslash \ at the end but see them as \ in the output, you can achieve this by modifying the regular expression pattern to include a single backslash \ and then examine the matched results. Here’s how you can do it:

In this code, after you’ve matched the URLs, you use map to replace double backslashes \\ with single backslashes. This way, you see the URLs with a single backslash in the output.
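A minimal sketch of the point above, assuming a hypothetical sampleText in place of the page's innerHTML. It suggests that each matched string already ends in a single backslash, and that the doubled form only shows up when a string is displayed in escaped notation. The names sampleText, urlPattern, matches, and cleaned are illustrative and not from the original post.

```javascript
// Hypothetical sample text standing in for document.children[0].innerHTML.
// Inside a JavaScript string literal, \\ denotes ONE backslash character.
const sampleText =
  'a "https://de.wikipedia.org\\" b "https://www.linuxfoundation.org/cookies\\" c';

// The pattern from the question; the trailing \\ matches one literal backslash.
const urlPattern = /h[\w]+[:\/\/]+\w+[.]+[\w.=-]+[.]+[\w]{2,3}[\w\/=-]+\\/g;

// match() returns null when nothing matches, so fall back to an empty array.
const matches = sampleText.match(urlPattern) || [];

// Logging a single string shows its raw characters, not the escaped form.
console.log(matches[0]);

// Check the actual content: one trailing backslash, not two.
console.log(matches[0].endsWith('\\'));   // true
console.log(matches[0].endsWith('\\\\')); // false

// The map step described above: replace doubled backslashes with single ones.
const cleaned = matches.map((url) => url.replace(/\\\\/g, '\\'));
console.log(cleaned[0] === matches[0]);   // true
```

Under these assumptions the map step leaves the matches unchanged, because the data never contained a doubled backslash to begin with.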
Discussion:
As suggested, please see the screenshot, which uses the regex ‘https?://(?:[a-z0-9-]{1,63}\.)+(?:[a-z0-9-]{1,63})+’.
Your regular expression has many issues that need to be addressed before attempting to solve your problem.

You have:

/h[\w]+[:\/\/]+\w+[.]+[\w.=-]+[.]+[\w]{2,3}[\w\/=-]+\\/g

And it doesn’t make sense. For example, [:\/\/]. Square brackets define a set of characters that you want to match. There is no reason to have the same character more than once in the set, and the characters used do not need to be escaped. Your [:\/\/] is equal to [:/].

I assume you want to match http:// or https://. The regular expression https?:\/\/ is used for that. The question mark makes the "s" optional.

The next part, I assume, is used to match the domain name.
\w is a character class that matches a-z, A-Z, 0-9 and _ (underscore). Domain names cannot have underscores. Valid ASCII domain names can have a-z, A-Z, 0-9 and - (hyphens), and each label (the part between dots) can be between 1 and 63 characters long. The expression [a-zA-Z0-9-]{1,63} matches it.

A website usually has one or more subdomains under the top-level domain. By placing the expression in a group, it can match multiple instances: (?:[a-zA-Z0-9-]{1,63}\.)+[a-zA-Z0-9-]{1,63} matches one or more subdomains followed by the top-level domain.

We now have /https?:\/\/(?:[a-zA-Z0-9-]{1,63}\.)+[a-zA-Z0-9-]{1,63}/ that matches the protocol (http or https) and the domain name.

Now we can add your requirements. As I understand it from your examples, you want to match the string https://de.wikipedia.org\. Then we add \\ to the end of the expression (because backslashes must be escaped), resulting in /https?:\/\/(?:[a-zA-Z0-9-]{1,63}\.)+[a-zA-Z0-9-]{1,63}\\/
You can test it below:
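A minimal sketch for trying the final expression, assuming a hypothetical html string in place of the real page source. The names html, domainWithBackslash, and found are illustrative, and the sample URLs are made up for the test.

```javascript
// Hypothetical input; inside a string literal, \\ denotes ONE backslash.
const html = [
  '{"url":"https://de.wikipedia.org\\"}',       // domain followed by a backslash
  '{"url":"http://example.com\\"}',             // plain http, also followed by a backslash
  '{"url":"https://no-backslash.example.org"}', // no trailing backslash, should not match
].join('\n');

// Protocol, one or more dot-terminated labels, a final label, then one backslash.
const domainWithBackslash =
  /https?:\/\/(?:[a-zA-Z0-9-]{1,63}\.)+[a-zA-Z0-9-]{1,63}\\/g;

// match() returns null when nothing matches, so fall back to an empty array.
const found = html.match(domainWithBackslash) || [];

// Log each match on its own so the raw characters are visible.
for (const url of found) {
  console.log(url);
}
```

This pattern only covers the protocol and domain, so a URL with a path before the backslash (such as the /cookies example from the question) would need an extra path part.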
Note! The domain match is not complete. There are rules for where the hyphen can be placed, and there are rules for how many subdomains there can be. With international domain names, almost any character from any script can be used. The example provided does not attempt to implement a full domain validation and does not support IDNs unless they are in punycode.