Hi in our below code using codeql scanning got an alert that "This part of the regular expression may cause exponential backtracking on strings starting with ‘0’ and containing many repetitions of ‘0’."
const validateUrl = str => {
var pattern = new RegExp(
"^(https?:\/\/)?" + // protocol
"((([a-z\d]([a-z\d-]*[a-z\d])*)\.)+[a-z]{2,}|" + // domain name
please let me know how can we resolve this and what exactly the problem?
2
Answers
The problem is that the regular expression you’re using has a pattern that can cause exponential backtracking on certain strings. This means that for some strings, the regular expression engine may end up trying a huge number of possible paths through the pattern, resulting in slow performance and a possible denial of service if an attacker can control the input string.
The part of the expression that’s triggering this warning is likely this bit:
This pattern is trying to match a series of alphanumeric characters and dashes, but the way it’s structured can cause issues. The nested repetition (
*
inside of another*
) is what creates the risk of exponential backtracking.You can usually rewrite the pattern to avoid the nested repetition. In this case, you might rewrite the pattern to something like this:
Here, I’ve changed the pattern to use
([a-z\d]+[-])*
to match one or more alphanumeric characters followed by a dash, repeated zero or more times, and then followed by[a-z\d]+
, which matches one or more alphanumeric characters.This new pattern should still match valid domain names like the old pattern but without the risk of exponential backtracking. You may want to test it on various inputs to make sure it still behaves as intended.
As the other answer has correctly identified, the problem is
([a-zd-]*[a-zd])*
. This RegEx recognizes a superset of([a-zd]+)*
. The nesting of the two quantifiers gives a naive backtracking RegEx implementation too many options to try: Consider a long string consisting ofa
‘s (or0
‘s, or any other character in[a-zd-]
) followed by$
. The RegEx engine will first try matching a singlea
, then matching further according to the*
. If this fails, it will later try matching twoa
‘s and so on. The same will be repeated afterwards. Ultimately your RegEx engine will have tried all options of partitioning thea
‘s into substrings of length > 0, of which there are superexponentially many.Note that the enclosing
(....)+
is not an issue, since the dot serves as a delimiter.Your first option, which wouldn’t require changing the RegEx at all, would simply be to switch to a RegEx engine with guaranteed linear time complexity. This flaw is only exhibited by backtracking RegEx engines and not by RegEx engines which use DFAs (Deterministic Finite Automatons). There seem to be NPM packages which can convert RegEx to DFA.
Your second option is to rewrite the RegEx.
[a-zd]([a-zd-]*[a-zd])*
matches any sequence of alphanumerics and dashes as long as the dashes aren’t at the end or beginning; thus we can convert the outer*
quantifier into a?
quantifier:[a-zd]([a-zd-]*[a-zd])?
.Alternatively, we could also write it as
[a-zd]+(-+[a-zd]+)*
. This nests quantifiers, but the dashes serve as a delimiter, so this shouldn’t exhibit catastrophic backtracking either.You’re not showing the full RegEx, but it is most likely insufficient to validate URLs. A proper, spec-compliant RegEx to validate URLs is rather lengthy. Take a look at the syntax diagram on Wikipedia; you’re missing entire parts of it (like userinfo, port, query, fragment). Peter Seliger’s comment is spot on: