My Java 11/Android app has to check whether the URL/IP the user entered could theoretically be valid (using myURL.trim().matches(regex)).
What should be valid (including variations):
- stackoverflow.com
- http://stackoverflow.com
- http://amazon.co.uk
- https://www.amazon.co.uk
- ftp://10.1.1.9
- 10.1.1.9:1234
- http://10.1.1.9:1234/bla
What shouldn’t be valid:
- http://bla
- bla
- "—" (without the "", had to add them because of formatting)
- 10.9
- 10.1.1.
- 10.1..1.9
I don’t care if the user enters e.g. "999.999.999.999", I only want to know if the syntax is theoretically correct, which it would be in this case.
So far I’ve been using this regex:
String regex1 = "((http|https|ftp)://)?((\w)*|([0-9]*)|([-|_])*)+([\.|/]((\w)*|([0-9]*)|([-|_])*))+";
… but now I want to check for the port too, and the regex above doesn’t support that. This other version supports ports:
String regex2 = "((http|https|ftp)://)?(([^/:.[:space:]]+(.[^/:.[:space:]]+)*)|([0-9](.[0-9]{3})))(:[0-9]+)?((/[^?#[:space:]]+)([^#[:space:]]+)?(#.+)?)?";
… but it returns true for strings like "http://bla" or "—".
I also tested this version by Microsoft:
String regex3 = "^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$";
… but it doesn’t work for some of the above examples ("http://bla" and "http://10.1.1.9:1234/bla" are valid, "10.1.1.9" and "stackoverflow.com" aren’t). I’m also not sure how to get this long PHP version to work with Java, as it complains about the "#" and ")".
I’m aware there’s already a lot of regex code for validating URLs floating around on Stack Overflow, and I already tested a bunch of other versions from the mega-thread (and other questions). However, the ones that I could get Android Studio to accept didn’t work, and my regex "knowledge" isn’t even remotely big enough to modify someone else’s code, let alone write my own.
How do I accomplish this?
2 Answers
I don’t want to lose this regex in case I need it in the future so I will put it here.
The regex below is separated into scheme, subdomain(s), domain, (cc)TLD, separate handling for IPv4 addresses*, URL path, query string, and finally anchor.
It will see the following as valid URLs:
There might be many things it does not cover, but perhaps someone else can write a better regex.
* The IP address part still allows octets up to 299, but it disallows 300 and above.
Fiddle: https://regex101.com/r/aRi65l/4
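As a rough illustration of that decomposition, here is a minimal Java sketch that builds a pattern from named parts (scheme, host name or IPv4 address, port, path, query string, anchor). It is not the regex from the fiddle. The class name UrlSyntaxCheck, the method looksLikeUrl, and the constant names are made up for this example, and the rules are assumptions on my part: only http, https and ftp as schemes, host names that end in an alphabetic TLD of at least two letters, and an octet pattern that accepts 0 to 299 as described in the note above.

```java
import java.util.regex.Pattern;

final class UrlSyntaxCheck {
    // Optional scheme; only the three schemes named in the question (assumption).
    private static final String SCHEME = "(?:(?:https?|ftp)://)?";

    // One DNS-style label: letters and digits, with hyphens allowed only inside.
    private static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";

    // One or more labels followed by an alphabetic TLD, so a bare "bla" fails.
    private static final String HOSTNAME = "(?:" + LABEL + "\\.)+[A-Za-z]{2,}";

    // Octet that accepts 0..299 (leading zeros included) and rejects 300 and above.
    private static final String OCTET = "[0-2]?[0-9]{1,2}";
    private static final String IPV4 = OCTET + "(?:\\." + OCTET + "){3}";

    // Optional trailing parts; each one may be absent.
    private static final String PORT = "(?::[0-9]{1,5})?";
    private static final String PATH = "(?:/[^\\s?#]*)?";
    private static final String QUERY = "(?:\\?[^\\s#]*)?";
    private static final String ANCHOR = "(?:#\\S*)?";

    private static final Pattern URL = Pattern.compile(
            SCHEME + "(?:" + IPV4 + "|" + HOSTNAME + ")" + PORT + PATH + QUERY + ANCHOR);

    // Hypothetical helper: true if the trimmed input matches the whole pattern.
    static boolean looksLikeUrl(String input) {
        return input != null && URL.matcher(input.trim()).matches();
    }

    public static void main(String[] args) {
        // Samples taken from the question's "valid" and "not valid" lists.
        String[] samples = {
            "stackoverflow.com", "http://stackoverflow.com", "http://amazon.co.uk",
            "https://www.amazon.co.uk", "ftp://10.1.1.9", "10.1.1.9:1234",
            "http://10.1.1.9:1234/bla",
            "http://bla", "bla", "—", "10.9", "10.1.1.", "10.1..1.9"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeUrl(s));
        }
    }
}
```

Keeping each part in its own constant makes it easier to loosen a single rule later, for example widening OCTET to "[0-9]{1,3}" if inputs like "999.999.999.999" should pass as well.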
For reference, here is the relevant ABNF from RFC 1738 – Uniform Resource Locators (URL).