
My Java 11/Android app has to check if the URL/IP the user entered could theoretically be valid (using myURL.trim().matches(regex)).

What should be valid (including variations):

What shouldn’t be valid:

  • http://bla
  • bla
  • "—" (without the "", had to add them because of formatting)
  • 10.9
  • 10.1.1.
  • 10.1..1.9

I don’t care if the user enters e.g. "999.999.999.999", I only want to know if the syntax is theoretically correct, which it would be in this case.

So far I’ve been using this regex:

String regex1 = "((http|https|ftp)://)?((\\w)*|([0-9]*)|([-|_])*)+([\\.|/]((\\w)*|([0-9]*)|([-|_])*))+";

… but now I want to check for the port too, and the regex above doesn’t support that. This other version supports ports:

String regex2 = "((http|https|ftp)://)?(([^/:.[:space:]]+(.[^/:.[:space:]]+)*)|([0-9](.[0-9]{3})))(:[0-9]+)?((/[^?#[:space:]]+)([^#[:space:]]+)?(#.+)?)?";

… but it returns true for strings like "http://bla" or "—".

I also tested this version by Microsoft:

String regex3 = "^(ht|f)tp(s?)\\:\\/\\/[0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*(:(0-9)*)*(\\/?)([a-zA-Z0-9\\-\\.\\?\\,\\'\\/\\\\\\+&%\\$#_]*)?$";

… but it doesn’t work for some of the above examples: "http://bla" and "http://10.1.1.9:1234/bla" are accepted, while "10.1.1.9" and "stackoverflow.com" are rejected. I’m also not sure how to get this long PHP version to work with Java, as it complains about the "#" and ")".

I’m aware there’s already a lot of regex code for validating URLs floating around on Stack Overflow, and I already tested a bunch of other versions from the mega-thread (and other questions). The ones that I could get Android Studio to accept didn’t work, and my regex "knowledge" isn’t even remotely big enough to modify someone else’s code, let alone write my own.

How do I accomplish this?

2 Answers


  1. I don’t want to lose this regex in case I need it in the future so I will put it here.

    The regex below is separated into scheme, subdomain(s), domain, (cc)TLD, separate handling for IPv4 addresses*, URL path, query string, and finally anchor.

    ^(((ht|f)tps?)://)?(((\w+-*\w+\.)*(\w+-*\w+)(\.[\w]{2,})+)|(((1|2){0,1}[0-9]{1,2}\.){3}(1|2){0,1}[0-9]{1,2}))(:[0-9]{2,5})?(/(#/)?[\w0-9+/-]*(\?([\w0-9 ]+=([\w0-9]|%[0-7][0-9a-f])+&?)*)?(#[\w0-9]*)?)*$
    

    It treats the following as valid URLs:

    stackoverflow.com
    http://stackoverflow.com
    http://stack-overflow.com
    http://test-1.stack-overflow.com/salsa1/#11223a
    http://salsa.com/#/about-us
    http://test.com/#/about--cats
    http://amazon.co.uk
    https://www.amazon.co.uk
    ftp://10.1.1.9
    10.1.1.9:1234/bla/bla/bla
    255.1.1.9/bla/bla/bla
    http://10.1.1.9:1234/bla/bla
    http://amazon.co.uk/?name=value
    http://amazon.co.uk/test?name=value&sasa=1
    http://amazon10.com
    

    There may be many things it does not cover, but perhaps someone else can write a better regex.

    * The IPv4 part still allows 299, but it disallows 300 and above.

    Fiddle: https://regex101.com/r/aRi65l/4
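
Since the question uses Java, here is a minimal sketch of how the regex above might be plugged into Java 11. The class name `UrlSyntaxCheck` and the method `looksLikeUrl` are made up for illustration; the only change to the pattern is that every regex backslash is doubled, as Java string literals require.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
final class UrlSyntaxCheck {

    // The answer's regex as a Java string literal (each regex "\" written as "\\").
    // Compiled once so repeated checks do not recompile the pattern.
    private static final Pattern URL_SYNTAX = Pattern.compile(
        "^(((ht|f)tps?)://)?"                                         // optional scheme
        + "(((\\w+-*\\w+\\.)*(\\w+-*\\w+)(\\.[\\w]{2,})+)"            // host name with TLD
        + "|(((1|2){0,1}[0-9]{1,2}\\.){3}(1|2){0,1}[0-9]{1,2}))"      // or IPv4-like address
        + "(:[0-9]{2,5})?"                                            // optional port
        + "(/(#/)?[\\w0-9+/-]*"                                       // path
        + "(\\?([\\w0-9 ]+=([\\w0-9]|%[0-7][0-9a-f])+&?)*)?"          // query string
        + "(#[\\w0-9]*)?)*$");                                        // anchor

    // Returns true if the trimmed input matches the pattern as a whole.
    static boolean looksLikeUrl(String input) {
        return input != null && URL_SYNTAX.matcher(input.trim()).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "stackoverflow.com", "http://10.1.1.9:1234/bla/bla",
            "http://bla", "bla", "10.9", "10.1.1.", "10.1..1.9"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeUrl(s));
        }
    }
}
```

Precompiling with `Pattern.compile` is a design choice rather than a requirement; `myURL.trim().matches(regex)` from the question should behave the same, it just recompiles the pattern on every call.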

  2. For reference, here is the relevant BNF from RFC 1738 – Uniform Resource Locators (URL).

    genericurl     = scheme ":" schemepart
    scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]
    schemepart     = *xchar | ip-schemepart
    ip-schemepart  = "//" login [ "/" urlpath ]
    login          = [ user [ ":" password ] "@" ] hostport
    user           = *[ uchar | ";" | "?" | "&" | "=" ]
    password       = *[ uchar | ";" | "?" | "&" | "=" ]
    hostport       = host [ ":" port ]
    host           = hostname | hostnumber
    hostname       = *[ domainlabel "." ] toplabel
    hostnumber     = digits "." digits "." digits "." digits
    domainlabel    = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
    toplabel       = alpha | alpha *[ alphadigit | "-" ] alphadigit
    port           = digits
    urlpath        = *xchar    ; depends on protocol see section 3.1
    uchar          = unreserved | escape
    xchar          = unreserved | reserved | escape
    unreserved     = alpha | digit | safe | extra
    safe           = "$" | "-" | "_" | "." | "+"
    extra          = "!" | "*" | "'" | "(" | ")" | ","
    reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
    escape         = "%" hex hex
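
As a rough illustration, the `hostport` rules above could be translated into a Java regex as sketched below. The class name `Rfc1738HostPort` and the method `isHostPort` are hypothetical, and the sketch covers only `hostport` (host name or dotted number plus optional port), not the scheme, login, or path.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a sketch of the hostport rules only, not a full URL parser.
final class Rfc1738HostPort {

    // domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
    private static final String DOMAIN_LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";

    // toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
    private static final String TOP_LABEL = "[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?";

    // hostname = *[ domainlabel "." ] toplabel
    private static final String HOSTNAME = "(?:" + DOMAIN_LABEL + "\\.)*" + TOP_LABEL;

    // hostnumber = digits "." digits "." digits "." digits
    private static final String HOSTNUMBER = "[0-9]+(?:\\.[0-9]+){3}";

    // hostport = host [ ":" port ], with port = digits
    private static final Pattern HOSTPORT =
        Pattern.compile("(?:" + HOSTNAME + "|" + HOSTNUMBER + ")(?::[0-9]+)?");

    // Returns true if the whole string is a hostport according to the rules above.
    static boolean isHostPort(String s) {
        return s != null && HOSTPORT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"stackoverflow.com", "10.1.1.9:1234", "999.999.999.999", "bla", "10.9"};
        for (String s : samples) {
            System.out.println(s + " -> " + isHostPort(s));
        }
    }
}
```

Note that under this grammar a single label such as "bla" is a legal host name, because `hostname` allows zero `domainlabel "."` prefixes. Rejecting "http://bla", as the question asks, would therefore need an extra rule on top of the grammar, for example requiring at least one dot.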
    