
Before you close this question as a duplicate or for some other reason, please consider that the emphasis here is in part on how to define a character class for validating the allowed IDN character set. This question is not intended to be opinion based. I’m looking for expert advice and guidance (your expert knowledge with any supporting references), among other things, but the scope of the question is well defined and specific.

For the above, my current references are Unicode IDN FAQ, last Q&A, and https://www.unicode.org/reports/tr36/idn-chars.html.

While I believe the first reference above is a solid starting point, there seem to be a few errors in the specified pattern (potential errata), which make it difficult for me to interpret. The specified derivation may also be getting outdated. There seems to be ongoing interest in explicitly defining a whitelist-based derivation, rather than one based on set intersection and subtraction.

However, for my purposes, I will be satisfied with an updated error-free version of the derivation specification as specified in the IDN FAQ.

There doesn’t seem to be a regular expression based solution developed yet that can even validate an IDN in a lax manner. The problem seems to be deemed too difficult to solve using regexps. Even if a fully compliant regexp seems too difficult, I’m trying to at least develop a set of three regexps, for classic DNs, ACE IDNs, and non-ACE IDNs.
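As a rough sketch of that three-way split (not a validator), each label could first be sorted into one of the three groups before the matching regexp is applied. The function names and group names below are made up for illustration:

```javascript
// Classify one label as "ace" (xn-- prefix), "non-ace-idn" (contains
// non-ASCII characters) or "classic" (plain ASCII). This only routes the
// label to a regexp; it does not validate anything by itself.
function classifyLabel(label) {
  if (/^xn--/i.test(label)) return "ace";
  if (/[^\x00-\x7F]/.test(label)) return "non-ace-idn";
  return "classic";
}

// Classify every dot-separated label of a name.
function classifyName(name) {
  return name.split(".").map(classifyLabel);
}

console.log(classifyName("xn--bcher-kva.example"));
console.log(classifyName("b\u00fccher.example"));
console.log(classifyName("www.example.com"));
```

A name can mix groups from label to label, which is one reason a single monolithic expression gets complicated quickly.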

In this question, I’m mainly looking for assistance and clarifications of specifications and references to help with my own attempts at developing those regexps. Any pointers in this regard are welcome.

The next issue is that many of the existing regexps available on SO and elsewhere seem to be incomplete and not very robust, and in some cases not very well defined or performant. As optimization considerations, I’m interested in reducing backtracking where possible, and in ordering sub-expressions and characters/ranges within character classes by frequency of encounter. There may be other considerations. One correctness consideration is that many of the regexps in the wild don’t correctly validate the length of the whole name and of its parts. Another is that, as I understand it, the TLD is supposed to start with a letter, yet many solutions in the wild also allow a digit as the first character. There are multiple such issues with existing solutions. I’m also unsure about consecutive dashes other than in the IDN prefix xn--. I think such consecutive dashes were disallowed in the past, but current DNS implementations seem to allow them.
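To make the length and first-character checks concrete, here is a minimal sketch that splits the name on dots and checks each part separately, which keeps every regexp small and free of nested quantifiers. The limits (63 per label, 253 for the whole name) are my assumption from the usual reading of the DNS specifications, the letter-first TLD rule is the assumption stated above, and `checkHostname` is a hypothetical name:

```javascript
// Assumed limits: 63 characters per label, 253 for the whole name.
const MAX_LABEL = 63;
const MAX_NAME = 253;

// Letters, digits and hyphens, with no hyphen at either end.
// Consecutive hyphens are allowed here.
const LABEL = /^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/i;

function checkHostname(name) {
  if (name.length === 0 || name.length > MAX_NAME) return "bad total length";
  const labels = name.split(".");
  for (const label of labels) {
    if (label.length === 0 || label.length > MAX_LABEL) return "bad label length";
    if (!LABEL.test(label)) return "bad label syntax";
  }
  // Assumption from the question: a TLD must start with a letter.
  const tld = labels[labels.length - 1];
  if (labels.length > 1 && !/^[a-z]/i.test(tld)) return "TLD must start with a letter";
  return "ok";
}

console.log(checkHostname("example.com"));
console.log(checkHostname("example.1com"));
console.log(checkHostname("a".repeat(64) + ".com"));
```

Returning a reason string instead of a boolean makes it easier to see which rule rejected a given test case.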

I will proceed to post my current solutions as an answer to this question. This question and my own paired answer are in part still to be considered as work-in-progress (still under development, and still subject to potential revisions).

My interpretation of the derivation specified in the IDN FAQ is as follows for a RegExp with the v flag.

IDNA 2003/2008

[[\P{Changes_When_NFKC_Casefolded}--\p{c}--\p{z}--\p{s}--\p{p}--\p{nl}--\p{no}--\p{me}--\p{HST=L}--\p{HST=V}--\p{block=Combining_Diacritical_Marks_For_Symbols}--\p{block=Musical_Symbols}--\p{block=Ancient_Greek_Musical_Notation}--[\u0640\u07FA\u302E\u302F\u3031-\u3035\u303B]]++[\u00B7\u0375\u05F3\u05F4\u30FB]++[\u002D\u06FD\u06FE\u0F0B\u3007]++[\u00DF\u03C2]++\p{JoinControl}]

I hope you can correct me if I have messed up the above interpretation (it has not yet been tested, as it will take a significant amount of time to set up a proper test environment, cases, and samples), and I hope you can point me to any updated and better (more correct) derivation, if one exists.

2 Answers


  1. Chosen as BEST ANSWER

    This is a partial answer, in the scope of the original question. This is still work-in-progress. Any potential errors will be rectified in due course, if and when found.

    Intranet

    ^(?=.{1,254}$)[a-z0-9A-Z](?:[a-z0-9A-Z-]{0,61}[a-z0-9A-Z]|)(?:\.[a-zA-Z]([a-z0-9A-Z-]{0,61}[a-z0-9A-Z]|))*$
    

    Internet

    ^(?=.{4,254}$)(?:[a-z0-9A-Z](?:[a-z0-9A-Z-]{0,61}[a-z0-9A-Z]|)\.)+[a-zA-Z][a-z0-9A-Z-]{0,61}[a-z0-9A-Z]$
    

    Intranet - Lower Case or for i flag

    ^(?=.{1,254}$)[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9]|)(?:\.[a-z]([a-z0-9-]{0,61}[a-z0-9]|))*$
    

    Internet - Lower Case or for i flag

    ^(?=.{4,254}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9]|)\.)+[a-z][a-z0-9-]{0,61}[a-z0-9]$
    

    Internet + ASCII Compatible Encoding (ACE), Lower Case or for i flag

    ^(?=.{4,254}$)((xn--(?=59)([a-z]+-)?[a-z0-9]+|[a-z0-9]([a-z0-9-]{0,61}[a-z0-9]|))\.)+(xn--(?=59)([a-z]+-)?[a-z0-9]+|[a-z][a-z0-9-]{0,61}[a-z0-9])$
    

    Note to @Keith: The last ACE version applies after the name has been converted to punycode. The first 4 expressions are for non-IDN (classic) domains. However, I'm also currently attempting to develop a pre-punycode pre-validation (partial/lax validation). That would depend heavily on whether the interpretation of the character derivation in the original question is correct.
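
    A quick way to exercise the lower-case Internet expression, written with the dots between labels escaped as `\.`, is a handful of boundary cases. This is only a smoke test with inputs I picked myself, not a test suite:

    ```javascript
    // Lower-case "Internet" expression: at least one label plus a TLD that
    // starts with a letter and is at least two characters long.
    const internetLower =
      /^(?=.{4,254}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9]|)\.)+[a-z][a-z0-9-]{0,61}[a-z0-9]$/;

    const samples = [
      "example.com",            // plain two-label name
      "a.bc",                   // shortest accepted form
      "xn--bcher-kva.com",      // ACE label is still plain letters/digits/hyphens
      "localhost",              // no dot
      "example.c",              // one-character TLD
      "example.1om",            // TLD starting with a digit
      "-example.com",           // leading hyphen
      "a".repeat(63) + ".com",  // label at the 63-character limit
      "a".repeat(64) + ".com",  // label one character over the limit
    ];

    for (const name of samples) {
      console.log(internetLower.test(name), name.length > 20 ? name.length + " chars" : name);
    }
    ```

    The first three samples and the 63-character one should be accepted and the rest rejected.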


  2. Doing this with regular expressions will be very, very, very hard, so hard that I suggest looking at too-lax or too-strict regexes instead, at least if they’re shorter than 150 characters or so.

    I’ll answer about the Real Thing.

    The rules for which characters are permissible vary. In the top-level domains (.com/.السعودية/…), the set of permissible characters is defined by the RZ-LGR, the Root Zone Label Generation Rules, defined by ICANN based on input from something like twenty committees, each containing experts on one script.

    Each top-level domain gets to set its rules, and practically all of them choose to use a subset of the MSR, the maximal starting repertoire, of the LGR. The MSR contains a little over 33,000 code points: those Unicode code points that are considered safely distinguishable and are used by a current community. It’s not complete. It contains the big scripts, but smaller ones are still being added. I’ve spoken to someone who is working on adding regional Balinese scripts to it. It does not contain things like viking runes (no current community) or smileys (not easily distinguishable).

    Something like .org will accept most of them, something like .it will accept the 30 or 40 that Italians tend to use. .fr and some others defined their own sets (that happen to be subsets of the MSR), one well-known TLD uses a superset right now (but don’t expect that to last).

    The rules for each TLD are published, but I’m too lazy tonight to look up the URLs. Anyway, you could write a regex like (([…]+\.se)|([…]+\.fi)|…)$, with no more than 500 alternatives, each containing a tiny character class matching fewer than 33,000 code points.

    All told, the regex might be a megabyte long, perhaps a little longer, and could be generated from the published specifications.
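
    A minimal sketch of that generation step, assuming the published tables have already been parsed into lists of code point ranges. The two repertoires below are made up for illustration and are not the real .se or .fi tables, and `buildDomainRegex` is a hypothetical name:

    ```javascript
    // Hypothetical, tiny per-TLD repertoires as [first, last] code point ranges.
    const repertoires = {
      se: [[0x61, 0x7a], [0x30, 0x39], [0xe4, 0xe5], [0xf6, 0xf6]],
      fi: [[0x61, 0x7a], [0x30, 0x39], [0xe4, 0xe5], [0xf6, 0xf6], [0x161, 0x161]],
    };

    // Render one code point as a \u{...} escape usable under the u flag.
    const esc = (cp) => `\\u{${cp.toString(16)}}`;

    // Turn a list of ranges into a character class.
    function toClass(ranges) {
      const body = ranges
        .map(([lo, hi]) => (lo === hi ? esc(lo) : `${esc(lo)}-${esc(hi)}`))
        .join("");
      return `[${body}]`;
    }

    // One alternative per TLD: a single label from that TLD's class, a dot, the TLD.
    function buildDomainRegex(table) {
      const alternatives = Object.entries(table).map(
        ([tld, ranges]) => `${toClass(ranges)}+\\.${tld}`
      );
      return new RegExp(`^(?:${alternatives.join("|")})$`, "u");
    }

    const re = buildDomainRegex(repertoires);
    console.log(re.test("sm\u00f6rg\u00e5s.se")); // true
    console.log(re.test("\u0161akki.fi"));        // true
    console.log(re.test("\u0161akki.se"));        // false: not in the made-up .se set
    ```

    The sketch ignores hyphens, label lengths, subdomains and non-ASCII TLD names, all of which a real generator would have to handle.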

    I admit I’m curious and would like to see such a thing. And if it triggers an overflow bug in a regex engine, I would really like to see the maintainer’s response to the bug report.
