I need to get the second part of the URI. The possible URIs are:

/api/application/v1/method
/web/application/v1/method

I can get "application" using:

([^/api]\w*)

and

([^/web]\w*)

But I know this is not the best approach. What would be a good way?

Thanks!

Edit: Thank you all for the input. The goal was to set the second part of the URI into a header in Apache with rewrite rules.

4 Answers


  1. A general regex (Perl or PCRE syntax) solution would be:

    ^/[^/]+/([^/]+)
    

    Each section is delimited with /, so just capture as many non-/ characters as there are.

    This is preferable to non-greedy regexes because it does not need to backtrack, and it allows for whatever else the sections may contain, which can easily include non-word characters such as - that won't be matched by \w.
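
    As a minimal sketch of how this pattern could be used from Perl (the helper name second_segment and the my-app path are made up for illustration, not part of the question), the capture can be taken in list context so that a failed match yields undef:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the second path segment,
    # or undef when the path has fewer than two segments.
    sub second_segment {
        my ($path) = @_;
        # Skip the first segment, then capture everything up to the next /
        my ($segment) = $path =~ m{^/[^/]+/([^/]+)};
        return $segment;
    }

    # The hyphenated name shows a segment that \w+ alone would cut short.
    print second_segment($_) // '(no match)', "\n"
        for '/api/application/v1/method', '/web/my-app/v1/method', '/api';
    ```

    Because the character class excludes only /, the hyphenated segment is captured whole, while a path with a single segment does not match at all.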

  2. Your pattern ([^/api]\w*) consists of a capturing group containing a negated character class, which first matches exactly one character that is not /, a, p or i.

    After that, a word character is matched 0 or more times. The pattern could, for example, match only a single character that is not listed in the character class.

    What you might do instead is use a capturing group and match \w+:

    ^/(?:api|web)/(\w+)/v1/method
    

    Explanation

    • ^ Start of string
    • (?:api|web) Non capturing group with alternation. Match either api or web
    • (\w+) Capturing group 1, match 1+ word chars
    • /v1/method Match literally as in your example data.
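
    A short Perl sketch of this stricter pattern, assuming the paths always end in /v1/method as in the example data (the helper name app_name is hypothetical):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: capture the application name only when the path
    # starts with /api/ or /web/ and continues with /v1/method.
    sub app_name {
        my ($path) = @_;
        my ($app) = $path =~ m{^/(?:api|web)/(\w+)/v1/method};
        return $app;   # undef when the path does not match
    }

    print app_name($_) // '(no match)', "\n"
        for '/api/application/v1/method', '/other/application/v1/method';
    ```

    The trade-off is that anything outside the two known prefixes, or a name containing non-word characters, is rejected rather than captured.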

  3. There are many ways to do this and I'm not sure which one is best, but it could be as simple as:

    /(.+?)/(.+?)/.*
    

    where the desired output is in the second capturing group, $2.

    Example

    #!/usr/bin/perl -w
    
    use strict;
    use warnings;
    use feature qw( say );
    
    main();   
    
    sub main{    
       my $string = '/api/application/v1/method
    /web/application/v1/method';
       my $pattern = '/(.+?)/(.+?)/.*';
       my $match = replace($pattern, '$2', $string); 
       say $match , " is a match 💚💚💚 ";
    
    }        
    
    sub replace {
       my ($pattern, $replacement, $string) = @_;
       $string =~s/$pattern/$replacement/gee;
    
       return $string;
    }
    

    Output

    application
    application is a match 💚💚💚
    

    Advice

    zdim notes that this is a legitimate approach, with a few caveats:

    (1) There is no need for the trailing .*

    (2) The pattern needs /|$ (not just /) after the second group, in case the path ends without a /. This terminates the non-greedy pattern at the end of the string if there is no /.

    (3) Note though that /ee can be vulnerable (even just to errors), since the second evaluation (e) will run code if the first evaluation results in code, and it may be difficult to ensure that this is always done under full control. More to the point, for this purpose there is no reason to run a substitution: a match and capture is enough.
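
    Taken together, those three points suggest something like the following sketch, which only matches and captures (no substitution, no /ee). The helper name second_part is made up for illustration:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper applying the advice: no trailing .*, an alternation
    # /|$ to end the non-greedy group, and a plain match instead of s///ee.
    sub second_part {
        my ($path) = @_;
        my ($part) = $path =~ m{^/.+?/(.+?)(?:/|$)};
        return $part;   # undef when there is no second part
    }

    print second_part($_) // '(no match)', "\n"
        for '/api/application/v1/method', '/api/application';
    ```

    The second path has no / after the part of interest, which is the case the (?:/|$) alternation is there to handle.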

  4. With all the regex answers, which is what was explicitly asked for, I'd like to bring up other approaches.

    These also parse only a (URI style) path, like the regex ones, and return the second directory.

    • The most basic and efficient one: just split the string on /

      my $dir = ( split /\//, $path )[2];
      

      The split returns '' first (before the first /) thus we need the third element. (Note that we can use an alternate delimiter for the separator pattern, it being regex: split m{/}, $path.)

    • Use appropriate modules, for example URI

      use URI;
      my $dir = ( URI->new($path)->path_segments )[2];
      

      or Mojo::Path

      use Mojo::Path;
      my $dir = Mojo::Path->new($path)->parts->[1];
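
    To see why index 2 is the one to pick in the split approach, here is a small sketch that prints every field split returns for one of the example paths (the helper name path_fields is hypothetical):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split a path on / and return all fields.
    # The leading / produces an empty first field.
    sub path_fields {
        my ($path) = @_;
        return split m{/}, $path;
    }

    my @fields = path_fields('/api/application/v1/method');
    print join(', ', map { "'$_'" } @fields), "\n";
    print "second directory: $fields[2]\n";
    ```

    The empty string in front shifts every directory up by one index, so the second directory sits at index 2.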
      

    What to use depends on the details of what you do: if you have any other work with URLs and the web then you clearly want modules for that; otherwise they may (or may not) be overkill.

    I’ve benchmarked these for a sanity check of what one is paying with modules.

    The split either beats regex by up to 10-15% (the regex using negated character class and the one based on non-greedy .+? come around the same), or is about the same with them. They are faster than Mojo by about 30%, and only URI lags seriously, by a factor of 5 behind Mojo.

    That’s for paths typical for real-life URLs, with a handful of short components. With only two very long strings (10k chars), Mojo::Path (surprisingly for me) is a factor of six ahead of split (!), which is ahead of character-class regex by more than an order of magnitude.

    The negated-character-class regex for such long strings beats the non-greedy (.+?) one by a factor of 3, good to know in its own right.

    In all this the URI and Mojo objects were created once, ahead of time.


    Benchmark code. I’d like to note that the details of these timings are far less important than the structure and quality of code.

    use warnings;
    use strict;
    use feature 'say';
    use URI;
    use Mojo::Path;
    use Benchmark qw(cmpthese);
    
    my $runfor = shift // 3;  #/    
    #my $path = '/' . 'a' x 10_000 . '/' . 'X' x 10_000;
    my $path = q(/api/app/v1/method);    
    my $uri = URI->new($path);
    my $mojo = Mojo::Path->new($path);
    
    sub neg_cc {
        my ($dir) = $path =~ m{ [^/]+ / ([^/]+) }x;      return $dir; #/
    }
    sub non_greedy {
        my ($dir) = $path =~ m{ .+? / (.+?) (?:/|$) }x;  return $dir; #/  
    }
    sub URI_path {
        my $dir = ( $uri->path_segments )[2];            return $dir;
    }
    sub Mojo_path {
        my $dir = $mojo->parts->[1];                     return $dir;
    }
    sub just_split {
        my $dir = ( split /\//, $path )[2];              return $dir;
    }
    
    cmpthese( -$runfor, {
        neg_cc      => sub { neg_cc($path) },
        non_greedy  => sub { non_greedy($path) },
        just_split  => sub { just_split($path) },
        URI_path    => sub { URI_path($path) },  
        Mojo_path   => sub { Mojo_path($path) },  
    }); 
    

    With a (10-second) run this prints, on a laptop with v5.16

                    Rate   URI_path  Mojo_path non_greedy     neg_cc just_split
    URI_path    146731/s         --       -82%       -87%       -87%       -89%
    Mojo_path   834297/s       469%         --       -24%       -28%       -36%
    non_greedy 1098243/s       648%        32%         --        -5%       -16%
    neg_cc     1158137/s       689%        39%         5%         --       -11%
    just_split 1308227/s       792%        57%        19%        13%         --
    

    One should keep in mind that the overhead of the function call is very large for such a simple job, and in spite of Benchmark's work these numbers are probably best taken as a cursory guide.
