
I am looking for a much faster way to split a string into an array of values. Unfortunately, the string uses ; as its delimiter. split seems to be quite slow for a set of more than 1k strings, even though everything is held in memory.

my (@array) = split(';', $string);

Is there any way to speed this up in Perl (some sort of workaround using unpack, etc.)?

Update:
Approx. 1500 @members take 0.6 – 0.8 seconds (measured within the foreach). With a dummy, invalid $ref (and no split), it is practically instantaneous. Maybe fetching $ref from $redis takes up the time? (I am using RedisDB.)

Some code:

my $ref = $redis->execute("MGET", @members);    

# MGET returns the values in the same order as the keys in @members
foreach my $i (0 .. $#members) {

  my @result = split(';', $ref->[$i]);
  
  # approx. 30 comparisons/operations like:
  # if($result[0] == 1 && $result[7] == 1) {...}
}

3 Answers


  1. If you do not need all the fields, use LIMIT, as in: split /PATTERN/,EXPR,LIMIT. For example, this splits into 2 fields instead of as many as there are (I also removed the extra parens):

    my @array = split ';', $string, 2;
    

    Related to the above: according to perldoc -f split, one way to make it faster is to split into only as many fields as needed (and to avoid splitting into an array without a LIMIT):

            In time-critical applications, it is worthwhile to avoid
            splitting into more fields than necessary. Thus, when assigning
            to a list, if LIMIT is omitted (or zero), then LIMIT is treated
            as though it were one larger than the number of variables in the
            list; for the following, LIMIT is implicitly 3:
    
                my ($login, $passwd) = split(/:/);
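
    Applied to the loop in the question, which only reads fields 0 and 7, a minimal sketch could look like the following. The sample string is made up, and the LIMIT of 9 assumes that no field beyond index 7 is needed; the ninth element then simply holds the unsplit remainder.

    ```perl
    use strict;
    use warnings;

    # Made-up sample record; the real data would come from Redis.
    my $string = '1;a;b;c;d;e;f;1;x;y;z';

    # LIMIT 9: fields 0..7 are split out, element 8 keeps the rest unsplit.
    my @result = split /;/, $string, 9;
    print "match\n" if $result[0] == 1 && $result[7] == 1;

    # Alternatively, pull out just the two fields with a list slice.
    my ($first, $eighth) = (split /;/, $string, 9)[0, 7];
    ```

    Whether this helps noticeably depends on how many fields each string has beyond the last one you actually use.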
    
  2. You could try implementing it in XS, for example:

    #define PERL_NO_GET_CONTEXT
    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"
    #include "ppport.h"
    #define MOD_NAME "String::Split::Fast"
    
    MODULE = String::Split::Fast  PACKAGE = String::Split::Fast
    PROTOTYPES: DISABLE
    
    AV *
    split(str, sep_chr)
        SV *str
        SV *sep_chr
      CODE:
        if (DO_UTF8(str) || DO_UTF8(sep_chr)) {
             croak(MOD_NAME ": UTF-8 not implemented yet..");
        }
        // ...or if you use typemap: AV*   T_AVREF_REFCOUNT_FIXED
        //    You can avoid the call to sv_2mortal(), see: perldoc perlxs
        AV *array = (AV *)sv_2mortal((SV *)newAV());
        STRLEN seplen;
        const char *sep_chrs = SvPV(sep_chr, seplen);
        if (seplen != 1) {
            croak(MOD_NAME ": length of separator != 1");
        }
        const char sep = sep_chrs[0];
        STRLEN len;
        const char *buf = SvPV(str, len);
        const char *start = buf;
        for (STRLEN i = 0; i < len; i++) {  /* STRLEN avoids a signed/unsigned comparison */
            if (buf[i] == sep) {
                SV *sv = newSVpvn(start, (STRLEN)(buf+i-start));
                av_push(array, sv);
                start = buf+i+1;
            }
            if (i == (len-1)) {
                SV *sv = newSVpvn(start, (STRLEN)(buf+i+1-start));
                av_push(array, sv);
            }
        }
        RETVAL = array;
      OUTPUT:
        RETVAL
    

    I am not sure yet how much faster this will be.
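
    One thing to watch when swapping this in: the loop above keeps trailing empty fields, whereas the built-in split drops them unless LIMIT is negative. The following pure-Perl sketch mirrors the XS loop so the behaviour can be checked without compiling anything. The helper name split_char_ref is hypothetical, and it is meant for comparing results, not for speed.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: pure-Perl mirror of the XS loop (one-character separator).
    sub split_char_ref {
        my ($str, $sep) = @_;
        die "length of separator != 1" unless length($sep) == 1;
        my @fields;
        return \@fields if $str eq '';    # the XS loop never runs for an empty string
        my $start = 0;
        while ((my $pos = index($str, $sep, $start)) >= 0) {
            push @fields, substr($str, $start, $pos - $start);
            $start = $pos + 1;
        }
        push @fields, substr($str, $start);    # final field, possibly empty
        return \@fields;
    }

    # Compare against the built-in split with a negative LIMIT.
    for my $s ('1;0;foo', 'a;b;', ';a', ';', '') {
        my $ref     = split_char_ref($s, ';');
        my @builtin = split /;/, $s, -1;
        my $same    = @$ref == @builtin
                   && join('|', @$ref) eq join('|', @builtin);
        printf "%-10s %s\n", "'$s'", $same ? 'same' : 'DIFFERENT';
    }
    ```

    If the comparisons in your loop rely on trailing empty fields being dropped, as a plain split ';', $string does, the XS version would need the same trimming.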

  3. Using another delimiter is unlikely to help. The benchmark below compares a semicolon, a space, and the NUL character as delimiters. The speed is the same within benchmarking error, regardless of the field delimiter used. Your speedup will probably have to come from the code outside of the split.

    Benchmark:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use Benchmark;
    
    my $num_fields = 90;
    my @in_arr = map { rand } 1..$num_fields;
    
    our $str_semicolon = join ';',  @in_arr;
    our $str_blank     = join ' ',  @in_arr;
    our $str_null      = join "\0", @in_arr;
    
    timethese(1e6, {
        semicolon   => q{ my @out_arr = split ';',  $str_semicolon; },
        blank       => q{ my @out_arr = split ' ',  $str_blank;     },
        null        => q{ my @out_arr = split "\0", $str_null;      },
    } );
    

    Results:

    Benchmark: timing 1000000 iterations of blank, null, semicolon...
         blank: 18 wallclock secs (17.60 usr +  0.03 sys = 17.63 CPU) @ 56721.50/s (n=1000000)
          null: 17 wallclock secs (17.23 usr +  0.02 sys = 17.25 CPU) @ 57971.01/s (n=1000000)
     semicolon: 17 wallclock secs (16.73 usr +  0.02 sys = 16.75 CPU) @ 59701.49/s (n=1000000)
    
    

    I ran this on a MacBook Pro with macOS 10.14.6 and perl v5.30.3.
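
    At roughly 57,000 splits per second for 90-field strings, 1,500 splits would account for only about 0.03 seconds of the 0.6 to 0.8 seconds reported in the question. To check this on your own machine, here is a rough timing sketch using the core Time::HiRes module. The data is synthetic (1,500 strings of 90 fields, each 0 or 1), so treat it as a stand-in for the real MGET reply rather than a measurement of it.

    ```perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Synthetic stand-in for the MGET reply: 1500 strings of 90 ';'-separated fields.
    my @values = map { join ';', map { int rand 2 } 1 .. 90 } 1 .. 1500;

    my $t0   = [gettimeofday];
    my $hits = 0;
    for my $value (@values) {
        my @result = split /;/, $value;
        # One comparison of the kind mentioned in the question.
        $hits++ if $result[0] == 1 && $result[7] == 1;
    }
    my $elapsed = tv_interval($t0);

    printf "split + compare over %d strings: %.4f s (%d hits)\n",
        scalar @values, $elapsed, $hits;
    ```

    Wrapping the same gettimeofday / tv_interval pair around the $redis->execute("MGET", @members) call would show how much of the total is spent fetching the data.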
