Redis - Perl: Faster way to split string by character

JOhnlw009a
November 5, 2020
141 views
2 votes
3 Answers

I am looking for a much faster way to split a string into an array of values. Unfortunately, the string uses ; as its delimiter. split seems to be quite slow for a set of more than 1k strings in an otherwise in-memory based solution.

my (@array) = split(';', $string);

Is there any way to speed up Perl here (some sort of workaround using unpack, etc.)?

Update:
Approx. 1500 @members take 0.6 – 0.8 seconds (measured within the foreach). With a dummy, invalid $ref (without split), it is practically real time. Maybe the $ref/$redis call takes up the time? (I am using RedisDB.)

Some code:

my $ref = $redis->execute("MGET", @members);

# Index the reply from 0; incrementing a counter before the lookup would skip the first value.
foreach my $i (0 .. $#members) {

  my @result = split(';', $ref->[$i]);

  # approx. 30 comparisons/operations like:
  # if($result[0] == 1 && $result[7] == 1) {...}
}
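
To check whether the time goes into split or into the Redis round trip, the two phases can be timed separately. The following is only a minimal sketch: it assumes synthetic data in place of the MGET reply (the `@replies` array and its 90-field shape are made up for illustration) and uses the core Time::HiRes module.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical stand-in for the MGET reply: 1500 strings of 90 fields each.
my @replies = map { join ';', map { int rand 2 } 1 .. 90 } 1 .. 1500;

my $t0   = [gettimeofday];
my $hits = 0;
for my $value (@replies) {
    my @result = split /;/, $value;
    $hits++ if $result[0] == 1 && $result[7] == 1;
}
# With a single argument, tv_interval measures from $t0 until now.
my $elapsed = tv_interval($t0);

printf "split + compare for %d strings: %.4f s (%d hits)\n",
    scalar @replies, $elapsed, $hits;
```

Wrapping the `$redis->execute("MGET", @members)` call in the same `gettimeofday` / `tv_interval` pair would show how much of the 0.6 – 0.8 seconds is spent waiting for Redis.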

Answers

- TimurShtatland
- November 5, 2020 at 3:54 pm
- 0 votes
If you do not need all the fields, use LIMIT, as in: split /PATTERN/,EXPR,LIMIT. For example, this splits into at most 2 fields instead of as many as there are (I also removed the extra parens):
```
my @array = split ';', $string, 2;
```
Related to the above: according to perldoc -f split, one way to make it faster is to split into only as many fields as needed (and to avoid splitting into an array without a LIMIT):
```
        In time-critical applications, it is worthwhile to avoid
        splitting into more fields than necessary. Thus, when assigning
        to a list, if LIMIT is omitted (or zero), then LIMIT is treated
        as though it were one larger than the number of variables in the
        list; for the following, LIMIT is implicitly 3:

            my ($login, $passwd) = split(/:/);
```
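
Applied to the question, where only fields such as `$result[0]` and `$result[7]` are compared, this could look roughly as follows. It is a sketch under the assumption that nothing beyond the eighth field is needed; the sample `$string` is invented for illustration.

```perl
use strict;
use warnings;

# Made-up sample value with more fields than the comparison needs.
my $string = join ';', 1, 0, 0, 0, 0, 0, 0, 1, 'x', 'y', 'z';

# LIMIT 9 yields the first 8 fields plus one unsplit remainder.
my @result = split /;/, $string, 9;

# A list slice picks out just the two fields being compared;
# the LIMIT is given explicitly here.
my ($first, $eighth) = (split /;/, $string, 9)[0, 7];

print "fields: ", scalar(@result), "\n";
print "remainder: $result[8]\n";
print "match\n" if $first == 1 && $eighth == 1;
```

The LIMIT has to be one larger than the highest field index plus one, so that the last wanted field is not glued to the rest of the string.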

You could try implementing it in XS, for example:

#define PERL_NO_GET_CONTEXT
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "ppport.h"
#define MOD_NAME "String::Split::Fast"

MODULE = String::Split::Fast  PACKAGE = String::Split::Fast
PROTOTYPES: DISABLE

AV *
split(str, sep_chr)
    SV *str
    SV *sep_chr
  CODE:
    if (DO_UTF8(str) || DO_UTF8(sep_chr)) {
         croak(MOD_NAME ": UTF-8 not implemented yet..");
    }
    // ...or if you use typemap: AV*   T_AVREF_REFCOUNT_FIXED
    //    You can avoid the call to sv_2mortal(), see: perldoc perlxs
    AV *array = (AV *)sv_2mortal((SV *)newAV());
    STRLEN seplen;
    const char *sep_chrs = SvPV(sep_chr, seplen);
    if (seplen != 1) {
        croak(MOD_NAME ": length of separator != 1");
    }
    const char sep = sep_chrs[0];
    STRLEN len;
    const char *buf = SvPV(str, len);
    const char *start = buf;
    for (STRLEN i=0; i<len; i++) {  /* STRLEN avoids a signed/unsigned comparison with len */
        if (buf[i] == sep) {
            SV *sv = newSVpvn(start, (STRLEN)(buf+i-start));
            av_push(array, sv);
            start = buf+i+1;
        }
        if (i == (len-1)) {
            SV *sv = newSVpvn(start, (STRLEN)(buf+i+1-start));
            av_push(array, sv);
        }
    }
    RETVAL = array;
  OUTPUT:
    RETVAL

I am not sure yet how much faster this will be.
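
For checking the XS routine against known-good output, a pure-Perl version of the same single-character scan can serve as a reference. This is only a sketch; `split_char` is a hypothetical name and not part of any module. Like the XS loop, it returns an empty list for an empty string and keeps trailing empty fields, which corresponds to the built-in split with a negative LIMIT rather than to split without a LIMIT.

```perl
use strict;
use warnings;

# Pure-Perl sketch of the same single-character scan as the XS loop.
# split_char is a hypothetical helper, not an existing API.
sub split_char {
    my ($str, $sep) = @_;
    my @fields;
    return @fields if $str eq '';
    my $start = 0;
    while ((my $pos = index($str, $sep, $start)) >= 0) {
        push @fields, substr($str, $start, $pos - $start);
        $start = $pos + 1;
    }
    push @fields, substr($str, $start);    # last field (may be empty)
    return @fields;
}

my @a = split_char('1;0;;7', ';');
my @b = split /;/, '1;0;;7', -1;    # negative LIMIT keeps empty fields
print "same\n" if "@a" eq "@b";
```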

Using another delimiter is unlikely to help. Below is a benchmark that compares semicolon, blank, and the NULL character as delimiters. The speed is the same within the error of the benchmarking, regardless of the field delimiter used. Your speedup may have to come from the code used outside of the split.

Benchmark:

#!/usr/bin/env perl

use strict;
use warnings;
use Benchmark;

my $num_fields = 90;
my @in_arr = map { rand } 1..$num_fields;

our $str_semicolon = join ';',  @in_arr;
our $str_blank     = join ' ',  @in_arr;
our $str_null      = join "\0", @in_arr;

timethese(1e6, {
    semicolon   => q{ my @out_arr = split ';',  $str_semicolon; },
    blank       => q{ my @out_arr = split ' ',  $str_blank;     },
    null        => q{ my @out_arr = split "\0", $str_null;     },
} );

Results:

Benchmark: timing 1000000 iterations of blank, null, semicolon...
     blank: 18 wallclock secs (17.60 usr +  0.03 sys = 17.63 CPU) @ 56721.50/s (n=1000000)
      null: 17 wallclock secs (17.23 usr +  0.02 sys = 17.25 CPU) @ 57971.01/s (n=1000000)
 semicolon: 17 wallclock secs (16.73 usr +  0.02 sys = 16.75 CPU) @ 59701.49/s (n=1000000)

I ran this on a MacBook Pro with macOS 10.14.6 and perl v5.30.3.
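
The same Benchmark module can also compare a full split against a split with LIMIT on a string of the same shape. This is a sketch only: the sub names `full` and `limited` and the choice of keeping the first 8 fields are assumptions for illustration, and the numbers will vary by machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same shape as the benchmark above: 90 random fields joined by semicolons.
my $str = join ';', map { rand } 1 .. 90;

# Both subs return the first 8 fields; only the amount of splitting differs.
sub full    { my @f = split /;/, $str;    return @f[0 .. 7] }
sub limited { my @f = split /;/, $str, 9; return @f[0 .. 7] }

# A negative count runs each variant for at least that many CPU seconds
# and prints a comparison table.
cmpthese(-1, { full => \&full, limited => \&limited });
```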
