TL;DR:
Automatic line end handling in perl5 seemingly degrades performance. Is there a possibility to mitigate this without sacrificing convenience by having to add newlines manually instead of using say
or -l
?
Long version:
I repeatedly wondered why my own (subjective) experience of string processing speed in comparison of perl5 (feels fast) vs python so grossly disagrees with the presumably objective performance comparisons on the benchmark game web page (e.g. in the fasta example perl is slow). What I found striking is: All world seems to use "say" for ordinary output, while the strange guy here sticks with "print" and manually adding "n".
So I set up a minimal example:
- Count from 1 to an arbitrary number in the language itself.
- For every step output the number in a new line.
- Repeat the whole process three times
- Take the minimum time of these three runs while stdout is redirected to
/dev/null
The whole thing was done for three different approaches:
- Use perl’s ordinary "print" statement and append newline explicitly every single time
- alternatively call perl with "-l", which enables automatic newline handling (again with print)
- 3rd alternative: use "say" feature, which also handles this automatically
For every method I contrasted output buffering states (enabled vs disabled).
I was stunned, when I saw, that disabling the output buffer is far less detrimental for performance than enabling automatic line end handling — the latter more than doubled the run time.
Records/sec n_total Time Command
7120052.46 10000000 1.40 perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i . "n";};'
4961279.63 10000000 2.02 perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i . "n";};'
2821234.16 10000000 3.54 perl -l -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i;};'
2462226.44 10000000 4.06 perl -l -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i;};'
2898328.96 10000000 3.45 perl -e 'use feature "say"; $|=0; for (my $i=1; $i<=10000000;$i++) {say $i;};'
2452486.23 10000000 4.08 perl -e 'use feature "say"; $|=1; for (my $i=1; $i<=10000000;$i++) {say $i;};'
Question:
While for the time being it seems worthwile to add the line ends manually, this is still somewhat cumbersome: Is there any switch that I have missed here and that could prevent this degradation in output performance when using either "say" or the "-l" switch?
FWIW:
Couldn’t help myself when I was already going down the rabbit hole. So I went on…
To get a more comprehensive picture I finally continued and compared the above to other tools I either use daily (mainly gawk, since recently also a bit of jq) – or did so in the past (python and even earlier tcl) or just dabbled around with at some time (lisp flavours). All on a windows 10 machine running cygwin.
Main criteria now was to choose the number so that it runs within 10 seconds on my mychine and still leaves the run numbers somewhat comparable (only powers of 10 considered).
Also contrasted using line buffer vs flushing output after every single line for some of the other.
The complete story is: disabling the output buffer in perl is less painful than in other languages (in particular python). For some I wouldn’t even think trying it (see below).
Records/sec n_total Time Command
46772945.30 100000000 2.14 -- seq 100000000
7102358.15 10000000 1.41 -- perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i . "x0A";};'
4851147.42 10000000 2.06 -- php -r 'for ($i=1; $i<=10000000;$i++) {echo $i . PHP_EOL;};'
4524956.61 10000000 2.21 -- perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i . "x0A";};'
2643538.99 10000000 3.78 -- perl -l -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i;};'
2542994.96 10000000 3.93 -- awk -e "BEGIN{for (i=1;i<=10000000;i++) print i}"
2537951.85 10000000 3.94 -- perl -e '$|=0; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say $i;};'
2407226.07 10000000 4.15 -- luajit -e 'for i = 1,10000000 do print(i) end'
2347240.89 10000000 4.26 -- perl -e '$|=1; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say $i;};'
2342141.92 10000000 4.27 -- perl -l -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i;};'
1869504.49 10000000 5.35 -- python -c 'for i in range(1,round(10000000)+1): print(i)'
1641012.48 10000000 6.09 -- awk -e "BEGIN{for (i=1;i<=10000000;i++) {print i; fflush()}}"
1591311.72 10000000 6.28 -- jq -n 'range(1;10000000+1)'
1395488.79 10000000 7.17 -- lua -e 'for i = 1,10000000 do print(i) end'
905889.95 1000000 1.10 -- echo 'for {set i 1} {$i <= 1000000} {incr i} {puts $i}' | tclsh
516841.54 1000000 1.93 -- python -uc 'for i in range(1,round(1000000)+1): print(i)'
263998.59 1000000 3.79 -- clisp -x '(loop for x from 1 to 1000000 do (print x))'
139282.21 1000000 7.18 -- echo '(loop for x from 1 to 1000000 do (print x))' | sbcl
used versions: Perl 5.36.3; Python 3.9.16; PHP 7.3.7; GAWK 5.3.0; Lua 5.3.6; Luajit 2.0.4;
jq 1.7.1; clisp 2.49.92; sbcl 2.4.9; tclsh 8.6.12; seq from GNU core utils 9.0
2
Answers
TL;DR:
Simply quoting the numeric variable seems to resolve the issue with
say
and-l
!Long version:
After some comments above I carried out further tests. TLP and jhnc seem to be on to something:
When using
,
instead of.
to combine the variable value and the newline for the print statement, that slows down the print call like using "say" did in the original post.However, when the variable was quoted there was no slowdown. Likewise
say
and-l
could be sped up again be quoting the variable withsay
being the fastest now. See tests below.This behaviour seems to occur because of two different integer stringification methods within the interpreter. The slowdown might be fixed in a future perl release: https://github.com/Perl/perl5/pull/22927