A basic version of Unix paste can be implemented in Python as follows (this example handles only two files; Unix paste accepts any number of files):
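import sys

def paste(fn1, fn2):
    # Print the lines of two files side by side, separated by a tab.
    with open(fn1) as f1, open(fn2) as f2:
        for l1 in f1:
            # readline() returns "" (never None) at end of file,
            # so an exhausted second file yields an empty second column.
            l2 = f2.readline()
            print(l1.rstrip("\n") + "\t" + l2.rstrip("\n"))
        # The second file is longer: emit an empty first column.
        for l2 in f2:
            print("\t" + l2.rstrip("\n"))

if __name__ == "__main__":
    if len(sys.argv) >= 3:
        paste(sys.argv[1], sys.argv[2])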
The task is to implement the same functionality in Node.js. Importantly, because the input files can be huge, the implementation should read them line by line rather than loading an entire file into memory. I want to see how to achieve this with built-in Node functionality, without external packages.
Note that it is easy to implement Unix paste with synchronous I/O, as shown in the Python example, but Node doesn’t provide synchronous I/O for line reading. Meanwhile, there are ways to read a single file line by line with asynchronous I/O, but reading two files jointly is harder because the two streams are not synchronized.
So far the only solution I can think of is to implement synchronous line reading on top of the basic read API. Dave Newton pointed out in a comment that the npm n-readlines package implements this approach in 100+ lines. Because n-readlines inspects each byte to find line endings, I suspected it was inefficient and ran a microbenchmark; the results are shown in the table below. For plain line reading (not for this task), n-readlines on node is about 3.5 times as slow as a Node Readline implementation and roughly 6 to 13 times slower than built-in line reading in Python, Perl or mawk.
What is the proper way to implement Unix paste in Node.js? n-readlines uses synchronous APIs. Would a good async solution be cleaner and faster?
Language | Runtime | Version | Elapsed (s) | User (s) | Sys (s) | Code |
---|---|---|---|---|---|---|
JavaScript | node | 21.5.0 | 6.30 | 5.33 | 0.90 | lc-node.js |
JavaScript | node | 21.5.0 | 22.34 | 20.41 | 2.24 | lc-n-readlines.js |
JavaScript | bun | 1.0.20 | 4.91 | 5.30 | 1.47 | lc-node.js |
JavaScript | bun | 1.0.20 | 21.16 | 19.22 | 3.37 | lc-n-readlines.js |
JavaScript | k8 | 1.0 | 1.49 | 1.06 | 0.37 | lc-k8.js |
C | clang | 15.0.0 | 0.71 | 0.35 | 0.35 | lc-c.c |
python | python | 3.11.17 | 3.48 | 2.85 | 0.62 | lc-python.py |
perl | perl | 5.34.3 | 1.70 | 1.13 | 0.57 | lc-perl.pl |
awk | mawk | 1.3.4 | 2.08 | 1.27 | 0.80 | lc-awk.awk |
awk | apple awk | ? | 90.06 | 87.90 | 1.12 | lc-awk.awk |
2 Answers
I am posting an answer based on n-readlines. It is lengthy and inefficient (see the table in the question) but it solves the problem. I am still looking for a better solution.