
The basic Unix paste can be implemented in Python like this (the example works with two files only; the real Unix paste supports multiple files):

def paste(fn1, fn2):
  with open(fn1) as f1:
    with open(fn2) as f2:
      for l1 in f1:
        l2 = f2.readline()
        if l2 != "":  # readline() returns "" at EOF, never None
          print(l1.rstrip("\n") + "\t" + l2.rstrip("\n"))
        else:
          print(l1.rstrip("\n"))
      for l2 in f2:
        print("\t" + l2.rstrip("\n"))

import sys
if __name__ == "__main__":
  if len(sys.argv) >= 3:
    paste(sys.argv[1], sys.argv[2])

The task is to implement the same functionality in Node.js. Importantly, as the input files can be huge, the implementation should read them line by line rather than loading an entire file into memory. I want to achieve this with built-in Node functionality, without external packages.

Note that it is easy to implement Unix paste with synchronous I/O as is shown in the Python example, but Node doesn’t provide synchronous I/O for line reading. Meanwhile, there are ways to read one file by line with asynchronous I/O, but jointly reading two files is harder because the two streams are not synchronized.

So far the only solution I can think of is to implement synchronous line reading on top of the basic read API. Dave Newton pointed out in a comment that the npm n-readlines package implements this approach in 100+ lines. Because n-readlines inspects each byte to find line endings, I suspected it would be inefficient, so I ran a microbenchmark; the results are in the table below. For plain line reading (not this paste task), n-readlines is 3 times as slow as a Node Readline implementation and an order of magnitude slower than the built-in line reading in Python, Perl or mawk.

What is the proper way to implement Unix paste in Node? n-readlines uses synchronous APIs. Would a good async solution be cleaner and faster?

| Language   | Runtime   | Version | Elapsed (s) | User (s) | Sys (s) | Code              |
|------------|-----------|---------|-------------|----------|---------|-------------------|
| JavaScript | node      | 21.5.0  | 6.30        | 5.33     | 0.90    | lc-node.js        |
| JavaScript | node      | 21.5.0  | 22.34       | 20.41    | 2.24    | lc-n-readlines.js |
| JavaScript | bun       | 1.0.20  | 4.91        | 5.30     | 1.47    | lc-node.js        |
| JavaScript | bun       | 1.0.20  | 21.16       | 19.22    | 3.37    | lc-n-readlines.js |
| JavaScript | k8        | 1.0     | 1.49        | 1.06     | 0.37    | lc-k8.js          |
| C          | clang     | 15.0.0  | 0.71        | 0.35     | 0.35    | lc-c.c            |
| Python     | python    | 3.11.17 | 3.48        | 2.85     | 0.62    | lc-python.py      |
| Perl       | perl      | 5.34.3  | 1.70       | 1.13     | 0.57    | lc-perl.pl        |
| Awk        | mawk      | 1.3.4   | 2.08        | 1.27     | 0.80    | lc-awk.awk        |
| Awk        | apple awk | ?       | 90.06       | 87.90    | 1.12    | lc-awk.awk        |

2 Answers


  1. Chosen as BEST ANSWER

    I am posting an answer based on n-readlines. It is lengthy and inefficient (see the table in the question) but it solves the problem. I am still looking for a better solution.

    const fs = require('fs');
    
    // the following implementation is copied from https://github.com/nacholibre/node-readlines (the n-readlines package)
    // Written by Liucw and distributed under the MIT license
    class LineByLine {
        constructor(file, options) {
            options = options || {};
            if (!options.readChunk) options.readChunk = 1024;
            if (!options.newLineCharacter) {
                options.newLineCharacter = 0x0a; //linux line ending
            } else {
                options.newLineCharacter = options.newLineCharacter.charCodeAt(0);
            }
            if (typeof file === 'number') {
                this.fd = file;
            } else {
                this.fd = fs.openSync(file, 'r');
            }
            this.options = options;
            this.newLineCharacter = options.newLineCharacter;
            this.reset();
        }
        _searchInBuffer(buffer, hexNeedle) {
            let found = -1;
            for (let i = 0; i < buffer.length; i++) {
                let b_byte = buffer[i];
                if (b_byte === hexNeedle) {
                    found = i;
                    break;
                }
            }
            return found;
        }
        reset() {
            this.eofReached = false;
            this.linesCache = [];
            this.fdPosition = 0;
        }
        close() {
            fs.closeSync(this.fd);
            this.fd = null;
        }
        _extractLines(buffer) {
            let line;
            const lines = [];
            let bufferPosition = 0;
            let lastNewLineBufferPosition = 0;
            while (true) {
                let bufferPositionValue = buffer[bufferPosition++];
                if (bufferPositionValue === this.newLineCharacter) {
                    line = buffer.slice(lastNewLineBufferPosition, bufferPosition);
                    lines.push(line);
                    lastNewLineBufferPosition = bufferPosition;
                } else if (bufferPositionValue === undefined) {
                    break;
                }
            }
            let leftovers = buffer.slice(lastNewLineBufferPosition, bufferPosition);
            if (leftovers.length) {
                lines.push(leftovers);
            }
            return lines;
        }
        _readChunk(lineLeftovers) {
            let totalBytesRead = 0;
            let bytesRead;
            const buffers = [];
            do {
                const readBuffer = Buffer.alloc(this.options.readChunk); // new Buffer() is deprecated
                bytesRead = fs.readSync(this.fd, readBuffer, 0, this.options.readChunk, this.fdPosition);
                totalBytesRead = totalBytesRead + bytesRead;
                this.fdPosition = this.fdPosition + bytesRead;
                buffers.push(readBuffer);
            } while (bytesRead && this._searchInBuffer(buffers[buffers.length-1], this.options.newLineCharacter) === -1);
            let bufferData = Buffer.concat(buffers);
            if (bytesRead < this.options.readChunk) {
                this.eofReached = true;
                bufferData = bufferData.slice(0, totalBytesRead);
            }
            if (totalBytesRead) {
                this.linesCache = this._extractLines(bufferData);
                if (lineLeftovers) {
                    this.linesCache[0] = Buffer.concat([lineLeftovers, this.linesCache[0]]);
                }
            }
            return totalBytesRead;
        }
        next() {
            if (!this.fd) return false;
            let line = false;
            if (this.eofReached && this.linesCache.length === 0) {
                return line;
            }
            let bytesRead;
            if (!this.linesCache.length) {
                bytesRead = this._readChunk();
            }
            if (this.linesCache.length) {
                line = this.linesCache.shift();
                const lastLineCharacter = line[line.length-1];
                if (lastLineCharacter !== this.newLineCharacter) {
                    bytesRead = this._readChunk(line);
                    if (bytesRead) {
                        line = this.linesCache.shift();
                    }
                }
            }
            if (this.eofReached && this.linesCache.length === 0) {
                this.close();
            }
            if (line && line[line.length-1] === this.newLineCharacter) {
                line = line.slice(0, line.length-1);
            }
            return line;
        }
    }
    
    function main(args) {
        if (args.length < 2) {
            console.log("Usage: node lc-n-readlines.js <in1.txt> <in2.txt>");
            return;
        }
        const f1 = new LineByLine(args[0]);
        const f2 = new LineByLine(args[1]);
        let l1, l2;
        while ((l1 = f1.next()) !== false) {
            if ((l2 = f2.next()) !== false) {
                console.log(`${l1}\t${l2}`);
            } else {
                console.log(l1);
            }
        }
        while ((l2 = f2.next()) !== false)
            console.log(`\t${l2}`);
    }
    
    main(process.argv.slice(2));
    

  2. import { open as fsOpenAsync } from 'node:fs/promises'
    import { createWriteStream } from 'node:fs'
    
    const filenames = ['a.txt', 'b.txt', 'c.txt']
    const outname = 'out.txt'
    
    await paste(filenames, outname)
    
    /**
     * Read multiple files line by line and write lines joined by tabs
     * @param {string[]} from
     * @param {string} to
     */
    async function paste(from, to) {
      const files = await Promise.all(from.map(fn => fsOpenAsync(fn)))
      const zip = zipAsyncs(files.map(f => f.readLines()[Symbol.asyncIterator]()))
      const writeStream = createWriteStream(to, { flags: 'w' })
      for await (const lines of zip)
        writeStream.write(`${lines.map(e => e ?? '').join('\t')}\n`)
      writeStream.close()
      await Promise.all(files.map(f => f.close()))
    }
    
    /**
     * Zip multiple async iterators, yielding `undefined` for exhausted ones
     * @template T
     * @param {AsyncIterator<T>[]} its
     * @returns {AsyncGenerator<(T | undefined)[]>}
     */
    async function* zipAsyncs(its) {
      while (true) {
        const results = await Promise.all(its.map(e => e.next()))
        if (results.every(r => r.done))
          return
        yield results.map(r => r.value)
      }
    }
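The zipping behaviour is easy to check in isolation with plain async generators (a self-contained sketch; `fromArray` and `demo` are illustrative names, and this variant of `zipAsyncs` stops as soon as every input is done):

```javascript
// Once an async generator is exhausted, next() keeps returning
// { done: true, value: undefined }, so shorter inputs show up as
// undefined until every input is done.
async function* zipAsyncs(its) {
  while (true) {
    const results = await Promise.all(its.map(e => e.next()))
    if (results.every(r => r.done)) return
    yield results.map(r => r.value)
  }
}

async function* fromArray(arr) { yield* arr }

async function demo() {
  const rows = []
  const zip = zipAsyncs([fromArray(['a1', 'a2', 'a3']), fromArray(['b1'])])
  for await (const lines of zip)
    rows.push(lines.map(e => e ?? '').join('\t'))
  return rows // ['a1\tb1', 'a2\t', 'a3\t']
}
```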
    