
The code below reads the entire user-selected input file into memory, which is impractical for very large (> 10 GB) files. I need to read the file line by line instead.

How can I read a file in Pyodide one line at a time?


<!doctype html>
<html>
  <head>
      <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
        // Get the file contents into JS
        const [fileHandle] = await showOpenFilePicker();
        const fileData = await fileHandle.getFile();
        const contents = await fileData.text();

        // Create the Python convert toy function
        let pyodide = await loadPyodide();
        let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    return to_js(contents.lower())
convert
      `);

        let result = convert(contents);
        console.log(result);

        const blob = new Blob([result], {type : 'application/text'});

        let url = window.URL.createObjectURL(blob);

        var downloadLink = document.createElement("a");
        downloadLink.href = url;
        downloadLink.text = "Download output";
        downloadLink.download = "out.txt";
        document.body.appendChild(downloadLink);

      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>

The code is from this answer to the question "Select and read a file from user’s filesystem".


Based on the answer by rth, I used the code below. It still has two issues:

  • The chunks split some lines into parts. The example input file has 100 characters per line, but the console log (below) shows shorter lines at the start and end of each chunk, so the chunk boundaries do not fall on newlines.
  • I cannot get the variable result written into the output file that is offered to the user for download (in the code below it is replaced, for example purposes, by the dummy string 'result').
<!doctype html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
          
          // Create the Python convert toy function
          let pyodide = await loadPyodide();
          let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    for line in contents.split('\n'):
        print(len(line))
    return to_js(contents.lower())
convert
      `);
          
          // Get the file contents into JS
          const bytes_func = pyodide.globals.get('bytes');                                               
          
          const [fileHandle] = await showOpenFilePicker();  
          let fh = await fileHandle.getFile()  
          const stream = fh.stream();  
          const reader = stream.getReader();
          // Do a loop until end of file


          while( true ) {
              const { done, value } = await reader.read();
              if( done ) { break; }
              handleChunk( value );
          }
          console.log( "all done" );


          function handleChunk( buf ) {
              console.log( "received a new buffer", buf.byteLength );
              let result = convert(bytes_func(buf).decode('utf-8'));
          }
          
          const blob = new Blob(['result'], {type : 'application/text'});
          
          let url = window.URL.createObjectURL(blob);
          
          var downloadLink = document.createElement("a");
          downloadLink.href = url;
          downloadLink.text = "Download output";
          downloadLink.download = "out.txt";
          document.body.appendChild(downloadLink);
          
      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>

Given this input file with 100 characters per line:

perl -le 'for (1..1e5) { print "0" x 100 }' > test_100x1e5.txt

I get this console log output, which shows that lines are split at positions other than the newline:

received a new buffer 65536
(×648) pyodide.asm.js:10 100
pyodide.asm.js:10 88
read_write_bytes_func.html:41 received a new buffer 2031616
pyodide.asm.js:10 12
(×20114) pyodide.asm.js:10 100
pyodide.asm.js:10 89
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 11
(×20763) pyodide.asm.js:10 100
pyodide.asm.js:10 77
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 23
(×20763) pyodide.asm.js:10 100
pyodide.asm.js:10 65
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 35
(×20763) pyodide.asm.js:10 100
pyodide.asm.js:10 53
read_write_bytes_func.html:41 received a new buffer 1711392
pyodide.asm.js:10 47
(×16944) pyodide.asm.js:10 100
pyodide.asm.js:10 0
read_write_bytes_func.html:37 all done

If I change from this:

const blob = new Blob(['result'], {type : 'application/text'});

to that:

const blob = new Blob([result], {type : 'application/text'});

then I get the error:

Uncaught (in promise) ReferenceError: result is not defined
    at HTMLButtonElement.main (read_write_bytes_func.html:45:34)

2 Answers


  1. The memory available to Pyodide is currently limited to 2 GB, so you would not be able to read a 10 GB file into memory in its entirety.

    If you can process the file as a stream, line by line, you could try mounting the local folder that contains the file, using the File System Access API (currently only available in Chrome and Edge).

    To mount a local folder in Pyodide:

    const dirHandle = await showDirectoryPicker();
    
    if ((await dirHandle.queryPermission({ mode: "readwrite" })) !== "granted") {
      if (
        (await dirHandle.requestPermission({ mode: "readwrite" })) !== "granted"
      ) {
        throw Error("Unable to read and write directory");
      }
    }
    
    const nativefs = await pyodide.mountNativeFS("/mount_dir", dirHandle);
    

    You can then access it like a normal directory from Pyodide:

    pyodide.runPython(`
      import os
      print(os.listdir('/mount_dir'))
    `);
    

    You can then open a file under this path and iterate over its lines as you usually would in Python.
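As a minimal sketch of that line-by-line iteration, the Python below lower-cases a file one line at a time, mirroring the toy convert function from the question. The function name convert_file and the file names in the usage note are hypothetical, and the sketch assumes the folder has already been mounted at /mount_dir as shown above.

```python
def convert_file(in_path, out_path, encoding="utf-8"):
    """Lower-case a text file line by line without loading it all into memory.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    n_lines = 0
    # newline="" keeps the original line endings untouched on both sides.
    with open(in_path, "r", encoding=encoding, newline="") as src, \
         open(out_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # the file object yields one line at a time
            dst.write(line.lower())
            n_lines += 1
    return n_lines
```

Inside Pyodide this could be called as convert_file("/mount_dir/input.txt", "/mount_dir/out.txt"), where both file names are placeholders. Because the output is written into the mounted folder, the syncfs call described next would presumably be needed before the result shows up on disk.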

    If you make any changes to this folder, you need to run:

    await nativefs.syncfs();
    

    See the documentation for more details.

  2. Another solution, if you want to process a single file, is to use the JavaScript streaming API and process each chunk in Python.

    A partial solution for a UTF-8 encoded text file could look something like this:

    const bytes_func = pyodide.globals.get('bytes');                                               
                                                           
    const [fileHandle] = await showOpenFilePicker();  
    let fh = await fileHandle.getFile()  
    const stream = fh.stream();  
    const reader = stream.getReader();
    // Loop until end of file (only a single read is shown here)
    const { done, value } = await reader.read()
    if (!done) {
      // process a single chunk
      let chunk = bytes_func(value).decode('utf-8')
      // chunk now holds the decoded text of this chunk
    }
    

    Each chunk contains some number of lines, so you would need to re-split it to get an iterator over lines in Python. I’m not sure whether a chunk can break in the middle of a line.
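The console log in the question suggests that chunks do end in the middle of a line. One way to handle that, sketched here under assumptions, is to carry the unfinished tail of each chunk over to the next one on the Python side. The class name ChunkLineSplitter and its methods are hypothetical, and the sketch assumes each chunk reaches Python as a bytes object (for example via bytes_func as above). An incremental decoder is used so that a multi-byte UTF-8 character cut by a chunk boundary is not decoded prematurely.

```python
import codecs


class ChunkLineSplitter:
    """Reassemble complete lines from byte chunks split at arbitrary positions."""

    def __init__(self, encoding="utf-8"):
        # The incremental decoder holds back an incomplete multi-byte character
        # until the rest of its bytes arrive with the next chunk.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._tail = ""  # unfinished line carried over from the previous chunk

    def feed(self, chunk):
        """Return the complete lines (newline stripped) available after this chunk."""
        text = self._tail + self._decoder.decode(bytes(chunk))
        lines = text.split("\n")
        # The last piece has no newline yet, so keep it for the next call.
        self._tail = lines.pop()
        return lines

    def flush(self):
        """Return the final unterminated line, if any, at end of file."""
        text = self._tail + self._decoder.decode(b"", final=True)
        self._tail = ""
        return [text] if text else []
```

In the reading loop, feed would be called once per chunk and flush once after the loop finishes, so every line is seen exactly once regardless of where the stream happens to cut it.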
