The code below reads the user-selected input file in its entirety, which requires a lot of memory for very large (> 10 GB) files. I need to process the file line by line instead.
How can I read a file in Pyodide one line at a time?
<!doctype html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
        // Get the file contents into JS
        const [fileHandle] = await showOpenFilePicker();
        const fileData = await fileHandle.getFile();
        const contents = await fileData.text();
        // Create the Python convert toy function
        let pyodide = await loadPyodide();
        let convert = pyodide.runPython(`
from pyodide.ffi import to_js

def convert(contents):
    return to_js(contents.lower())

convert
        `);
        let result = convert(contents);
        console.log(result);
        const blob = new Blob([result], {type : 'application/text'});
        let url = window.URL.createObjectURL(blob);
        var downloadLink = document.createElement("a");
        downloadLink.href = url;
        downloadLink.text = "Download output";
        downloadLink.download = "out.txt";
        document.body.appendChild(downloadLink);
      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>
The code is from this answer to the question "Select and read a file from user’s filesystem".
Based on the answer by rth, I used the code below. It still has two issues:
- The chunks break some lines into parts. The example input file has 100 characters per line, yet the console log below shows line lengths other than 100 inside some chunks, i.e. the chunks are not split at newlines.
- I cannot get the variable result written into the output file that is offered to the user for download (see below, where for the purposes of the example it is replaced by the dummy string 'result').
<!doctype html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
        // Create the Python convert toy function
        let pyodide = await loadPyodide();
        let convert = pyodide.runPython(`
from pyodide.ffi import to_js

def convert(contents):
    for line in contents.split('\\n'):
        print(len(line))
    return to_js(contents.lower())

convert
        `);
        // Get the file contents into JS
        const bytes_func = pyodide.globals.get('bytes');
        const [fileHandle] = await showOpenFilePicker();
        let fh = await fileHandle.getFile();
        const stream = fh.stream();
        const reader = stream.getReader();
        // Do a loop until end of file
        while( true ) {
          const { done, value } = await reader.read();
          if( done ) { break; }
          handleChunk( value );
        }
        console.log( "all done" );
        function handleChunk( buf ) {
          console.log( "received a new buffer", buf.byteLength );
          let result = convert(bytes_func(buf).decode('utf-8'));
        }
        const blob = new Blob(['result'], {type : 'application/text'});
        let url = window.URL.createObjectURL(blob);
        var downloadLink = document.createElement("a");
        downloadLink.href = url;
        downloadLink.text = "Download output";
        downloadLink.download = "out.txt";
        document.body.appendChild(downloadLink);
      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>
Given this input file with 100 characters per line:
perl -le 'for (1..1e5) { print "0" x 100 }' > test_100x1e5.txt
I am getting the console log output below, which indicates that the chunks are not split at newlines:
received a new buffer 65536
pyodide.asm.js:10              100   (repeated 648 times)
pyodide.asm.js:10              88
read_write_bytes_func.html:41  received a new buffer 2031616
pyodide.asm.js:10              12
pyodide.asm.js:10              100   (repeated 20114 times)
pyodide.asm.js:10              89
read_write_bytes_func.html:41  received a new buffer 2097152
pyodide.asm.js:10              11
pyodide.asm.js:10              100   (repeated 20763 times)
pyodide.asm.js:10              77
read_write_bytes_func.html:41  received a new buffer 2097152
pyodide.asm.js:10              23
pyodide.asm.js:10              100   (repeated 20763 times)
pyodide.asm.js:10              65
read_write_bytes_func.html:41  received a new buffer 2097152
pyodide.asm.js:10              35
pyodide.asm.js:10              100   (repeated 20763 times)
pyodide.asm.js:10              53
read_write_bytes_func.html:41  received a new buffer 1711392
pyodide.asm.js:10              47
pyodide.asm.js:10              100   (repeated 16944 times)
pyodide.asm.js:10              0
read_write_bytes_func.html:37  all done
If I change this:
const blob = new Blob(['result'], {type : 'application/text'});
to this:
const blob = new Blob([result], {type : 'application/text'});
then I get the error:
Uncaught (in promise) ReferenceError: result is not defined
at HTMLButtonElement.main (read_write_bytes_func.html:45:34)
2 Answers
The available memory in this environment is currently limited to 2 GB, so you would not be able to read a 10 GB file into memory in its entirety.
If you can process the file as a stream, line by line, you could try mounting a local folder containing the file with the File System Access API (currently only available in Chrome and Edge).
To mount a local folder in Pyodide, you can do something along the following lines (a sketch; '/mount_dir' is an arbitrary mount point name):
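// (inside an async function, as in main() above)
// Let the user pick a local directory; File System Access API, Chrome/Edge only
const dirHandle = await showDirectoryPicker();
if (await dirHandle.requestPermission({ mode: 'readwrite' }) !== 'granted') {
  throw new Error('read/write access to the directory was not granted');
}
// Mount it into Pyodide's virtual filesystem; '/mount_dir' is an arbitrary mount point
const nativefs = await pyodide.mountNativeFS('/mount_dir', dirHandle);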
Then you can access its contents as normal files from Pyodide, for example:
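// The contents of the selected folder are now visible under /mount_dir
pyodide.runPython(`
import os
print(os.listdir('/mount_dir'))
`);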
You can then open this file path and iterate over its lines as you usually would in Python, for instance (using the example file name from the question):
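// Count lines without ever holding the whole file in memory;
// test_100x1e5.txt is the example input file from the question
pyodide.runPython(`
n_lines = 0
with open('/mount_dir/test_100x1e5.txt') as fh:
    for line in fh:          # iterating a file object yields one line at a time
        n_lines += 1
print(n_lines)
`);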
If you make any changes to this folder from Pyodide, you need to run the following to write them back to disk:
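// nativefs is the object returned by pyodide.mountNativeFS above
await nativefs.syncfs();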
See the Pyodide documentation for more details.
Another solution, if you want to process a single file, is to use the JavaScript streaming API and process each chunk in Python.
A partial solution for a UTF-8 encoded text file could look something like the sketch below, where convert is the same toy lower-casing function as in the question:
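// Toy Python function that processes one chunk of text
const convert = pyodide.runPython(`
from pyodide.ffi import to_js

def convert(chunk):
    return to_js(chunk.lower())

convert
`);
const bytes_func = pyodide.globals.get('bytes');

// Stream the user-selected file chunk by chunk instead of reading it all at once
const [fileHandle] = await showOpenFilePicker();
const file = await fileHandle.getFile();
const reader = file.stream().getReader();

while (true) {
  const { done, value } = await reader.read();   // value is a Uint8Array
  if (done) break;
  // Convert the bytes to a Python str and process the chunk in Python
  const result = convert(bytes_func(value).decode('utf-8'));
  console.log(result.length);
}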
Each chunk contains some number of lines, so you would need to re-split it to get an iterator over lines in Python. I am also not sure whether a chunk can end in the middle of a line or not.
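If a chunk does end in the middle of a line (as the console log in the updated question suggests), one way to handle it (a sketch, not part of the original answer) is to keep the trailing partial line in Python and prepend it to the next chunk:
const feed = pyodide.runPython(`
from pyodide.ffi import to_js

_pending = ''   # partial line carried over from the previous chunk

def feed(chunk):
    global _pending
    lines = (_pending + chunk).split('\\n')
    _pending = lines.pop()   # incomplete last line, or '' if the chunk ended on a newline
    return to_js([line.lower() for line in lines])

feed
`);
// Call feed() with each decoded chunk inside the read loop; it returns only complete lines.
// After the loop, _pending still holds the final line if the file does not end with a newline.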