I inherited a Python 3 project where we are trying to parse a 70 MB file with Python 3.5.6. I am using cgi.FieldStorage.
File (named: paketti.ipk) I’m trying to send:
kissakissakissa
kissakissakissa
kissakissakissa
Headers:
X-FILE: /tmp/nginx/0000000001
Host: localhost:8082
Connection: close
Content-Length: 21
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------264635460442698726183359332565
Origin: http://172.16.8.12
Referer: http://172.16.8.12/
DNT: 1
Sec-GPC: 1
Temporary file /tmp/nginx/0000000001:
-----------------------------264635460442698726183359332565
Content-Disposition: form-data; name="file"; filename="paketti.ipk"
Content-Type: application/octet-stream

kissakissakissa
kissakissakissa
kissakissakissa
-----------------------------264635460442698726183359332565--
Code:
import cgi
import logging
from http.server import BaseHTTPRequestHandler

class S(BaseHTTPRequestHandler):
    def do_POST(self):
        temp_filename = self.headers['X-FILE']
        temp_file_pointer = open(temp_filename, "rb")
        form = cgi.FieldStorage(
            fp=temp_file_pointer, headers=self.headers,
            environ={'REQUEST_METHOD': 'POST',
                     'CONTENT_TYPE': self.headers['Content-Type'],
                     'CONTENT_LENGTH': self.headers['Content-Length']},
        )
        actual_filename = form['file'].filename
        logging.info("ACTUAL FILENAME={}".format(actual_filename))
        open("/tmp/nginx/{}".format(actual_filename), "wb").write(form['file'].file.read())
        logging.info("FORM={}".format(form))
Now the strange part. The logs show:
INFO:root:ACTUAL FILENAME=paketti.ipk
INFO:root:FORM=FieldStorage(None, None, [FieldStorage('file', 'paketti.ipk', b'')])
Look at the /tmp/nginx directory:
root@am335x-evm:/tmp# ls -la /tmp/nginx/*
-rw------- 1 www www 286 May 18 20:48 /tmp/nginx/0000000001
-rw-r--r-- 1 root root 0 May 18 20:48 /tmp/nginx/paketti.ipk
So it is partially working, because the filename is extracted. But why doesn't it parse the file contents? What am I missing?
Is this even doable in Python, or should I just write a C utility? The file is 70 MB, and if I read it into memory, the OOM killer kills the python3 process (and rightfully so, I’d say). But yeah, where do the file contents go?
2 Answers
There were more issues at play than I first thought.
First, /tmp was mounted on a tmpfs with a maximum size of 120 MB.
Secondly, my nginx.conf was problematic. I needed to comment out stuff like this to clean it up:
Then I needed to add these
After this the code
started to "work". I'm monitoring /tmp usage: it first goes up to 70 MB and then to the full 120 MB. The uploaded file is truncated to 50 MB.
So, when I read and write the parsed cgi.FieldStorage, even in a loop of 4096 bytes at a time, the whole upload is first spooled automatically to somewhere in /tmp. Writing the final file to /tmp as well then fails with a "No space left on device" error.
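Since the upload ends up on /tmp twice in this setup, it can help to check the free space before writing the second copy. A minimal sketch, assuming Python 3.3+ (for shutil.disk_usage); has_room is a hypothetical helper name, not part of the project:

```python
import shutil

def has_room(path, needed_bytes):
    """Return True if the filesystem holding `path` has at least
    `needed_bytes` bytes free (hypothetical helper for illustration)."""
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, calling has_room("/tmp", 70 * 1024 * 1024) before saving would let the handler reject the upload early instead of producing a truncated file.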
To fix this, I keep the nginx.conf additions and just read self.rfile manually in a loop, reading exactly Content-Length bytes in total (anything else makes it go bonkers). This saves the file cleanly in one pass; /tmp usage never goes beyond a single 70 MB copy.
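The manual read loop described here could look roughly like the following. This is a minimal sketch, not the exact code used; copy_exact and the 64 KiB chunk size are assumptions for illustration. Note that it copies the bytes as-is and does no multipart parsing.

```python
import io

def copy_exact(src, dst, length, chunk_size=64 * 1024):
    """Copy exactly `length` bytes from src to dst in bounded chunks.

    Returns the number of bytes actually copied, which is smaller than
    `length` only if src ran out of data early.
    """
    remaining = length
    while remaining > 0:
        # Never ask for more than what is left of the declared length.
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:  # source ended before `length` bytes arrived
            break
        dst.write(chunk)
        remaining -= len(chunk)
    return length - remaining
```

Inside do_POST this would be called as copy_exact(self.rfile, out_file, int(self.headers['Content-Length'])). Capping each read at the remaining byte count matters because a plain self.rfile.read() with no size waits for the connection to close, which is a likely reason for the misbehavior when reading anything other than Content-Length bytes.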
Instead of the cgi module, you need a multipart parser that can stream the data instead of reading all of it into RAM. AFAIK there is nothing useful in the standard library, but this module could be of use: https://github.com/defnull/multipart
Alternatively, DIY something along these lines should work:
This assumes that the boundaries are at most 1024 bytes long. You could use the actual lengths instead for perfection.
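For illustration, here is a minimal sketch of one way a streaming DIY extractor might be written. It is not the original snippet: extract_first_part is a hypothetical name, it handles only the first part of the body, and it assumes src is a regular file opened in binary mode (such as the nginx temp file) rather than a socket. It holds back a tail sized from the actual delimiter length instead of a fixed 1024 bytes.

```python
import io

def extract_first_part(src, dst, boundary, chunk_size=64 * 1024):
    """Stream the payload of the first multipart part from src to dst.

    src      -- binary file positioned at the start of the multipart body
    boundary -- boundary bytes from the Content-Type header (no leading "--")
    Returns the raw part headers as a list of bytes lines.
    """
    delimiter = b"\r\n--" + boundary
    # Skip any preamble up to and including the opening boundary line.
    while True:
        line = src.readline()
        if not line:
            raise ValueError("opening boundary not found")
        if line.rstrip(b"\r\n") == b"--" + boundary:
            break
    # Part headers end at the first empty line.
    headers = []
    while True:
        line = src.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        headers.append(line.rstrip(b"\r\n"))
    # Stream the payload, holding back a tail so that a delimiter
    # split across two reads is still detected.
    keep = len(delimiter) - 1
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        buf += chunk
        pos = buf.find(delimiter)
        if pos != -1:
            dst.write(buf[:pos])
            return headers
        if not chunk:
            raise ValueError("closing boundary not found")
        if len(buf) > keep:
            dst.write(buf[:-keep])
            buf = buf[-keep:]
```

Memory use stays at roughly one chunk plus the held-back tail, regardless of upload size. The boundary value can be taken from the Content-Type header, for example with cgi.parse_header on Python versions that still ship the cgi module, and then encoded to bytes.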