I inherited a Python 3 project in which we are trying to parse a 70 MB file with Python 3.5.6. I am using cgi.FieldStorage.

The file (named paketti.ipk) I’m trying to send:

kissakissakissa
kissakissakissa
kissakissakissa

Headers:

X-FILE: /tmp/nginx/0000000001
Host: localhost:8082
Connection: close
Content-Length: 21
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------264635460442698726183359332565
Origin: http://172.16.8.12
Referer: http://172.16.8.12/
DNT: 1
Sec-GPC: 1

Temporary file /tmp/nginx/0000000001:

-----------------------------264635460442698726183359332565
Content-Disposition: form-data; name="file"; filename="paketti.ipk"
Content-Type: application/octet-stream

kissakissakissa
kissakissakissa
kissakissakissa

-----------------------------264635460442698726183359332565--

Code:

class S(BaseHTTPRequestHandler):
  def do_POST(self):
    temp_filename = self.headers['X-FILE']
    temp_file_pointer=open(temp_filename,"rb")
    form = cgi.FieldStorage( fp=temp_file_pointer, headers=self.headers, environ={'REQUEST_METHOD':'POST', 'CONTENT_TYPE':self.headers['Content-Type'], 'CONTENT_LENGTH':self.headers['Content-Length'] }, )
    actual_filename = form['file'].filename
    logging.info("ACTUAL FILENAME={}".format(actual_filename))
    open("/tmp/nginx/{}".format(actual_filename), "wb").write(form['file'].file.read())
    logging.info("FORM={}".format(form))

Now the strange part. The logs show:

INFO:root:ACTUAL FILENAME=paketti.ipk
INFO:root:FORM=FieldStorage(None, None, [FieldStorage('file', 'paketti.ipk', b'')])

Look at the /tmp/nginx directory:

root@am335x-evm:/tmp# ls -la /tmp/nginx/*
-rw-------    1 www      www            286 May 18 20:48 /tmp/nginx/0000000001
-rw-r--r--    1 root     root             0 May 18 20:48 /tmp/nginx/paketti.ipk

So it is partially working, because the filename is parsed correctly. But why does it not parse the file contents? What am I missing?

Is this even doable in Python, or should I just write a C utility? The file is 70 MB, and if I read it into memory, the OOM killer kills the python3 process (and rightfully so, I’d say). But where do the file contents go?

2 Answers


  1. Chosen as BEST ANSWER

    There were more issues at play than I first thought.

    First, /tmp was a tmpfs mount with a maximum size of 120 MB.

    Secondly, my nginx.conf was problematic. I needed to comment out directives like these to clean it up:

    #client_body_in_file_only       on
    #proxy_set_header               X-FILE $request_body_file;
    #proxy_set_body                 $request_body_file;
    

    Then I needed to add these:

    proxy_redirect                 off; # Maybe not that important
    proxy_request_buffering        off; # Very important
    

    After this the code

    form = cgi.FieldStorage( fp=self.rfile, headers=self.headers, environ={'REQUEST_METHOD':'POST', 'CONTENT_TYPE':self.headers['Content-Type'], })
    

    started to "work". Monitoring /tmp showed usage climbing first to 70 MB and then to the full 120 MB, and the uploaded file was truncated to 50 MB.

    So even when I read from the parsed cgi.FieldStorage and write the output in a loop of 4096-byte chunks, the whole upload is first spooled automatically to somewhere in /tmp. Writing the final file then needs a second copy, which does not fit, and fails with a "No space left on device" error.

    To fix this, I kept the nginx.conf additions and now read self.rfile myself in a loop, reading exactly Content-Length bytes in total (reading any other amount makes it go bonkers). This saves the upload cleanly in one pass, and /tmp never holds more than a single 70 MB copy.
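
    The manual read loop described above might look roughly like the following. This is a minimal sketch, not the code from the project: copy_exact and its chunk_size parameter are names made up for illustration. It copies exactly as many bytes as Content-Length announces and then stops, because reading further on a connection that is still open can block while waiting for data that never arrives.

```python
import io


def copy_exact(src, dst, length, chunk_size=64 * 1024):
    """Copy exactly `length` bytes from src to dst in bounded chunks.

    Only one chunk is held in memory at a time, so the memory use does
    not depend on the size of the upload.
    """
    remaining = length
    while remaining > 0:
        # Never ask for more than is still expected.
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError(
                "stream ended with {} bytes still expected".format(remaining)
            )
        dst.write(chunk)
        remaining -= len(chunk)
    return length
```

    Inside do_POST this could be called as copy_exact(self.rfile, out_file, int(self.headers['Content-Length'])), with out_file opened in "wb" mode. Note that the bytes saved this way are the raw request body, so the multipart boundary lines and part headers are still in the file unless they are stripped separately.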


  2. Instead of the cgi module, you need a multipart parser that can stream the data rather than read all of it into RAM. AFAIK there is nothing suitable in the standard library, but this module could be of use: https://github.com/defnull/multipart

    Alternatively, a DIY approach along these lines should work:

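    # The delimiter as it appears in the body: b"--" + the boundary from Content-Type
    boundary = b"-----whatever"
    # Begin and end lines (as per your example, I didn't check the RFCs).
    # The first delimiter sits at the very start of the body, so no CRLF precedes it.
    begin = b"%b\r\n" % boundary
    end = b"\r\n%b--\r\n" % boundary
    # Prefer with blocks to open files so that they are also closed properly
    with open(temp_filename, "rb") as f:
      buf = bytearray()
      # Search for the boundary
      while begin not in buf:
        block = f.read(4096)
        if not block: raise ValueError("EOF without boundary begin")
        buf = buf[-1024:] + block  # Keep up to 5 KiB buffered
      # Delete buffer contents until the end of the boundary
      del buf[:buf.find(begin) + len(begin)]
      # Skip the part headers, which end at the first blank line
      while b"\r\n\r\n" not in buf:
        block = f.read(4096)
        if not block: raise ValueError("EOF inside part headers")
        buf += block
      del buf[:buf.find(b"\r\n\r\n") + 4]

      # Copy data to another file (or do what you need to do with it)
      with open("output.dat", "wb") as f2:
        while end not in buf:
          f2.write(buf[:-1024])
          del buf[:-1024]
          # Check the block, not buf: buf may still hold a tail at EOF
          block = f.read(4096)
          if not block: raise ValueError("EOF without boundary end")
          buf += block
        f2.write(buf[:buf.find(end)])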
    

    This assumes that the boundary lines are at most 1024 bytes long. For a fully correct version you could use their actual lengths instead.
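
    A version that uses the actual delimiter length could be sketched as follows. The function names boundary_from_content_type and extract_first_part are invented for this example, and it makes two assumptions that hold for the request shown in the question but not for multipart bodies in general: the body starts directly with the first boundary line, and only the first part is wanted. The tail kept in the buffer is one byte shorter than the delimiter, which is just enough to catch a delimiter split across two reads.

```python
import io
from email.message import Message


def boundary_from_content_type(content_type):
    """Return the boundary parameter of a Content-Type header as bytes."""
    msg = Message()
    msg["Content-Type"] = content_type
    boundary = msg.get_param("boundary")
    if boundary is None:
        raise ValueError("Content-Type header has no boundary parameter")
    return boundary.encode("ascii")


def extract_first_part(src, dst, boundary, chunk_size=4096):
    """Stream the payload of the first multipart part from src to dst.

    Returns the number of payload bytes written.
    """
    delimiter = b"\r\n--" + boundary  # what terminates the payload
    keep = len(delimiter) - 1         # longest possible partial delimiter
    buf = bytearray()

    # Step 1: read the boundary line and the part headers (up to the blank line).
    while True:
        pos = buf.find(b"\r\n\r\n")
        if pos != -1:
            break
        block = src.read(chunk_size)
        if not block:
            raise ValueError("EOF before the end of the part headers")
        buf += block
    if not buf.startswith(b"--" + boundary):
        raise ValueError("body does not start with the boundary line")
    del buf[:pos + 4]

    # Step 2: copy the payload until the delimiter shows up.
    written = 0
    while True:
        pos = buf.find(delimiter)
        if pos != -1:
            dst.write(buf[:pos])
            return written + pos
        if len(buf) > keep:
            # Flush everything except a tail that might start a delimiter.
            written += len(buf) - keep
            dst.write(buf[:-keep])
            del buf[:-keep]
        block = src.read(chunk_size)
        if not block:
            raise ValueError("EOF before the closing boundary")
        buf += block
```

    With the handler from the question, src would be the open temporary file (or self.rfile), dst the output file opened in "wb" mode, and boundary the result of boundary_from_content_type(self.headers['Content-Type']).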
