I am aware that you can't save files directly in a document database such as MongoDB. As a workaround, I am trying to convert the zipped folder into bytes, which I can convert back to zip format when fetching the data from the database later. I am using Python to process the data and communicate with the database. For testing, I am saving the byte data in a txt file instead of uploading it to the database.

I have tried a lot of different methods but this is my current base code:

# path to zip folder used for testing 
zipLink = "/Users/XXX/Desktop/XXX.zip"
# path for the txt file to be created 
testTxt = "/Users/XXX/Desktop/test.txt"

# read zip folder in binary mode 
with open(zipLink, 'rb') as file_data:
    bytes_content = file_data.read() 

# write bytes into txt file as string
# meant to simulate upload to database 
with open(testTxt, 'w') as file_data:
    file_data.write(str(bytes_content,'utf-8'))

# read bytes from txt file as string
# meant to simulate fetch from database 
with open(testTxt, 'r') as file_data:
    stringData = file_data.read()

# create a new zip folder using the bytes read from the text file
with open("/Users/XXX/Desktop/Remade.zip", 'wb') as file_data:
    file_data.write(bytes(stringData,'utf-8'))

I know the problem is with the encoding, because when I skip the txt file and the bytes-to-string conversion, I am able to recreate the zip folder from the bytes.

The errors I keep getting are like the following:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 11: invalid start byte"

I have tried different encodings and different Python libraries to detect the encoding of the string read from the txt file. All of them throw a similar error.

2 Answers


  1. Chosen as BEST ANSWER

    The following works. It uses Base64, but I will be looking into GridFS for future cases:

    import base64
    
    # path to zip folder used for testing 
    zipLink = "/Users/XXX/Desktop/XXX.zip"
    # path for the txt file to be created 
    testTxt = "/Users/XXX/Desktop/test.txt"
    
    # Reading zip file as binary and Base64-encoding it; the resulting ASCII text can be stored as a string
    with open(zipLink, 'rb') as file_data:
        bytes_content = file_data.read() 
        base64_encoded_data = base64.b64encode(bytes_content)
        base64_message = base64_encoded_data.decode('utf-8')
    
    # Saving string to text file simulating saving it to database 
    with open(testTxt, 'w') as file_data:
        file_data.write(base64_message)
    
    # Reading string from text file simulating fetching it from database 
    with open(testTxt, 'r') as file_data:
        stringData = file_data.read()
    
    # converting it back to binary format 
    base64_bytes = stringData.encode('utf-8')
    decoded_data = base64.decodebytes(base64_bytes)
    
    # recreating the zip file from binary data 
    with open("/Users/XXX/Desktop/Remade.zip", 'wb') as file_data:
        file_data.write(decoded_data)
    

  2. You cannot use UTF-8 encoding for binary files.

    Encodings such as UTF-8 and UTF-16 do not accept every bit combination. For comparison, a 32-bit value has about 4.3 billion possible states, but Unicode defines only 1,114,112 code points, so the vast majority of possible bit combinations are invalid. A binary file such as a zip archive will almost always contain byte sequences that a UTF-8 decoder rejects, like the 0xab byte in the error from the question.

    When you need to convert a binary file into text, you usually use Base64. Most languages have built-in functions for that, see https://docs.python.org/3/library/base64.html

    As long as the resulting document is no larger than 16 MiB, storing the string in MongoDB is no problem. Otherwise you have to use GridFS.
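
The points above can be put together in a minimal sketch that runs without any files or database. It builds a small zip archive in memory, converts it to Base64 text and back, and does a rough size check against the 16 MiB limit. The helper names (`bytes_to_text`, `text_to_bytes`, `fits_in_one_document`, `make_sample_zip`) and the `overhead` allowance are hypothetical and chosen only for illustration; they are not part of MongoDB or of the accepted answer.

```python
import base64
import io
import zipfile

# MongoDB's documented maximum BSON document size (16 MiB).
MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024


def bytes_to_text(raw: bytes) -> str:
    """Encode arbitrary bytes as Base64 and return the result as an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def text_to_bytes(text: str) -> bytes:
    """Reverse bytes_to_text: decode a Base64 string back into the original bytes."""
    return base64.b64decode(text)


def fits_in_one_document(text: str, overhead: int = 1024) -> bool:
    """Rough check that the string fits into a single document.

    `overhead` is an assumed allowance (in bytes) for the other fields and
    BSON framing; it is a guess for illustration, not an official number.
    """
    return len(text.encode("utf-8")) + overhead <= MAX_BSON_DOCUMENT_BYTES


def make_sample_zip() -> bytes:
    """Build a small zip archive in memory so the sketch needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("hello.txt", "hello world\n" * 100)
    return buffer.getvalue()


if __name__ == "__main__":
    # A lone 0xab byte is a UTF-8 continuation byte, so it cannot start a character.
    try:
        b"PK\x03\x04\xab".decode("utf-8")
    except UnicodeDecodeError as error:
        print("Decoding raw bytes as UTF-8 failed:", error)

    original = make_sample_zip()
    as_text = bytes_to_text(original)      # safe to store as a string
    restored = text_to_bytes(as_text)      # identical to the original bytes

    print("round trip ok:", restored == original)
    print("fits in one document:", fits_in_one_document(as_text))
```

Base64 output is roughly a third larger than the input, so the size check is done on the encoded string rather than on the original zip file.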
