I am aware that you can't save files in a document database such as MongoDB. However, as a workaround, I am trying to convert the zipped folder into bytes, which I can then convert back to zip format when fetching the data from the database later. To process the data and communicate with the database I am using Python. For the sake of testing, I am saving the byte data in a txt file instead of uploading it to the database.
I have tried a lot of different methods but this is my current base code:
# path to zip folder used for testing
zipLink = "/Users/XXX/Desktop/XXX.zip"
# path for the txt file to be created
testTxt = "/Users/XXX/Desktop/test.txt"

# read zip folder in binary mode
with open(zipLink, 'rb') as file_data:
    bytes_content = file_data.read()

# write bytes into txt file as string
# meant to simulate upload to database
with open(testTxt, 'w') as file_data:
    file_data.write(str(bytes_content,'utf-8'))

# read bytes from txt file as string
# meant to simulate fetch from database
with open(testTxt, 'r') as file_data:
    stringData = file_data.read()

# create a new zip folder using the bytes read from the text file
with open("/Users/XXX/Desktop/Remade.zip", 'wb') as file_data:
    file_data.write(bytes(stringData,'utf-8'))
I know the problem is with the encoding, because when I skip the txt file steps and the conversion of the bytes into a string, I am able to recreate the zip folder from the bytes.
The errors I keep getting are like the following:
"UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xab in position 11: invalid start byte"
I have tried different encodings and tried using different python libraries to figure out the encoding of the string read from the txt file. All throw a similar error.
2 Answers
The following works and it uses base64 but I will be looking into GridFS for future cases:
You cannot use UTF-8 encoding for binary files.
In Unicode encodings like UTF-8 or UTF-16, many bit combinations are not allowed. Just for comparison, a 32-bit value has about 4.3 billion possible combinations, whereas Unicode defines only 1,114,112 code points. So the vast majority of all possible bit combinations are not allowed, and arbitrary binary data will almost always contain sequences that are invalid as UTF-8.
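As a small illustration of this, the sketch below decodes a few made-up bytes (not taken from a real zip file) that include 0xab, the byte named in the error message from the question. A byte of that form can only continue a UTF-8 sequence and can never start one, so the decoder rejects it:

```python
# Bytes in the range 0x80-0xBF are continuation bytes in UTF-8.
# They are only valid after a lead byte, never at the start of a character.
sample = b"PK\x03\x04\xab"  # made-up sample; real zip data contains many such bytes

try:
    sample.decode("utf-8")
    decoded_ok = True
    reason = None
except UnicodeDecodeError as exc:
    decoded_ok = False
    reason = exc.reason      # short description of why decoding failed
    position = exc.start     # index of the offending byte

print(decoded_ok, reason)
```

Trying other text encodings does not help in general, because the data was never text in the first place.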
Usually, when you need to convert a binary file into text, you use Base64. Most languages have built-in functions for that; for Python see https://docs.python.org/3/library/base64.html
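A minimal sketch of such a round trip with Python's standard `base64` module might look like the following. The helper names `bytes_to_text` and `text_to_bytes` are hypothetical, and the test data is a stand-in for the content of a zip file:

```python
import base64


def bytes_to_text(data: bytes) -> str:
    """Encode arbitrary bytes as an ASCII-only Base64 string."""
    return base64.b64encode(data).decode("ascii")


def text_to_bytes(text: str) -> bytes:
    """Decode a Base64 string back into the original bytes."""
    return base64.b64decode(text)


# Stand-in for zip content: every possible byte value once.
original = bytes(range(256))

as_text = bytes_to_text(original)   # safe to write to a text file or string field
restored = text_to_bytes(as_text)   # identical to the original bytes
```

In the script from the question, this would presumably mean writing `bytes_to_text(bytes_content)` to the txt file instead of `str(bytes_content,'utf-8')`, and writing `text_to_bytes(stringData)` to the new zip file instead of `bytes(stringData,'utf-8')`.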
As long as the string is no longer than 16 MiB, storing it in MongoDB is no problem. Otherwise you have to use GridFS.
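Note that Base64 output is roughly one third larger than its input, so the raw zip file has to stay below about 12 MiB for the encoded string to fit. A rough check could be sketched as follows; the function names and the `overhead` allowance for other fields in the document are assumptions for illustration, not values defined by MongoDB:

```python
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 MiB limit mentioned above


def base64_length(raw_length: int) -> int:
    """Length of the padded Base64 encoding of raw_length bytes."""
    # Every group of 3 input bytes (rounded up) becomes 4 output characters.
    return 4 * ((raw_length + 2) // 3)


def fits_in_document(raw_length: int, overhead: int = 1024) -> bool:
    """Rough estimate only; overhead is an assumed allowance for other fields."""
    return base64_length(raw_length) + overhead <= MAX_DOCUMENT_BYTES


print(fits_in_document(11 * 1024 * 1024))  # an 11 MiB zip
print(fits_in_document(12 * 1024 * 1024))  # a 12 MiB zip
```

If the check fails, GridFS is the option to look at.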