I found some code online that I am trying to work through; it decodes base64. I know Python has base64.urlsafe_b64decode(), but I would like to learn a bit more about what is going on.
The JS atob looks like:
function atob (input) {
  var chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=';
  var str = String(input).replace(/=+$/, '');
  if (str.length % 4 == 1) {
    throw new InvalidCharacterError("'atob' failed: The string to be decoded is not correctly encoded.");
  }
  for (
    // initialize result and counters
    var bc = 0, bs, buffer, idx = 0, output = '';
    // get next character
    buffer = str.charAt(idx++);
    // character found in table? initialize bit storage and add its ascii value;
    ~buffer && (bs = bc % 4 ? bs * 64 + buffer : buffer,
      // and if not first of each 4 characters,
      // convert the first 8 bits to one ascii character
      bc++ % 4) ? output += String.fromCharCode(255 & bs >> (-2 * bc & 6)) : 0
  ) {
    // try to find character in table (0-63, not found => -1)
    buffer = chars.indexOf(buffer);
  }
  return output;
}
My goal is to port this to Python, but I am trying to understand what the for loop is doing in JavaScript.
It checks whether the value is in the chars table and then initializes some variables using a ternary like: bs = bc % 4 ? bs * 64 + buffer : buffer, bc++ % 4
I am not quite sure I understand what the buffer, bc++ % 4 part of the ternary is doing. The comma confuses me a bit. Plus, String.fromCharCode(255 & (bs >> (-2 * bc & 6))) is a bit esoteric to me.
I’ve been trying something like this in Python, which produces some results, although they differ from what the JavaScript implementation produces:
import re

# Test subject
b64_str: str = "fwHzODWqgMH+NjBq02yeyQ=="
# Lookup table for characters
chars: str = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
# Replace right padding with empty string
replaced = re.sub("=+$", '', b64_str)
if len(replaced) % 4 == 1:
    raise ValueError("atob failed. The string to be decoded is not valid base64")
# Bit storage and counters
bc = 0
out: str = ''
for i in replaced:
    # Get ascii value of character
    buffer = ord(i)
    # If counter is evenly divisible by 4, return buffer as is, else add the ascii value
    bs = bc * 64 + buffer if bc % 4 else buffer
    bc += 1 % 4  # Not sure I understand this part
    # Check if character is in the chars table
    if i in chars:
        # Check if the bit storage and bit counter are non-zero
        if bs and bc:
            # If so, convert the first 8 bits to an ascii character
            out += chr(255 & bs >> (-2 * bc & 6))
        else:
            out = 0
    # Set buffer to the index of where the first instance of the character is in the b64 string
    print(f"before: {chr(buffer)}")
    buffer = chars.index(chr(buffer))
    print(f"after: {buffer}")
print(out)
JS gives ó85ªÁþ60jÓlÉ
Python gives 2:u1(²ë:ð1G>%Y
2 Answers
The first step would be to determine whether either implementation works correctly; RFC 4648 contains test vectors for that purpose.
If one implementation works correctly, you should determine what is causing the difference; otherwise, you might attempt to implement a base64 decoder based on the description in RFC 4648.
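As a rough illustration of that first step, the sketch below runs the RFC 4648 (section 10) test vectors through a decoder. The helper name check_decoder is made up for this example, and it assumes the decoder under test takes a str and returns bytes, as Python's base64.b64decode does.

```python
import base64

# RFC 4648, section 10: (plain text, base64 encoding) pairs.
RFC4648_VECTORS = [
    ("", ""),
    ("f", "Zg=="),
    ("fo", "Zm8="),
    ("foo", "Zm9v"),
    ("foob", "Zm9vYg=="),
    ("fooba", "Zm9vYmE="),
    ("foobar", "Zm9vYmFy"),
]


def check_decoder(decode):
    """Hypothetical helper: return the vectors that `decode` gets wrong.

    `decode` is assumed to map a base64 str to the decoded bytes.
    """
    failures = []
    for plain, encoded in RFC4648_VECTORS:
        expected = plain.encode("ascii")
        if decode(encoded) != expected:
            failures.append((encoded, expected))
    return failures


# The standard-library decoder passes every vector, so the list is empty.
print(check_decoder(base64.b64decode))  # []
```

A hand-written port that returns a str can be wrapped first, for example with lambda s: my_atob(s).encode("latin-1"), where my_atob stands for whatever function is being tested.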
Here is a tested version
https://www.online-python.com/PiseKNFuaO
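For comparison, here is a sketch of a near line-for-line Python port of the JS loop from the question. It is not the code behind the link above, only one possible reading of the original. Two points it tries to make explicit: the comma operator in (bs = ..., bc++ % 4) first updates bs and then yields the old value of bc % 4, and the table lookup (chars.indexOf) happens before the arithmetic, so the accumulator is built from 6-bit table indices rather than from ord() values. Note also that the JS accumulates bs * 64, not bc * 64, and that bc += 1 % 4 in Python is just bc += 1.

```python
import base64
import re

CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="


def atob(b64: str) -> str:
    """Sketch of a Python port of the JS atob shown in the question."""
    s = re.sub(r"=+$", "", b64)  # strip trailing padding
    if len(s) % 4 == 1:
        raise ValueError("atob failed: the string is not valid base64")

    bc = 0    # count of table characters consumed so far
    bs = 0    # bit storage: accumulates 6 bits per character
    out = []
    for ch in s:
        value = CHARS.find(ch)  # 0-63 (64 for '='), or -1 when not in the table
        if value == -1:
            continue            # JS: `~buffer` is 0 for -1, so the step is skipped
        # First character of each group of four restarts the accumulator.
        bs = bs * 64 + value if bc % 4 else value
        was_first = bc % 4 == 0  # JS: `bc++ % 4` tests the value before the increment
        bc += 1
        if not was_first:
            # (-2 * bc & 6) is 4, 2, 0 for the 2nd, 3rd, 4th character of a
            # group, which aligns the next complete byte with the low 8 bits.
            out.append(chr(255 & bs >> (-2 * bc & 6)))
    return "".join(out)


print(atob("Zm9vYmFy"))  # foobar

# The result is a str of code points 0-255, so encode as latin-1 to compare bytes.
sample = "fwHzODWqgMH+NjBq02yeyQ=="
print(atob(sample).encode("latin-1") == base64.b64decode(sample))  # True
```

Like the JS original, this returns a string with one character per decoded byte; a bytes result (as base64.b64decode returns) is usually more convenient in Python, and the latin-1 encoding step converts between the two.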