Base64 is a binary-to-text encoding rather than an encryption technique, but I thought it made sense to cover it in this series because it is so widely used, especially for transmitting data over the wire. The reason is that the characters chosen for its alphabet are printable and common to most character encodings, so the encoded data passes safely through channels that only handle text.
Here is the Base64 index table:
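In case the table does not render for you, here is a minimal Python 3 sketch that prints the standard Base64 alphabet (the one defined in RFC 4648) next to its indices. The names B64_ALPHABET and index_table are illustrative choices, not part of any library.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, then "+" and "/" (indices 0 to 63).
# The "=" padding character is not part of the index table.
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def index_table(columns=4):
    """Return the index table as text, filled column by column."""
    rows = 64 // columns
    lines = []
    for r in range(rows):
        cells = []
        for c in range(columns):
            i = c * rows + r
            cells.append(f"{i:2d} {B64_ALPHABET[i]}")
        lines.append("   ".join(cells))
    return "\n".join(lines)


print(index_table())
```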
A string is converted to Base64 by taking the 8-bit binary representation of each character, joining the bits together, and slicing them into 6-bit units, since Base64 has 2^6 = 64 symbols (indices 0 to 63). Each 6-bit value is then looked up in the index table above. Let's take the string Sun as an example and see how it is represented in Base64:
Text | S | u | n |
ASCII code | 83 | 117 | 110 |
Binary | 01010011 | 01110101 | 01101110 |
6-bit | 010100 | 110111 | 010101 | 101110 |
Base64 index | 20 | 55 | 21 | 46 |
Base64 encoded | U | 3 | V | u |
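The same slicing can be done by hand in code. This is a rough Python 3 sketch of the steps in the table, not how the standard library does it internally; six_bit_groups and B64_ALPHABET are names made up for this illustration, and padding with "=" is deliberately left out.

```python
import string

# Standard Base64 alphabet, index 0 to 63.
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def six_bit_groups(text):
    """Return a (bits, index, character) tuple for each 6-bit slice of text."""
    # 8-bit binary form of every byte, joined into one long bit string.
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    # Pad with zero bits so that the last slice is a full 6 bits.
    bits += "0" * (-len(bits) % 6)
    groups = []
    for start in range(0, len(bits), 6):
        chunk = bits[start:start + 6]
        index = int(chunk, 2)
        groups.append((chunk, index, B64_ALPHABET[index]))
    return groups


for chunk, index, char in six_bit_groups("Sun"):
    print(chunk, index, char)
# 010100 20 U
# 110111 55 3
# 010101 21 V
# 101110 46 u
```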
We can verify this by converting the string in Python 2 (the str.encode("base64") shortcut used here is Python 2 only):
>>> "Sun".encode("base64")
'U3Vu\n'
The newline character that we see at the end of the output is ignored when decoding. Whether we decode the string with or without the newline, we get the same string back:
>>> "U3Vu\n".decode("base64")
'Sun'
>>> "U3Vu".decode("base64")
'Sun'
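If you are on Python 3, the snippets above will not run as written, because strings no longer accept the "base64" codec. A rough equivalent, assuming the standard base64 module, looks like this: encodebytes adds the trailing newline just like the Python 2 codec, while b64encode does not.

```python
import base64

# encodebytes appends a trailing newline, b64encode does not.
with_newline = base64.encodebytes(b"Sun")
without_newline = base64.b64encode(b"Sun")
print(with_newline)     # b'U3Vu\n'
print(without_newline)  # b'U3Vu'

# By default b64decode discards characters outside the Base64 alphabet,
# so the newline makes no difference to the result.
print(base64.b64decode(with_newline))     # b'Sun'
print(base64.b64decode(without_newline))  # b'Sun'
```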
The length of the encoded output has to be a multiple of 4 characters. If it is not, one or two “=” characters are appended to make it so. For example, when we convert Earth to Base64 we see this in action:
>>> "Earth".encode("base64")
'RWFydGg=\n'
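To see where the one or two “=” come from: every 3 input bytes become 4 output characters, so the padding depends only on the input length modulo 3. Here is a small Python 3 sketch of that arithmetic; padding_needed is a helper name invented for this example.

```python
import base64


def padding_needed(data):
    """Number of "=" characters Base64 appends for this input (0, 1 or 2)."""
    return (3 - len(data) % 3) % 3


for word in (b"S", b"Su", b"Sun", b"Earth"):
    encoded = base64.b64encode(word).decode("ascii")
    # The count of trailing "=" should match the arithmetic above.
    print(word.decode("ascii"), encoded, padding_needed(word))
```

Earth is 5 bytes, which leaves 2 bytes in the last group, so a single “=” is needed to fill the final block of 4 characters.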
Sometimes, for various reasons, strings are Base64 encoded multiple times, and as you might have noticed by now, each pass increases the length of the output. The Base64 encoder that I wrote, using the one built into Python, takes the number of times you would like to encode your string. The code is pretty straightforward:
input_str = raw_input("Enter the string that you like to be base64 encoded:")
times = int(raw_input("How deep do you want it encoded:"))
output_str = input_str
# Feed the output of each pass back in as the input of the next one
for i in range(times):
    output_str = output_str.encode("base64")
print "Encoded string: ", output_str
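For Python 3 users, the same idea can be sketched with the base64 module. This is an approximation rather than a drop-in port: b64encode does not insert the newlines that the Python 2 codec adds, so the intermediate strings differ slightly. The function name encode_n_times is hypothetical.

```python
import base64


def encode_n_times(text, times):
    """Base64 encode text repeatedly, feeding each output into the next pass."""
    data = text.encode("utf-8")
    for _ in range(times):
        data = base64.b64encode(data)
    return data.decode("ascii") if times else text


print(encode_n_times("Sun", 1))  # U3Vu
print(encode_n_times("Sun", 2))  # VTNWdQ==
# Every pass turns 3 bytes into 4 characters, so the output keeps growing.
print(len(encode_n_times("Sun", 10)))
```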
And here is a sample run
This is where it gets a little trickier, since while decoding I assume that I do not know how many times the text was encoded. I created a base string that contains all the valid characters in Base64 encoded strings, and then take the Base64 encoded string as input:
base_64_encoding_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
input_str = raw_input("Enter the base64 encoded string that you would like to decode: ")
With the string to be decoded in hand, we go into a while loop and stay in it until we have a potential candidate for the original string. The basic logic is to try to decode the string; if decoding fails, we append an “=” to its end, increase the error count, and try again. We allow two such retries, and otherwise keep decoding until we reach a string that cannot be decoded.
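import binascii

# Loop state (initial values assumed here)
error_count = 0          # consecutive failed decode attempts
depth = 0                # number of successful decodes so far
output_str = input_str   # best candidate found so far

while error_count < 3:
    input_str, is_end = ValidateAndSplit(input_str.replace('\n', ''))
    if is_end:
        break
    try:
        temp = input_str.decode("base64")
        input_str = temp
        output_str = temp
        depth = depth + 1
        error_count = 0
        print input_str
    except binascii.Error as err:
        # Decoding failed: assume padding is missing, add "=" and retry
        error_count = error_count + 1
        input_str = input_str + "="

print "Potential decoded string: ", output_str, "\nWith depth: ", depth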
The ValidateAndSplit method basically removes unnecessary characters from the string to make sure we don’t go down a bad path, and also tells us when we have potentially reached the end of our search:
def ValidateAndSplit(input_str):
    is_end = False
    n = len(input_str)
    # Nothing left to decode
    if n < 1:
        is_end = True
        return input_str, is_end
    for i in range(n):
        c = input_str[i]
        location = base_64_encoding_characters.find(c)
        if location < 0 and c == " ":
            # A space suggests we have reached plain text
            is_end = True
            break
        elif location < 0:
            # Keep only the part before the first invalid character
            data = input_str.split(c, 1)
            input_str = data[0]
            break
    return input_str, is_end
Here’s a sample run of this decoder on the Base64 string that we encoded 10 times earlier:
The problem with the current approach is that we might over-decode strings that are one word only, since a single word can itself look like valid Base64. One fix could be to reach out to an online dictionary and check whether we have found a valid word.
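For anyone on Python 3, here is a rough sketch of the same decoding idea using the base64 module. It is a simplification of the loop above, not a port of it: instead of retrying with one “=” at a time it restores the padding up front, and it stops as soon as a pass fails strict validation or does not produce printable ASCII. The name decode_unknown_depth and the max_depth safety limit are assumptions of this sketch, and it has the same over-decoding weakness for single words.

```python
import base64


def decode_unknown_depth(encoded, max_depth=50):
    """Decode Base64 repeatedly until the text stops looking like Base64.

    Returns the candidate plain text and the number of passes it took.
    """
    current = encoded.strip()
    depth = 0
    while depth < max_depth:
        candidate = current.replace("\n", "")
        if not candidate:
            break
        # Restore any padding that was stripped along the way.
        candidate += "=" * (-len(candidate) % 4)
        try:
            # validate=True rejects characters outside the Base64 alphabet.
            decoded = base64.b64decode(candidate, validate=True).decode("ascii")
        except ValueError:
            # binascii.Error and UnicodeDecodeError are both ValueErrors.
            break
        if not decoded or not decoded.isprintable():
            break
        current = decoded
        depth += 1
    return current, depth


# Encode a sample string 10 times, then recover it without knowing the depth.
data = b"Hello, World!"
for _ in range(10):
    data = base64.b64encode(data)
print(decode_unknown_depth(data.decode("ascii")))  # ('Hello, World!', 10)
```

The sample string works because the comma, space and exclamation mark are not Base64 characters, so the loop has a clear place to stop.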
The entire source code for this post can be found at https://github.com/abhishuk85/cryptography-plays
Any questions, comments or feedback are most welcome.