I wrote a few simple C++ functions to convert byte sequences to string representations.
It was pretty straightforward, and I was sure my logic was right; I thought this would be extremely easy, until I started to print the strings and found the output to be garbage:
#include <iostream>
#include <string>
#include <vector>
using std::vector;
typedef vector<uint8_t> bytes;
using std::string;
using std::cout;
using namespace std::literals;
string DIGITS = "0123456789abcdef"s;
static inline string hexlify(bytes arr) {
    string repr = ""s;
    for (auto& chr : arr) {
        repr += " " + DIGITS[(chr & 240) >> 4] + DIGITS[chr & 15];
    }
    repr.erase(0, 1);
    return repr;
}

bytes text = {
    84, 111, 32, 98, 101, 32,
    111, 114, 32, 110, 111, 116,
    32, 116, 111, 32, 98, 101
}; // To be or not to be

int main() {
    cout << hexlify(text);
}
2♠
÷82♠
÷82♠
÷82♠
÷
Why is this happening?
I know my logic is right; the following is a direct translation to Python:
digits = "0123456789abcdef"
def bytes_string(data):
    s = ""
    for i in data:
        s += " " + digits[(i & 240) >> 4] + digits[i & 15]
    return s[1:]
And it works:
>>> bytes_string(b"To be or not to be")
'54 6f 20 62 65 20 6f 72 20 6e 6f 74 20 74 6f 20 62 65'
But why doesn't it work in C++?
I am using Visual Studio 2022 V17.9.7, compiler flags:
/permissive- /ifcOutput "hexlify_testx64Release" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"hexlify_testx64Releasevc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"hexlify_testx64Release" /EHsc /nologo /Fo"hexlify_testx64Release" /Ot /Fp"hexlify_testx64Releasehexlify_test.pch" /diagnostics:column
I just found out that the garbage output only occurs in Debug mode after the fix is implemented. I targeted C++20 in Debug mode, and somehow the code still produces garbage output there; switching to Release mode fixes the problem. Before the fix was implemented, I compiled in Release mode and the problem was there too.
2 Answers
As noted in the comments, the problem is at or around string concatenation. The following code doesn't do concatenation:
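(the offending line, copied from hexlify above)

repr += " " + DIGITS[(chr & 240) >> 4] + DIGITS[chr & 15];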
When you extract a character from the string DIGITS, it has type char, a dedicated type for single characters. For historical reasons (compatibility with C), the + operator interprets the string literal " " as a pointer and the digit character as an integer, and does some useless pointer arithmetic.

To do concatenation, use a string literal of type std::string, like you did elsewhere in your code:
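A minimal sketch of that fix, reusing the names from the question and the "s suffix it already enables via std::literals:

repr += " "s + DIGITS[(chr & 240) >> 4] + DIGITS[chr & 15];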
Here, operator+ encounters the correct types, std::string and char, so it works correctly.

The proper idiom in C++ for doing this kind of string concatenation is a string stream.
The stringstream class is optimized for incremental building of strings. After the code finishes all the concatenations, it converts the stream to a regular std::string type, which is general-purpose.
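A sketch of hexlify rewritten with that idiom, assuming std::stringstream and reusing the question's bytes, string and DIGITS:

#include <sstream>

static inline string hexlify(const bytes& arr) {
    std::stringstream out;
    for (auto& chr : arr) {
        // stream insertion never falls back to pointer arithmetic
        out << ' ' << DIGITS[(chr & 240) >> 4] << DIGITS[chr & 15];
    }
    string repr = out.str();
    repr.erase(0, 1); // drop the leading space
    return repr;
}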
In addition to the accepted answer: it IS possible to use your logic, but the syntax is not Pythonic. For one, a byte is always two hex digits, so knowing the length of the original sequence you know the length of the result. It's way better to preallocate than to output individual characters to a stream or concatenate strings, and then use the string as an array, something along these lines:
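A sketch of that approach, again reusing the question's bytes, string and DIGITS; the exact code the answer had in mind may have differed:

static inline string hexlify(const bytes& arr) {
    if (arr.empty()) return ""s;
    // two hex digits per byte plus a separating space: 3 * n - 1 characters
    string repr(arr.size() * 3 - 1, ' ');
    size_t pos = 0;
    for (auto chr : arr) {
        repr[pos] = DIGITS[(chr & 240) >> 4];
        repr[pos + 1] = DIGITS[chr & 15];
        pos += 3; // skip the space that is already in place
    }
    return repr;
}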
There are some rough assumptions in the math there and it can be optimized further, but it's easy to see how "best practices" in a native language differ from an interpreter that preallocates everything and does all the "dirty" work for you.
P.S. (c & 0xF)["0123456789abcdef"] is the same as "0123456789abcdef"[c & 0xF], but some compilers spot a constant inside the brackets more readily.
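A quick illustration of that equivalence (built-in subscripting is defined as *(a + i), so the operands can be swapped); the values here are just for the example:

unsigned c = 0x5A;
char hi = "0123456789abcdef"[(c & 240) >> 4]; // '5'
char lo = (c & 0xF)["0123456789abcdef"];      // 'a'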