
I wrote several simple C++ functions to convert byte sequences to string representations.

It seemed straightforward and I am sure my logic is right, but when I printed the strings the output was garbage:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using std::vector;
typedef vector<uint8_t> bytes;
using std::string;
using std::cout;
using namespace std::literals;

string DIGITS = "0123456789abcdef"s;

static inline string hexlify(bytes arr) {
    string repr = ""s;
    for (auto& chr : arr) {
        repr += " " + DIGITS[(chr & 240) >> 4] + DIGITS[chr & 15];
    }
    repr.erase(0, 1);
    return repr;
}

bytes text = {
    84, 111, 32, 98, 101, 32,
    111, 114, 32, 110, 111, 116,
    32, 116, 111, 32, 98, 101
}; // To be or not to be

int main() {
    cout << hexlify(text);
}
2♠
÷82♠
÷82♠
÷82♠
÷

Why is this happening?

I know my logic is right, the following is the direct translation to Python:

digits = "0123456789abcdef"
def bytes_string(data):
    s = ""
    for i in data:
        s += " " + digits[(i & 240) >> 4] + digits[i & 15]
    return s[1:]

And it works:

>>> bytes_string(b"To be or not to be")
'54 6f 20 62 65 20 6f 72 20 6e 6f 74 20 74 6f 20 62 65'

But why doesn’t it work in C++?

I am using Visual Studio 2022 V17.9.7, compiler flags:

/permissive- /ifcOutput "hexlify_testx64Release" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"hexlify_testx64Releasevc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"hexlify_testx64Release" /EHsc /nologo /Fo"hexlify_testx64Release" /Ot /Fp"hexlify_testx64Releasehexlify_test.pch" /diagnostics:column 

Update: before the fix, I compiled in Release mode and saw the garbage output. After applying the fix, I found that the garbage output still occurs, but only in Debug mode (targeting C++20); switching to Release mode makes it go away.

2 Answers


  1. As noted in the comments, the problem is in the string concatenation. The following expression doesn’t concatenate anything:

    " " + DIGITS[(chr & 240) >> 4]
    

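    When you extract a character from the string DIGITS, it has type char, a dedicated type for single characters. For historical reasons (compatibility with C), the + operator treats the string literal " " as a pointer (const char*) and the digit character as an integer, and performs pointer arithmetic. The resulting pointer lands far past the end of the two-byte literal, so appending it to repr reads whatever memory happens to be there. That is undefined behavior, which explains the garbage.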

    To do concatenation, use a string literal of type std::string, like you did elsewhere in your code:

    " "s + DIGITS[(chr & 240) >> 4]
    

    Here, operator+ encounters correct types std::string and char, so it works correctly.
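
    The type difference described above can be checked at compile time. The following is a minimal sketch, not the asker's final code: hexlify_fixed is a hypothetical name, and the only intended change from the question's function is the " "s literal (plus passing the vector by const reference).

    ```cpp
    #include <cstdint>
    #include <string>
    #include <type_traits>
    #include <vector>

    using namespace std::literals;

    const std::string DIGITS = "0123456789abcdef"s;

    // decltype is unevaluated, so these checks only inspect types and never
    // form the out-of-bounds pointer that the original expression produces.
    static_assert(std::is_same_v<decltype(" " + DIGITS[0]), const char*>);
    static_assert(std::is_same_v<decltype(" "s + DIGITS[0]), std::string>);

    // Hypothetical name: the question's hexlify with the literal changed to " "s.
    std::string hexlify_fixed(const std::vector<std::uint8_t>& arr) {
        std::string repr;
        for (std::uint8_t chr : arr) {
            // std::string + char + char: real concatenation at every step
            repr += " "s + DIGITS[(chr & 240) >> 4] + DIGITS[chr & 15];
        }
        if (!repr.empty()) {
            repr.erase(0, 1);  // drop the leading space
        }
        return repr;
    }
    ```

    With this change, the bytes 84, 111, 32 should come out as "54 6f 20", matching the Python version.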


    A common C++ idiom for building a string piece by piece is a string stream.

    #include <sstream>
    ...
    std::ostringstream stream; // "output string stream"
    stream << " " << DIGITS[...] << DIGITS[...];
    ...
    return stream.str();
    

    The ostringstream class is designed for building strings incrementally. Once the code has written all the pieces, str() returns the result as a regular, general-purpose std::string.
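
    For completeness, here is one way the elided parts of the stream snippet could be filled in. This is a sketch under assumptions: hexlify_stream and the local digits table are hypothetical names, and the separator handling is just one of several reasonable choices.

    ```cpp
    #include <cstdint>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical name; produces the same "hh hh hh" format as the question.
    std::string hexlify_stream(const std::vector<std::uint8_t>& arr) {
        static const char digits[] = "0123456789abcdef";
        std::ostringstream stream;
        bool first = true;
        for (std::uint8_t chr : arr) {
            if (!first) {
                stream << ' ';  // separator between bytes, none before the first
            }
            first = false;
            // Each operand is a char, so it is written as a single character.
            stream << digits[chr >> 4] << digits[chr & 15];
        }
        return stream.str();
    }
    ```

    Writing the separator before every byte except the first avoids having to erase a leading space afterwards.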

  2. In addition to the accepted answer: it IS possible to use your logic, but the syntax is not Pythonic. A byte is always two digits in hex, so knowing the length of the input, you know the length of the result. It’s much better to preallocate than to output individual characters to a stream or concatenate strings, and then to index into the string like an array, something along these lines:

    std::string hexify(const bytes& buf) {
        std::string result;
        result.resize(buf.size() * 2);   // two hex digits per byte
        for (std::size_t i = 0, j = 0; i < buf.size(); i++) {
            auto c = buf[i];
            result[j++] = (c >> 4)["0123456789abcdef"];   // high nibble first
            result[j++] = (c & 0xF)["0123456789abcdef"];  // then low nibble
        }
        return result;
    }
    

    This version omits the separating spaces and can be optimized further, but it’s easy to see how "best practices" in a native language differ from those of an interpreter that preallocates everything and does all the "dirty" work for you.

    P.S. (c & 0xF)["0123456789abcdef"] is the same as "0123456789abcdef"[c & 0xF], but some compilers more often spot a constant inside the brackets.
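
    The preallocation idea can also be extended to the space-separated format the question asks for. The following is a rough sketch, assuming n bytes need 3 * n - 1 characters (two digits per byte plus n - 1 separators); hexlify_prealloc and the digits table are hypothetical names.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical name; builds "hh hh hh" in a buffer sized up front.
    std::string hexlify_prealloc(const std::vector<std::uint8_t>& buf) {
        static const char digits[] = "0123456789abcdef";
        if (buf.empty()) {
            return {};  // avoids computing 3 * 0 - 1 on an unsigned type
        }
        // Pre-fill with spaces so only the digit positions need writing.
        std::string result(buf.size() * 3 - 1, ' ');
        std::size_t j = 0;
        for (std::uint8_t c : buf) {
            result[j]     = digits[c >> 4];   // high nibble
            result[j + 1] = digits[c & 0xF];  // low nibble
            j += 3;                           // step over the separator
        }
        return result;
    }
    ```

    Because the string is allocated once at its final size, the loop performs no reallocations and no erase call is needed at the end.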
