
I have an application which deals with blockchain data, so the Postgres tables have quite a lot of fields which are fixed-size hex values.

One of the most important columns is the public_address column of the user table, which is a 20-byte hex value and the primary key of the table. I call it important because this column is used in almost all other queries within a JOIN or WHERE clause, so its indexing and query speed are crucial.

On the other hand, there is another column, nonce, in the transaction table, which is a 32-byte hex value with a single NOT NULL constraint on it. It is not used in any JOIN or WHERE clauses; it only appears in SELECT statements to read the value of the column.

I am aware that bytea is storage-wise more efficient than varchar, but I don’t know which one is better to use for indexes and for faster queries.

So my question is: under which circumstances should bytea be used over varchar(N) for hex values, and vice versa, taking indexing and query speed into account?

2 Answers


  1. What you call a "hex value" is just a text representation of binary data. And that is what matters. It’s plain, raw, arbitrary, binary data. Just a sequence of bytes.

    Indexing and query speed (while I don’t expect them to be much different) do not matter here: there is only one appropriate data type for arbitrary byte sequences, and that is bytea.

    Do not use varchar: "The characters that can be stored in any of these data types are determined by the database character set, which is selected when the database is created. Regardless of the specific character set, the character with code zero (sometimes called NUL) cannot be stored."
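The point about raw bytes versus text can be illustrated outside the database. The following is a minimal Python sketch, not PostgreSQL code; the address value is made up for illustration and was chosen to contain a zero byte and bytes above 0x7F. It shows that the hex text is twice as long as the bytes it represents, and that the raw bytes themselves are not valid text.

```python
# Hypothetical 20-byte address (not a real one), chosen so that it
# contains a zero byte and several bytes above 0x7F.
raw_address = bytes.fromhex("00a1b2c3d4e5f60718293a4b5c6d7e8f90ab00ff")

# The "hex value" is only a text rendering of those bytes.
hex_text = raw_address.hex()

print(len(raw_address))  # number of raw bytes
print(len(hex_text))     # number of hex characters (two per byte)

# A zero byte is present, which a character column cannot hold.
contains_nul = 0 in raw_address

# The raw bytes are also not a valid UTF-8 string.
try:
    raw_address.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(contains_nul, valid_utf8)
```

So a character column can only ever hold the 40-character rendering, never the 20 bytes themselves.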

  2. tl;dr: Use bytea. It will likely be more performant, but more importantly it’s correct; there are some bytes which are illegal for varchar. From the docs:

    First, binary strings specifically allow storing octets of value zero and other “non-printable” octets (usually, octets outside the decimal range 32 to 126). Character strings disallow zero octets, and also disallow any other octet values and sequences of octet values that are invalid according to the database’s selected character set encoding. Second, operations on binary strings process the actual bytes, whereas the processing of character strings depends on locale settings. In short, binary strings are appropriate for storing data that the programmer thinks of as “raw bytes”, whereas character strings are appropriate for storing text.

    varchar stores bytes which represent characters using a given character encoding. In many character encodings, certain byte combinations are illegal; they represent no character. For example, in UTF-8 a byte above 7F is never valid on its own; it must be part of a well-formed multi-byte sequence. So you can’t reliably store bytes as bytes in varchar.

    Instead, varchar will store a string representation of the bytes. So \x00AA will be stored not as the two bytes 00 AA but as the six-character string "\x00AA" which, in UTF-8, is 5C 78 30 30 41 41.

    Demonstration.

    Converting back and forth between the hex characters and the real bytes is more work and more opportunities for mistakes. It’s more efficient and more correct to use bytea.
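If the application receives addresses as hex text, the conversion can at least be confined to one place at the boundary, so that only raw bytes are passed to the bytea column. The following Python sketch rests on stated assumptions: the helper names `address_to_bytes` and `bytes_to_address` are hypothetical, the 20-byte length comes from the question, and an optional `0x` prefix is assumed. It also re-checks the byte counts quoted above for the `\x00AA` example.

```python
ADDRESS_LENGTH = 20  # bytes, as stated in the question


def address_to_bytes(hex_address: str) -> bytes:
    """Parse hex text (with an optional 0x prefix) into raw bytes."""
    text = hex_address
    if text[:2] in ("0x", "0X"):
        text = text[2:]
    raw = bytes.fromhex(text)  # raises ValueError on invalid hex
    if len(raw) != ADDRESS_LENGTH:
        raise ValueError(f"expected {ADDRESS_LENGTH} bytes, got {len(raw)}")
    return raw


def bytes_to_address(raw: bytes) -> str:
    """Render raw bytes as lower-case hex text with a 0x prefix."""
    return "0x" + raw.hex()


# The literal text "\x00AA" is six characters, hence six bytes in UTF-8,
# while the value it denotes is only two bytes.
as_text = "\\x00AA".encode("utf-8")
as_bytes = bytes.fromhex("00AA")
print(as_text.hex(" "), len(as_text))
print(as_bytes.hex(" "), len(as_bytes))

# Round trip for a made-up address.
example = "0x" + "ab" * ADDRESS_LENGTH
print(bytes_to_address(address_to_bytes(example)) == example)
```

Validating the length at parse time means a malformed address fails in the application rather than ending up in the table.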
