So, if I understand correctly, we can choose a binary format (protobuf, thrift, or avro) when we want better performance: the data are represented more compactly, and we avoid the overhead of parsing JSON/XML (or even CSV).
I can see that this is useful when dealing with a file containing a huge number of records, or when sending data over a network connection.
What is not clear to me is whether a binary format can also be useful for saving individual records in an RDBMS. It seems to me it can't, because we would lose the ability to search records based on their attributes, something an RDBMS such as MySQL already supports.
So what are some useful use cases for binary protocols with an RDBMS (setting aside backing up the whole database in binary format)?
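To make the compactness claim concrete, here is a toy sketch, using Python's `struct` module as a stand-in for a real binary format like protobuf (the record and its layout are invented for illustration):

```python
import json
import struct

# A made-up record with three fields
record = {"id": 123456, "price": 19.99, "quantity": 4}

# Text encoding: JSON
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: a fixed layout of int32, float64, int32 (16 bytes total)
as_binary = struct.pack("<idi", record["id"], record["price"], record["quantity"])

print(len(as_json), len(as_binary))  # 45 16
```

Real formats like protobuf add field tags and variable-length integer encoding, so the exact sizes differ, but the binary form stays well below the text form, and reading it back requires no text parsing.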
2 Answers
I don’t know the details of protobuf, thrift, or avro, but I assume they involve (1) compression, (2) storage in binary, (3) decompression when reading.
If you don’t have gigabytes of data, this optimization is premature.
Text (English, JSON, CSV, XML, C code, etc.) compresses at roughly 3:1. That shrinks the disk space needed and the network cost, but at the cost of CPU time for compression and decompression. XML is noticeably bulkier than JSON or YAML. (So I never ‘choose’ XML.)
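The rough 3:1 figure is easy to check with zlib. The data below is invented and deliberately repetitive, so it compresses much better than 3:1; ordinary prose sits closer to that figure:

```python
import json
import zlib

# Invented, repetitive application data
rows = [{"name": f"user{i}", "active": i % 2 == 0, "score": i * 3}
        for i in range(1000)]
text = json.dumps(rows).encode("utf-8")

compressed = zlib.compress(text)
print(f"{len(text)} -> {len(compressed)} bytes, "
      f"ratio {len(text) / len(compressed):.1f}:1")
```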
Most image formats, as well as PDF, are already compressed; compressing them again is counterproductive.
Storing records in a database in an opaque format is not advisable, no matter how compact that format is. It carries major disadvantages for both reading and writing:
Reading: it makes it impossible to index records on their attributes, so queries become very slow (as you already mentioned: "we would not have any ability to search records based on any attributes").
Writing: it prevents the database from compressing the data itself. Columnar databases, for example, store the values of each column together; because these values tend to be similar, they compress extremely well, usually better than compressing whole records.
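A quick way to see this effect is to compress the same toy data laid out row by row versus column by column. Here zlib is only a stand-in for a database's own compression, and the data is invented:

```python
import random
import zlib

random.seed(42)
n = 3000
statuses = [random.choice(["active", "inactive", "pending"]) for _ in range(n)]
scores = [str(random.randint(0, 99)) for _ in range(n)]

# Row layout: fields interleaved record by record
row_wise = "|".join(f"{s},{x}" for s, x in zip(statuses, scores)).encode()

# Column layout: each column's values stored together
col_wise = ("|".join(statuses) + "#" + "|".join(scores)).encode()

rw = len(zlib.compress(row_wise))
cw = len(zlib.compress(col_wise))
print(rw, cw)  # the column layout compresses tighter
```

Both layouts contain the same bytes of payload; only the arrangement differs, yet grouping similar values gives the compressor longer matches to work with.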