What is the best practice for storing a large (400k-line) hierarchical data set, loaded with Python's ruamel.yaml, in a database without losing the data types? Which database or process is preferred for such use cases?
The goal is to store the data in a database so it can be edited in parallel, and finally to export it back to the same ruamel.yaml files while preserving the value types.
The background to my question is that loading and working with such a large file is currently not performant.
Question posted in PostgreSQL
2 Answers
One option is to use a NoSQL database, such as MongoDB or Apache Cassandra, which can handle large volumes of data and provide flexible schema support. NoSQL databases can store hierarchical structures in a more natural format, such as nested documents or key-value pairs; a minimal MongoDB sketch follows.
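For illustration only, a sketch of storing a nested mapping as a MongoDB document, assuming a local server and pymongo; the connection string, database, collection, and document names are invented:

```python
from pymongo import MongoClient

# Connection string and names are assumptions for this sketch.
client = MongoClient("mongodb://localhost:27017")
coll = client["config_db"]["documents"]

# A nested YAML mapping translates naturally into one BSON document.
coll.replace_one(
    {"_id": "app-config"},
    {"_id": "app-config", "server": {"host": "example.org", "ports": [80, 443]}},
    upsert=True,  # insert on first run, overwrite afterwards
)
print(coll.find_one({"_id": "app-config"}))
```

Note that BSON has no plain date type (pymongo accepts datetime.datetime but not datetime.date), so type-faithful round-tripping of YAML scalars would still need an explicit encoding.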
I won't apologize for ruamel.yaml being as slow as it is. There is (still) a lot of overhead, copying strings around between the various stages of loading and dumping. Additionally, a scalar is loaded after dumping to make sure it preserves the same type and doesn't throw an error (if not, the scalar will be dumped quoted).
I switched to using msgpack for data I don't have to read/edit, sometimes using an automated YAML-to-msgpack conversion if the YAML document is newer; a sketch of that pattern follows. This works well when you read far more often than you update the YAML file.
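A minimal sketch of such a staleness check, assuming a .mp sidecar file next to the YAML and plain (safe-loaded) Python data; scalars msgpack cannot encode natively, such as dates, would additionally need the custom-type hooks shown further down:

```python
import pathlib

import msgpack
from ruamel.yaml import YAML


def load_fast(yaml_path: pathlib.Path):
    """Load from a msgpack cache, regenerating it when the YAML is newer."""
    mp_path = yaml_path.with_suffix(".mp")  # sidecar name is an assumption
    if not mp_path.exists() or mp_path.stat().st_mtime < yaml_path.stat().st_mtime:
        data = YAML(typ="safe").load(yaml_path)  # plain dicts/lists/scalars
        # Dates etc. need a default= hook (see the datetime.date example below).
        mp_path.write_bytes(msgpack.packb(data))
    return msgpack.unpackb(mp_path.read_bytes())
```

The cache keeps only the data, not comments or anchors, which is why this only fits files you don't have to read/edit by hand.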
msgpack allows you to define your own types, which is how the datetime.date values get round-tripped; a sketch of such an encoder/decoder pair follows. In some timings with a 400k YAML file on my MacBook M1, loading from msgpack was far faster than loading the YAML.
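A minimal sketch of a msgpack extension type for datetime.date; the extension code 1 and the helper names are arbitrary choices for this example:

```python
import datetime

import msgpack

DATE_EXT_CODE = 1  # any free extension code; 1 is an arbitrary choice


def encode(obj):
    # Pack a date as its ISO string inside a msgpack ExtType.
    # (datetime.datetime is a subclass of date and would need its own code.)
    if isinstance(obj, datetime.date) and not isinstance(obj, datetime.datetime):
        return msgpack.ExtType(DATE_EXT_CODE, obj.isoformat().encode("ascii"))
    raise TypeError(f"cannot serialize {obj!r}")


def decode(code, data):
    if code == DATE_EXT_CODE:
        return datetime.date.fromisoformat(data.decode("ascii"))
    return msgpack.ExtType(code, data)


packed = msgpack.packb({"released": datetime.date(2024, 5, 1)}, default=encode)
assert msgpack.unpackb(packed, ext_hook=decode) == {"released": datetime.date(2024, 5, 1)}
```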
For some situations I store (concatenated) msgpack snippets as values in an lmdb database (sketch below), but for a 400k YAML file that is IMO overkill.
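A minimal sketch of that lmdb pattern, with one msgpack-encoded value per key; the file name, map size, and key layout are assumptions:

```python
import lmdb
import msgpack

# Path and map size are assumptions; lmdb needs a maximum size up front.
env = lmdb.open("data.lmdb", map_size=1 << 30)  # 1 GiB


def put(key: str, value) -> None:
    with env.begin(write=True) as txn:  # one writer at a time
        txn.put(key.encode(), msgpack.packb(value))


def get(key: str):
    with env.begin() as txn:  # readers don't block each other
        raw = txn.get(key.encode())
        return None if raw is None else msgpack.unpackb(raw)


put("server/timeout", 30)
print(get("server/timeout"))  # -> 30
```

Per-key records like this are what would let several processes update entries independently, which is closer to the parallel-editing part of the question.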