PostgreSQL - What is the best practice to store large (400k-line) Python ruamel.yaml files

ovntatar
July 19, 2023
293 views
0 votes
2 Answers

What is the best practice for storing large (400k-line) hierarchical YAML data, handled in Python with ruamel.yaml, without losing the data types? Which database or process is preferred for such use cases?
The goal is to store the data in a database so that changes can be made in parallel, and finally to export the data back to the same ruamel.yaml files while preserving the value types.
The background to my question is that loading and working with such a large file is currently not performant.

Answers

- Lendy345
- July 19, 2023 at 9:46 am
- 0 votes
One option is to use a NoSQL database, such as MongoDB or Apache Cassandra, which can handle large volumes of data and provide flexible schema support. NoSQL databases can store hierarchical structures in a more natural format, such as nested documents or key-value pairs.
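To make the key-value idea concrete, here is a minimal sketch of flattening a nested structure into rows that carry an explicit type tag, so that value types could survive storage in a text column. The tag names, `encode`, `flatten` and `DECODERS` are hypothetical and not part of any database or library API; only a few scalar types are covered.

```python
import datetime as dt


def encode(value):
    """Return a (type_tag, text) pair for a scalar (hypothetical tag names)."""
    if isinstance(value, bool):         # bool first: bool is a subclass of int
        return 'bool', str(value)
    if isinstance(value, int):
        return 'int', str(value)
    if isinstance(value, float):
        return 'float', repr(value)
    if isinstance(value, dt.datetime):  # datetime first: it is a subclass of date
        return 'datetime', value.isoformat()
    if isinstance(value, dt.date):
        return 'date', value.isoformat()
    return 'str', str(value)


# Maps each tag back to a function that rebuilds the typed value from text.
DECODERS = {
    'bool': lambda s: s == 'True',
    'int': int,
    'float': float,
    'datetime': dt.datetime.fromisoformat,
    'date': dt.date.fromisoformat,
    'str': str,
}


def flatten(node, path=()):
    """Yield (path, type_tag, text) for every scalar in nested dicts/lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, path + (index,))
    else:
        tag, text = encode(node)
        yield path, tag, text


data = [{'abc': dt.datetime(2023, 7, 19, 11, 10, 45), 'xyz': False,
         'num': [42, 3.14, dt.date(2011, 10, 2)]}]
rows = list(flatten(data))
decoded = [DECODERS[tag](text) for _, tag, text in rows]
```

Rebuilding the hierarchy from such rows is left out; it would also need to record whether each container is a mapping or a sequence, and empty containers produce no rows in this sketch.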


I won’t apologize for ruamel.yaml being as slow as it is. There is (still) a lot of overhead from copying strings around between the various stages of loading and dumping. Additionally, each scalar is loaded again after dumping, to make sure it preserves the same type and doesn’t throw an error (if not, the scalar will be dumped quoted).

I switched to using msgpack for data I don’t have to read/edit, sometimes using automated YAML-to-msgpack conversion if the YAML document is newer. That works well when you read way more often than you update the YAML file.

import sys
import ruamel.yaml
from ruamel.ext.msgpack import pack, packb, unpackb

yaml_str = """
- abc: 2023-07-19T11:10:45
  24: some text
  xyz: false
  num: [42, 3.14, 192, 2011-10-02]
"""
    
yaml = ruamel.yaml.YAML(typ='safe')
yaml.explicit_start = True
data = yaml.load(yaml_str)
packed = packb(data)
unpacked = unpackb(packed)
print(unpacked)
yaml.dump(unpacked, sys.stdout)

which gives:

[{'abc': datetime.datetime(2023, 7, 19, 11, 10, 45, tzinfo=datetime.timezone.utc), 24: 'some text', 'xyz': False, 'num': [42, 3.14, 192, datetime.date(2011, 10, 2)]}]
---
- abc: 2023-07-19 11:10:45+00:00
  24: some text
  xyz: false
  num: [42, 3.14, 192, 2011-10-02]

msgpack allows you to define your own types, so that is how the datetime.date gets round-tripped.
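As an illustration of what a user-defined msgpack type looks like, here is a small sketch using the generic third-party `msgpack` package rather than `ruamel.ext.msgpack`, whose internals may well differ. The type code `DATE_CODE` and the ISO-string payload are assumptions made for this example.

```python
import datetime as dt

import msgpack  # third-party: pip install msgpack

DATE_CODE = 17  # hypothetical application-specific ext type code (0-127)


def default(obj):
    """Called by msgpack for objects it cannot serialize natively."""
    if isinstance(obj, dt.date) and not isinstance(obj, dt.datetime):
        return msgpack.ExtType(DATE_CODE, obj.isoformat().encode('ascii'))
    raise TypeError(f'cannot pack {type(obj)!r}')


def ext_hook(code, data):
    """Called by msgpack for every ext type found while unpacking."""
    if code == DATE_CODE:
        return dt.date.fromisoformat(data.decode('ascii'))
    return msgpack.ExtType(code, data)  # leave unknown ext types untouched


original = {'num': [42, 3.14, dt.date(2011, 10, 2)]}
packed = msgpack.packb(original, default=default)
unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)
```

The date goes out as an ext value with its own type code and comes back as a `datetime.date`, while the integers and floats use msgpack's native types.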

Some timings using a roughly 400 KB YAML file on my MacBook M1:

from pathlib import Path
import time

import ruamel.yaml
from ruamel.ext.msgpack import pack, unpackb

yaml_path = Path('input.yaml')  # not named `input`, to avoid shadowing the built-in
msp = Path('input.msgpack')

yaml = ruamel.yaml.YAML(typ='safe')

start = time.time()
print(f'YAML size:       {yaml_path.stat().st_size} bytes')
data = yaml.load(yaml_path)
print(f'loading YAML:    {time.time() - start:.4f}s')
start = time.time()
with msp.open('wb') as fp:  # close the file so it is flushed before reading it back
    pack(data, fp)
print(f'dumping msgpack: {time.time() - start:.4f}s')
start = time.time()
res = unpackb(msp.read_bytes())
print(f'loading msgpack: {time.time() - start:.4f}s')
print(f'msgpack size:    {msp.stat().st_size} bytes')

which gives:

YAML size:       409635 bytes
loading YAML:    0.5584s
dumping msgpack: 0.0023s
loading msgpack: 0.0046s
msgpack size:    335021 bytes
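Given the timings above, the "convert only when the YAML document is newer" approach mentioned earlier could look roughly like the following sketch. The function name `load_with_cache` and its parameters are hypothetical; the parser and the pack/unpack functions are passed in, so the sketch itself depends only on the standard library.

```python
from pathlib import Path


def load_with_cache(source: Path, cache: Path, parse, pack, unpack):
    """Load data from `cache` if it is at least as new as `source`.

    parse:  callable taking the source text and returning Python data
    pack:   callable turning Python data into bytes
    unpack: callable turning those bytes back into Python data
    """
    # Cache hit: the binary file exists and is not older than the source.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        return unpack(cache.read_bytes())
    # Cache miss: do the slow parse once and refresh the binary copy.
    data = parse(source.read_text(encoding='utf-8'))
    cache.write_bytes(pack(data))
    return data
```

With the libraries used in this answer, `parse` could be the `load` method of `ruamel.yaml.YAML(typ='safe')`, and `pack`/`unpack` could be `packb`/`unpackb` from `ruamel.ext.msgpack`. Comparing modification times is a simple heuristic; an edit that lands within the same timestamp tick as the cache write would go unnoticed.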

For some situations I store (concatenated) msgpack snippets as values in an lmdb database, but for a 400k YAML file that is IMO overkill.
