What is the best practice for storing a large (400k lines) hierarchical YAML document, handled in Python with ruamel.yaml, without losing the data types? Which database or process is preferred for such use cases?
The goal is to store the data in a database so that changes can be made in parallel, and finally to export the data back to the same ruamel.yaml YAML files while preserving the value types.
The background to my question is that loading and working with such a large file is currently not performant.
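
To make the requirement concrete, here is a minimal sketch of one way to keep value types intact while the data sits in a database. It uses Python's built-in `sqlite3` and `json` purely as stand-ins (neither appears in the question), stores one JSON document per top-level record, and tags the types JSON cannot represent. The table layout, the `$date`/`$datetime` tags and the helper names are all hypothetical, and the sketch assumes string keys only, since JSON would turn an integer key into a string.

```python
import datetime
import json
import sqlite3


def encode(value):
    """Make a value JSON-serialisable, tagging the types JSON cannot represent."""
    if isinstance(value, datetime.datetime):  # test first: datetime subclasses date
        return {"$datetime": value.isoformat()}
    if isinstance(value, datetime.date):
        return {"$date": value.isoformat()}
    if isinstance(value, dict):
        return {key: encode(item) for key, item in value.items()}
    if isinstance(value, list):
        return [encode(item) for item in value]
    return value  # str, int, float, bool and None map to JSON directly


def decode(obj):
    """object_hook for json.loads: turn tagged dicts back into date objects."""
    if set(obj) == {"$datetime"}:
        return datetime.datetime.fromisoformat(obj["$datetime"])
    if set(obj) == {"$date"}:
        return datetime.date.fromisoformat(obj["$date"])
    return obj


# Hypothetical sample data standing in for the loaded YAML.
records = [
    {
        "abc": datetime.datetime(2023, 7, 19, 11, 10, 45),
        "xyz": False,
        "num": [42, 3.14, 192, datetime.date(2011, 10, 2)],
    },
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
db.executemany(
    "INSERT INTO record (body) VALUES (?)",
    [(json.dumps(encode(record)),) for record in records],
)
db.commit()

# Read everything back and check that values and types survived.
restored = [
    json.loads(body, object_hook=decode)
    for (body,) in db.execute("SELECT body FROM record ORDER BY id")
]
print(restored == records)
```

Storing one row per record means each record can be updated independently, which is the part that matters for parallel changes; whether that granularity fits depends on how the real hierarchy is shaped.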

2 Answers


  1. One option is to use a NoSQL database, such as MongoDB or Apache Cassandra, which can handle large volumes of data and provides flexible schema support. NoSQL databases can store hierarchical structures in a more natural format, such as nested documents or key-value pairs.

  2. I won’t apologize for ruamel.yaml being as slow as it is. There is (still) a lot of overhead from copying strings around between the various stages of loading and dumping. Additionally, a scalar is loaded again after dumping to make sure it preserves the same type and doesn’t throw an error (if not, the scalar is dumped quoted).

    I switched to using msgpack for data I don’t have to read/edit, sometimes using automated YAML to msgpack conversion if the YAML document is newer. That works well when you read the YAML file way more often than you update it.

    import sys
    import ruamel.yaml
    from ruamel.ext.msgpack import packb, unpackb

    yaml_str = """
    - abc: 2023-07-19T11:10:45
      24: some text
      xyz: false
      num: [42, 3.14, 192, 2011-10-02]
    """

    yaml = ruamel.yaml.YAML(typ='safe')
    yaml.explicit_start = True  # emit the leading '---' when dumping
    data = yaml.load(yaml_str)
    packed = packb(data)        # Python data -> msgpack bytes
    unpacked = unpackb(packed)  # msgpack bytes -> Python data
    print(unpacked)
    yaml.dump(unpacked, sys.stdout)
    

    which gives:

    [{'abc': datetime.datetime(2023, 7, 19, 11, 10, 45, tzinfo=datetime.timezone.utc), 24: 'some text', 'xyz': False, 'num': [42, 3.14, 192, datetime.date(2011, 10, 2)]}]
    ---
    - abc: 2023-07-19 11:10:45+00:00
      24: some text
      xyz: false
      num: [42, 3.14, 192, 2011-10-02]
    

    msgpack allows you to define your own types, so that is how the datetime.date gets round-tripped.

    Some timings using a roughly 400 kB YAML file on my MacBook M1:

    import time
    from pathlib import Path

    import ruamel.yaml
    from ruamel.ext.msgpack import pack, unpackb

    yaml_path = Path('input.yaml')  # renamed from `input`, which shadows the builtin
    msp = Path('input.msgpack')

    yaml = ruamel.yaml.YAML(typ='safe')

    start = time.time()
    print(f'YAML size:       {yaml_path.stat().st_size} bytes')
    data = yaml.load(yaml_path)
    print(f'loading YAML:    {time.time() - start:.4f}s')
    start = time.time()
    # close the file explicitly so everything is flushed before it is read back
    with msp.open('wb') as fp:
        pack(data, fp)
    print(f'dumping msgpack: {time.time() - start:.4f}s')
    start = time.time()
    res = unpackb(msp.read_bytes())
    print(f'loading msgpack: {time.time() - start:.4f}s')
    print(f'msgpack size:    {msp.stat().st_size} bytes')
    

    which gives:

    YAML size:       409635 bytes
    loading YAML:    0.5584s
    dumping msgpack: 0.0023s
    loading msgpack: 0.0046s
    msgpack size:    335021 bytes
    

    For some situations I store (concatenated) msgpack snippets as values in an lmdb database, but for a 400k YAML file that is IMO overkill.
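
The "convert when the YAML document is newer" idea from the second answer could look roughly like the sketch below. It is not the answerer's code: `load_cached` and its parameters are hypothetical names, and `json` plus `pickle` from the standard library stand in for the slow parser and the fast binary format so the example runs without extra packages. With the setup from the answer, one would presumably pass the ruamel.yaml loader as `parse` and `packb`/`unpackb` as `pack`/`unpack`.

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_cached(source, cache, parse, pack, unpack):
    """Return the data in `source`, running the slow parser only if the cache is stale."""
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        return unpack(cache.read_bytes())  # fast path: cache is at least as new
    data = parse(source.read_text())       # slow path: parse the source again
    cache.write_bytes(pack(data))          # refresh the cache for the next call
    return data


calls = []  # records every slow parse, to show how often it happens


def slow_parse(text):
    calls.append(text)
    return json.loads(text)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input.json"
    cache = Path(tmp) / "input.cache"
    source.write_text('{"num": [42, 3.14, 192], "xyz": false}')
    first = load_cached(source, cache, slow_parse, pickle.dumps, pickle.loads)
    second = load_cached(source, cache, slow_parse, pickle.dumps, pickle.loads)

print(first == second, len(calls))
```

The comparison relies on file modification times, so an edit to the source that lands within the filesystem's timestamp resolution of the cache write could go unnoticed; comparing a content hash instead would avoid that at the cost of reading the source every time.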
