
I’m wondering if there is a fast on-disk key-value store with Python bindings that supports millions of read/write calls to separate keys. My problem involves counting word co-occurrences in a very large corpus (Wikipedia) and continually updating the co-occurrence counts. This means reading and writing ~300 million values, roughly 70 times over, with 64-bit keys and 64-bit values.

I can also represent my data as an upper-triangular sparse matrix with dimensions ~ 2M x 2M.
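
For concreteness, a word pair can be packed into a single 64-bit key like this (a sketch only; word IDs for a ~2M vocabulary fit in 32 bits each):

    import struct

    def pack_key(i, j):
        # Keep only the upper triangle: order the pair so that i <= j.
        if i > j:
            i, j = j, i
        # Two 32-bit word IDs packed big-endian into one 64-bit key.
        return struct.pack(">II", i, j)

    def unpack_key(key):
        return struct.unpack(">II", key)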

So far I have tried:

  • Redis (64GB RAM is not large enough)
  • TileDB SparseArray (no way to add to values)
  • SQLite (way too slow)
  • LMDB (batching the 300 million reads/writes into transactions takes multiple hours to execute)
  • Zarr (coordinate-based updating is SUPER slow)
  • Scipy .npz (can’t keep the matrices in memory for the addition step)
  • sparse COO with memmapped coords and data (RAM usage is massive when adding matrices)

Right now the only solution that works well enough is LMDB, but the runtime is ~12 days, which seems unreasonable since it does not feel like I’m processing that much data. For comparison, saving the sub-matrix (with ~300M values) to disk using .npz is almost instant.

Any ideas?

3 Answers


  1. PySpark is more useful here. The example below (the Java version of the pattern, from the Learning Spark chapter linked underneath) keys each record by its first field:

    // Key each line by its first whitespace-separated field.
    PairFunction<String, String, String> keyData =
      new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
          return new Tuple2<String, String>(x.split(" ")[0], x);
        }
      };
    

    JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
    https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
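
    A rough PySpark sketch of the same idea, applied to the co-occurrence counts (assuming `lines` is an RDD with one "word_i word_j" pair per line; the file and app names here are made up), could be:

    from pyspark import SparkContext

    sc = SparkContext(appName="cooccurrence-counts")

    # Hypothetical input: one "word_i word_j" pair per line.
    lines = sc.textFile("pairs.txt")

    # Key each line by the pair string, then sum the occurrences.
    counts = lines.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

    counts.saveAsTextFile("cooccurrence_counts")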

  2. You might want to check out my project.

    pip install rocksdict

    This is a fast on-disk key-value store based on RocksDB, and it can take any Python object as a value. I consider it reliable and easy to use. Its performance is on par with GDBM, but it is cross-platform, whereas GDBM is only available for Python on Linux.

    https://github.com/Congyuwang/RocksDict

    Below is a demo:

    from rocksdict import Rdict, Options
    
    path = "./test_dict"
    
    # create a Rdict with default options at `path`
    db = Rdict(path)
    
    db[1.0] = 1
    db[1] = 1.0
    db["huge integer"] = 2343546543243564534233536434567543
    db["good"] = True
    db["bad"] = False
    db["bytes"] = b"bytes"
    db["this is a list"] = [1, 2, 3]
    db["store a dict"] = {0: 1}
    
    import numpy as np
    db[b"numpy"] = np.array([1, 2, 3])
    
    import pandas as pd
    db["a table"] = pd.DataFrame({"a": [1, 2], "b": [2, 1]})
    
    # close Rdict
    db.close()
    
    # reopen Rdict from disk
    db = Rdict(path)
    assert db[1.0] == 1
    assert db[1] == 1.0
    assert db["huge integer"] == 2343546543243564534233536434567543
    assert db["good"] == True
    assert db["bad"] == False
    assert db["bytes"] == b"bytes"
    assert db["this is a list"] == [1, 2, 3]
    assert db["store a dict"] == {0: 1}
    assert np.all(db[b"numpy"] == np.array([1, 2, 3]))
    assert np.all(db["a table"] == pd.DataFrame({"a": [1, 2], "b": [2, 1]}))
    
    # iterate through all elements
    for k, v in db.items():
        print(f"{k} -> {v}")
    
    # batch get:
    print(db[["good", "bad", 1.0]])
    # [True, False, 1]
     
    # delete the Rdict object and remove the database files from disk
    del db
    Rdict.destroy(path)
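
    For the co-occurrence counting in the question, the read-modify-write loop could look roughly like this (a sketch only; it assumes `in` membership tests behave like a plain dict and that keys are the packed 8-byte word-pair values, so check the RocksDict docs for the version you install):

    from rocksdict import Rdict

    db = Rdict("./cooccurrence_db")

    def add_counts(db, pairs):
        # `pairs` is assumed to be an iterable of (key, count) updates,
        # e.g. packed word-ID pairs mapped to partial counts.
        for key, count in pairs:
            # Read-modify-write; assumes missing keys can be detected with `in`.
            db[key] = (db[key] + count) if key in db else count

    db.close()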
    
  3. Have a look at Plyvel, a Python interface to LevelDB.

    I used it successfully several years ago, and both projects appear to still be active. My own use case was storing hundreds of millions of key:value pairs, and I was more focused on read performance, but it looks well optimized for writes too.
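
    A minimal sketch of the counting pattern with Plyvel (the path, key format, and helper names here are illustrative, not from the original answer; it assumes each key appears at most once per batch):

    import struct
    import plyvel

    db = plyvel.DB("./cooccurrence_ldb", create_if_missing=True)

    def add_counts(db, pairs):
        # `pairs` is assumed to yield (key_bytes, count) updates.
        with db.write_batch() as batch:
            for key, count in pairs:
                current = db.get(key)  # returns None when the key is absent
                total = count if current is None else count + struct.unpack(">q", current)[0]
                batch.put(key, struct.pack(">q", total))

    db.close()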
