I’m wondering if there is a fast on-disk key-value storage with Python bindings which supports millions of read/write calls to separate keys. My problem involves counting word co-occurrences in a very large corpora (Wikipedia), and continually updating co-occurrence counts. This involves reading and writing ~300 million values 70 times with 64 bit keys, and 64 bit values.
I can also represent my data as an upper-triangular sparse matrix with dimensions ~ 2M x 2M.
So far I have tried:
- Redis (64GB RAM is not large enough)
- TileDB SparseArray (no way to add to values)
- Sqlite (way too slow)
- LMDB (batching the 300 million read/write in transactions takes multiple hours to execute)
- Zarr (coordinate based updating is SUPER slow)
- Scipy .npz (can’t keep the matrices in memory for addition part)
- sparse COO with memmapped coords and data (RAM usage is massive when adding matrices)
Right now the only solution which works well enough is LMDB, but the runtime is ~12 days which seems unreasonable since it does not feel like I’m processing that much data. Saving the sub-matrix (with ~300M values) to disk using .npz is almost instant.
Any ideas?
3
Answers
PySpark is more useful here .
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
You might want to check out my project.
pip install rocksdict
This is a fast on-disk key-value storage based on RockDB, it can take any python object as value. I consider it to be reliable and easy to use. It has a performance that’s on par with GDBM, but it is cross-platform compared to GDBM which is only available for python on Linux.
https://github.com/Congyuwang/RocksDict
Below is a demo:
Have a look at Plyvel, which is a python interface to LevelDB.
I used it successfully several years ago, and both projects appear to still be active. My own use-case was storing 100s of millions of key:value pairs, and I was more focussed on read performance, but it looks optimized for write also.