
Can we somehow pass an option to run multiple threads/processes when we call Chroma.from_documents() in Langchain?

I am trying to embed 980 documents (the embedding model is mpnet on CUDA), and it takes forever.
Specs:
Software: Ubuntu 20.04 (on a Win11 WSL2 host), Langchain version: 0.0.253, PyTorch version: 2.0.1+cu118, Chroma version: 0.4.2, CUDA 11.8
Processor: Intel i9-13900K at 5.4 GHz on all 8 P-cores and 4.3 GHz on all 16 E-cores
GPU: RTX 4090
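
For reference, the call being described is roughly the following; the model name and the document list are assumptions based on the description above, not code from the question:

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.schema import Document
    from langchain.vectorstores import Chroma

    # Assumed setup: mpnet embeddings running on the GPU.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2",
        model_kwargs={"device": "cuda"},
    )

    # Stand-in for the real 980 documents.
    documents = [Document(page_content=f"document {i}") for i in range(980)]

    # from_documents() embeds and indexes everything in one call,
    # which is the slow step being asked about.
    vectordb = Chroma.from_documents(documents, embeddings)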

2 Answers


  1. Chroma now supports multiple threads, so this should be technically possible. Why not simply import threading and spawn multiple loader threads? A sketch of that idea follows below.
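
    A minimal sketch of that idea, assuming the 980 documents are plain-text files read with Langchain's TextLoader (the path pattern and thread count are illustrative, not taken from the question):

    import glob
    import threading

    from langchain.document_loaders import TextLoader

    # Illustrative location of the source files; point this at your data.
    paths = glob.glob("./docs/*.txt")

    loaded = []
    lock = threading.Lock()

    def load_shard(shard):
        # Each worker thread loads its own share of the files.
        docs = []
        for path in shard:
            docs.extend(TextLoader(path).load())
        with lock:
            loaded.extend(docs)

    n_threads = 8
    shards = [paths[i::n_threads] for i in range(n_threads)]
    workers = [threading.Thread(target=load_shard, args=(shard,)) for shard in shards]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    # `loaded` can then be passed to Chroma.from_documents() as before.

    Note that this only parallelizes reading the files; the embedding inside Chroma.from_documents() still runs as a single serial call.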

  2. Langchain does not expose a multiprocess option on Chroma.from_documents(), but you can parallelize the ingestion yourself: create the Chroma vector store once, split the documents into batches, and add the batches from a thread pool. Below is a rough sketch rather than a drop-in solution: the model name, batch size, and worker count are illustrative, and it assumes concurrent add_documents() calls on the same collection are safe in Chroma 0.4.x, which you should verify for your setup.

    import concurrent.futures

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.schema import Document
    from langchain.vectorstores import Chroma

    # mpnet embeddings on the GPU, as described in the question.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2",
        model_kwargs={"device": "cuda"},
    )

    # Stand-in for the real 980 documents.
    documents = [Document(page_content=f"Test document content {i}") for i in range(980)]

    # Create the collection once, up front.
    vectordb = Chroma(
        collection_name="docs",
        embedding_function=embeddings,
        persist_directory="./chroma_db",
    )

    def add_batch(batch):
        # Each call embeds its batch and writes it into the collection.
        vectordb.add_documents(batch)

    batches = [documents[i:i + 64] for i in range(0, len(documents), 64)]

    # Threads rather than processes keep the CUDA model in a single process.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(add_batch, batches))

    How much this helps depends on whether the GPU is already saturated by the embedding calls; if it is, the threads mostly overlap the I/O and bookkeeping around them.