
Can we somehow pass an option to run multiple threads/processes when we call Chroma.from_documents() in Langchain?

I am trying to embed 980 documents (the embedding model is mpnet on CUDA), and it takes forever.
Specs:
Software: Ubuntu 20.04 (on a Win11 WSL2 host), Langchain version: 0.0.253, PyTorch version: 2.0.1+cu118, Chroma version: 0.4.2, CUDA 11.8
Processor: Intel i9-13900K at 5.4 GHz on all 8 P-cores and 4.3 GHz on all 16 E-cores
GPU: RTX 4090
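
For reference, the call being described is roughly the following; the model name and the document list are assumptions based on the description above, not code from the question:

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.schema import Document
    from langchain.vectorstores import Chroma

    # Assumed setup: mpnet embeddings running on the GPU.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2",
        model_kwargs={"device": "cuda"},
    )

    # Stand-in for the real 980 documents.
    documents = [Document(page_content=f"document {i}") for i in range(980)]

    # from_documents() embeds and indexes everything in one call,
    # which is the slow step being asked about.
    vectordb = Chroma.from_documents(documents, embeddings)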

2 Answers


  1. Chroma now supports multiple threads, so this should be technically possible. Why not simply import threading and spawn multiple loader threads? A sketch of that idea follows below.
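
    A minimal sketch of that idea, assuming the 980 documents are plain-text files read with Langchain's TextLoader (the path pattern and thread count are illustrative, not taken from the question):

    import glob
    import threading

    from langchain.document_loaders import TextLoader

    # Illustrative location of the source files; point this at your data.
    paths = glob.glob("./docs/*.txt")

    loaded = []
    lock = threading.Lock()

    def load_shard(shard):
        # Each worker thread loads its own share of the files.
        docs = []
        for path in shard:
            docs.extend(TextLoader(path).load())
        with lock:
            loaded.extend(docs)

    n_threads = 8
    shards = [paths[i::n_threads] for i in range(n_threads)]
    workers = [threading.Thread(target=load_shard, args=(shard,)) for shard in shards]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    # `loaded` can then be passed to Chroma.from_documents() as before.

    Note that this only parallelizes reading the files; the embedding inside Chroma.from_documents() still runs as a single serial call.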

  2. Langchain does not expose a multiprocess option on Chroma.from_documents(), but you can parallelize the ingestion yourself: create the Chroma vector store once, split the documents into batches, and add the batches from a thread pool. Below is a rough sketch rather than a drop-in solution: the model name, batch size, and worker count are illustrative, and it assumes concurrent add_documents() calls on the same collection are safe in Chroma 0.4.x, which you should verify for your setup.

    import concurrent.futures

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.schema import Document
    from langchain.vectorstores import Chroma

    # mpnet embeddings on the GPU, as described in the question.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2",
        model_kwargs={"device": "cuda"},
    )

    # Stand-in for the real 980 documents.
    documents = [Document(page_content=f"Test document content {i}") for i in range(980)]

    # Create the collection once, up front.
    vectordb = Chroma(
        collection_name="docs",
        embedding_function=embeddings,
        persist_directory="./chroma_db",
    )

    def add_batch(batch):
        # Each call embeds its batch and writes it into the collection.
        vectordb.add_documents(batch)

    batches = [documents[i:i + 64] for i in range(0, len(documents), 64)]

    # Threads rather than processes keep the CUDA model in a single process.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(add_batch, batches))

    How much this helps depends on whether the GPU is already saturated by the embedding calls; if it is, the threads mostly overlap the I/O and bookkeeping around them.