
We are using Azure Cognitive Search as a vector database and generating embeddings with the Azure OpenAI text-embedding-ada-002 model for both the query and the documents (RAG pattern).

We are observing different results being produced for the same question with and without a "?" (question mark):

  1. What is Maize ?
  2. What is Maize
  3. What is Maize?

Questions:

  1. What is the impact of a '?' in vector search, especially in Azure Cognitive Search?
  2. What is the standard way of handling it?
  3. Is Azure Cognitive Search's vector search case sensitive?

Thanks -Nen

2 Answers


  1. Embeddings are not a character-by-character representation of the input; they map text into a continuous vector space. It is expected that different inputs, no matter how small the differences, will produce different vectors, which in turn may pull different results during search.

    They should be close since they are conceptually the same, but they aren’t going to be the same vector.

    Here are two ways of digging into this a bit more: first comparing embeddings directly, then looking at the tokenization side.

    Comparing embeddings

    Using the embeddings API, you can measure the distance between vectors directly, separating the model's behavior from the search/retrieval details:

    # These helpers shipped with the openai Python SDK v0.x
    # (openai.embeddings_utils); "embedding" here is the name of the
    # Azure OpenAI deployment of text-embedding-ada-002.
    from openai.embeddings_utils import get_embedding, cosine_similarity

    a = get_embedding("What is Maize?", engine="embedding")
    b = get_embedding("What is Maize ?", engine="embedding")
    c = get_embedding("What is Maize", engine="embedding")
    d = get_embedding("What is maize?", engine="embedding")
    e = get_embedding("What is corn?", engine="embedding")
    f = get_embedding("What is spinach?", engine="embedding")
    print("'?' vs ' ?'", cosine_similarity(a, b))
    print("'?' vs ''", cosine_similarity(a, c))
    print("' ?' vs ''", cosine_similarity(b, c))
    print("Maize vs maize", cosine_similarity(a, d))
    print("maize vs corn", cosine_similarity(a, e))
    print("maize vs spinach", cosine_similarity(a, f))
    

    I get:

    '?' vs ' ?' 0.9789760561431554
    '?' vs '' 0.9726684993796191
    ' ?' vs '' 0.9646235430443343
    Maize vs maize 0.982432778637022
    maize vs corn 0.9262367100603125
    maize vs spinach 0.8305263015872602
    
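If the `openai.embeddings_utils` helpers aren't available in your environment, cosine similarity is simple to compute yourself; a minimal numpy sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because embedding vectors are all roughly unit length, this is equivalent to the dot product the search index uses for ranking.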

    OpenAI tokenizer:

    You can try the tokenizer here: https://platform.openai.com/tokenizer
    Two observations you can make: a) the "?" is treated as a token in its own right (i.e. not ignored or stripped), and b) different casing produces different tokens.

    (Tokenizer screenshots omitted: they show "?" tokenized as its own token, and different token IDs for "Maize" vs "maize" and for "?" with vs without a leading space.)
  2. Adding on to Pablo’s answer, some interesting things to note:

    • Trailing whitespace typically has a negative effect on text completion because of the way completion models are trained. Since you mentioned the RAG pattern, this might be something to note. This blog post has a good explanation of why.

    • The token IDs for "?" and " ?" (note the extra space) are different, which results in slight differences in the embedding vector. Minor differences in casing, spaces, or punctuation likewise produce different token IDs (as seen in Pablo's screenshots). Well-trained embedding models should be robust to these differences, so the resulting vectors should be close together (as Pablo's comparisons show). However, since the vectors are slightly different, the approximate nearest neighbors may vary, because the query vector locations are not identical.

    The specific behavior you ask about is difficult to quantify definitively, since LLM embedding models are somewhat of a black box. Whether case sensitivity, punctuation, or extra whitespace matters depends on the embedding model. For ada-002, I would expect slight differences in the embedding vectors, but the vectors should remain close to each other.
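To see how near-identical query vectors can still retrieve different nearest neighbors, here is a hypothetical 2-D toy (real ada-002 vectors have 1536 dimensions, but the geometry is the same): two unit "document" vectors 20 degrees apart, and two query vectors only 2 degrees apart that straddle the midpoint between them:

```python
import numpy as np

def unit(deg):
    """Unit vector at the given angle, in degrees."""
    r = np.deg2rad(deg)
    return np.array([np.cos(r), np.sin(r)])

docs = np.stack([unit(0.0), unit(20.0)])  # document vectors d0 and d1
q1, q2 = unit(9.0), unit(11.0)            # two nearly identical queries

print("q1 vs q2 similarity:", float(q1 @ q2))            # ~0.999
print("nearest doc for q1:", int(np.argmax(docs @ q1)))  # 0
print("nearest doc for q2:", int(np.argmax(docs @ q2)))  # 1
```

The two queries have cosine similarity above 0.999, yet each lands on the other side of the decision boundary between the documents, so the top result flips. The same effect can occur with a question that does or does not end in "?".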
