We are using Azure Cognitive Search as a vector database and generating embeddings with the Azure OpenAI ada-002 model for both the query and the documents (RAG pattern).
We are observing different results for the same question with and without a "?" (question mark):
- What is Maize ?
- What is Maize
- What is Maize?
Questions:
- What is the impact of a "?" in vector search, especially in Azure Cognitive Search?
- What is the standard way of handling it?
- Is Azure Cognitive Search vector search case sensitive?
Thanks -Nen
2 Answers
Embeddings are not a character-by-character representation of the input; they are a mapping into a continuous vector space. So it's expected that different inputs, no matter how small the differences, produce different vectors, and thus may pull different results during search.
They should be close since they are conceptually the same, but they aren’t going to be the same vector.
Here are two ways of digging into this a bit more, first comparing embeddings directly, second looking at the tokenization side:
Comparing embeddings
Using the embeddings API you can look at the distance between vectors directly, to separate them from the search/retrieval details:
[Screenshot of the embedding-comparison code and its similarity output omitted.]
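A sketch of that kind of comparison, using the openai v1 Python SDK against an Azure OpenAI deployment of ada-002 (the endpoint, key, and deployment name below are placeholders, not values from the original answer):

```python
# Compare embeddings of the query variants directly with cosine similarity,
# independent of any search/retrieval behavior.
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def embed(text: str) -> list[float]:
    # Assumes the openai v1 SDK; all connection values are placeholders.
    from openai import AzureOpenAI  # imported lazily; requires credentials

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<your-key>",                                       # placeholder
        api_version="2023-05-15",
    )
    resp = client.embeddings.create(
        model="<your-ada-002-deployment>",  # deployment name, not model id
        input=text,
    )
    return resp.data[0].embedding


# Usage (requires credentials):
#   vecs = {q: embed(q) for q in ["What is Maize ?", "What is Maize", "What is Maize?"]}
#   cosine(vecs["What is Maize"], vecs["What is Maize?"])  # expected: close to 1
```

The similarities between the three variants should come out high (they are conceptually the same question), but not exactly 1.0.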
OpenAI tokenizer:
You can try the tokenizer here: https://platform.openai.com/tokenizer
Two observations you can make: a) the "?" is considered a token (i.e. it is not ignored or dropped), and b) different casing produces different tokens.
Adding on to Pablo’s answer, some interesting things to note:
Trailing whitespace typically has a negative effect on text completion because of the way text completion models are trained. Since you mentioned the RAG pattern, this might be something to note. This blog post has a good explanation of why.
The token IDs for `?` and ` ?` (note the extra space) are different, which will result in slight differences in the embedding vector representation. Minor differences in casing, spaces, or punctuation will also result in different token IDs (as seen in Pablo's screenshots). Well-trained embedding models should be robust to these differences, so the embedding vectors should be close together (as Pablo shows with the embedding vector comparisons). However, since the vectors are slightly different, the approximate nearest neighbors may have some variation because the query vector locations are not identical.

The specific behavior you ask about is difficult to quantify definitively, since LLM embedding models are somewhat of a black box. Whether casing, punctuation, or extra whitespace matters depends on the embedding model. For `ada-002`, I would expect slight differences in the vector embeddings, but the vectors should be similar to each other.