I’m building an Azure Retrieval-Augmented Generation (RAG) application using Azure Cognitive Search to process trainee-related data stored in JSON files. Here’s the general workflow:
- Data Structure: Each JSON file represents a plan (e.g., free-trainees-project-data.json, premium-trainees-project-data.json). Below is a sample structure:
[
  {
    "overview": "The user John Doe (UUID: 00000000-0000-0000-0000-000000000001) has subscribed to the plan 'Free' starting from 2024-07-08 00:00:00.0. The user isn't associated with any tracks or projects.",
    "user_name": "John Doe",
    "user_uuid": "00000000-0000-0000-0000-000000000001",
    "plan_start_date": "2024-07-08 00:00:00.0",
    "plan_name": "Free",
    "tracks": []
  },
  {
    "overview": "The user Jane Smith (UUID: 00000000-0000-0000-0000-000000000002) has subscribed to the plan 'Free' starting from 2024-02-21 00:00:00.0. The user is associated with the following skill tracks: Track 'Quality Assurance' includes the following projects: [Project 'Sample Project A' (UUID: 00000000-0000-0000-0000-000000000003) has tags [[\"TESTNG\",\"Postman\"]], difficulty level 'intermediate', and is currently 'In Progress'. It is led by Team Lead ABC, started on 2024-06-06 08:14:06.758, and ended on Ongoing. ]",
    "user_name": "Jane Smith",
    "user_uuid": "00000000-0000-0000-0000-000000000002",
    "plan_start_date": "2024-02-21 00:00:00.0",
    "plan_name": "Free",
    "tracks": [
      {
        "track_name": "Quality Assurance",
        "projects": [
          {
            "project_name": "Sample Project A",
            "project_uuid": "00000000-0000-0000-0000-000000000003",
            "project_tags": "[\"TESTNG\",\"Postman\"]",
            "team_lead": "Team Lead ABC",
            "joining_date": "2024-06-06 08:14:06.758",
            "project_difficulty": "intermediate",
            "project_status": "In Progress",
            "updated_by": "Team Lead ABC",
            "exit_date": "Ongoing"
          }
        ]
      }
    ]
  },
  ...
]
- Azure Workflow:
  - Upload JSON files to an Azure Storage container.
  - Configure Azure Cognitive Search to parse the JSON array using JSON Array Parsing mode.
  - Vectorize the overview field using the text-embedding-3-large model.
- Problem: Queries return unreliable results. For example:
  - Query: "List users in the Starter plan without any tracks or projects"
    - Expected: A list of users meeting this condition.
    - Actual: Incorrect users or an incomplete list.
  - Query: "Count the number of users subscribed to the Free plan."
    - Expected: Accurate count.
    - Actual: Incorrect count is returned.
Steps Taken:
- Verified JSON structure and parsing mode.
- Used overview as the vectorization target.
- Verified the connection between Azure Search and OpenAI services.
Questions:
- Am I structuring my JSON data or vectorization process incorrectly?
- How can I improve query accuracy for such use cases?
Environment Details:
- Azure Cognitive Search with OpenAI integration.
- Model: text-embedding-3-large.
My Code:
@Override
public Flux<ChatCompletions> getOpenAIAsyncClientChatStream(
    ModelConfiguration modelConfiguration, List<ChatRequestMessage> messages) {
  AIChatModel aiChatModel = modelConfiguration.getModel();
  OpenAIAsyncClient openAIAsyncClient =
      new OpenAIClientBuilder()
          .credential(new AzureKeyCredential(aiChatModel.getApiKey()))
          .endpoint(aiChatModel.getEndpoint())
          .buildAsyncClient();
  ChatCompletionsOptions chatCompletionsOptions = new ChatCompletionsOptions(messages);
  chatCompletionsOptions.setMaxTokens(aiChatModel.getMaxTokens());
  chatCompletionsOptions.setTemperature(Double.valueOf(aiChatModel.getTemperature()));
  if (modelConfiguration.getDataIngestion()) {
    AzureSearchChatExtensionParameters searchParameters =
        getAzureSearchChatExtensionParameters(modelConfiguration);
    AzureSearchChatExtensionConfiguration searchChatExtension =
        new AzureSearchChatExtensionConfiguration(searchParameters);
    chatCompletionsOptions.setDataSources(List.of(searchChatExtension));
  }
  return openAIAsyncClient.getChatCompletionsStream(
      aiChatModel.getDeploymentName(), chatCompletionsOptions);
}

private AzureSearchChatExtensionParameters getAzureSearchChatExtensionParameters(
    ModelConfiguration modelConfiguration) {
  AzureIndex index = modelConfiguration.getIndex();
  OnYourDataApiKeyAuthenticationOptions authenticationOptions =
      new OnYourDataApiKeyAuthenticationOptions(searchApiKey);
  AzureSearchChatExtensionParameters searchParameters =
      new AzureSearchChatExtensionParameters(searchEndpoint, index.getIndexName());
  searchParameters.setAuthentication(authenticationOptions);
  searchParameters.setTopNDocuments(index.getTopNDocuments());
  searchParameters.setSemanticConfiguration(index.getSemanticConfiguration());
  searchParameters.setQueryType(AzureSearchQueryType.VECTOR_SEMANTIC_HYBRID);
  searchParameters.setInScope(index.getInScope());
  OnYourDataDeploymentNameVectorizationSource embeddingSource =
      new OnYourDataDeploymentNameVectorizationSource(index.getTextEmbeddingModel());
  searchParameters.setEmbeddingDependency(embeddingSource);
  return searchParameters;
}
My Azure Index JSON:
{
  "name": "all-trainees-json-index-production",
  "fields": [
    {
      "name": "chunk_id",
      "type": "Edm.String",
      "key": true,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": true,
      "facetable": false,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "parent_id",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": false,
      "filterable": true,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "chunk",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "text_vector",
      "type": "Collection(Edm.Single)",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": [],
      "dimensions": 3072,
      "vectorSearchProfile": "all-trainees-json-index-production-azureOpenAi-text-profile"
    },
    {
      "name": "user_name",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "user_uuid",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "plan_start_date",
      "type": "Edm.DateTimeOffset",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": false,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "plan_name",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "tracks",
      "type": "Collection(Edm.ComplexType)",
      "fields": [
        {
          "name": "track_name",
          "type": "Edm.String",
          "key": false,
          "retrievable": true,
          "stored": true,
          "searchable": true,
          "filterable": false,
          "sortable": false,
          "facetable": false,
          "synonymMaps": []
        },
        {
          "name": "projects",
          "type": "Collection(Edm.ComplexType)",
          "fields": [
            {
              "name": "project_name",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "project_uuid",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "project_tags",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "team_lead",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "joining_date",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "project_difficulty",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "project_status",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "updated_by",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            },
            {
              "name": "exit_date",
              "type": "Edm.String",
              "key": false,
              "retrievable": true,
              "stored": true,
              "searchable": true,
              "filterable": false,
              "sortable": false,
              "facetable": false,
              "synonymMaps": []
            }
          ]
        }
      ]
    },
    {
      "name": "AzureSearch_DocumentKey",
      "type": "Edm.String",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_content_type",
      "type": "Edm.String",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_size",
      "type": "Edm.Int64",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": false,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_last_modified",
      "type": "Edm.DateTimeOffset",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": false,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_content_md5",
      "type": "Edm.String",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_name",
      "type": "Edm.String",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_path",
      "type": "Edm.String",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_file_extension",
      "type": "Edm.String",
      "key": false,
      "retrievable": false,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "normalizers": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "all-trainees-json-index-production-semantic-configuration",
    "configurations": [
      {
        "name": "all-trainees-json-index-production-semantic-configuration",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "title"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "chunk"
            }
          ],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "all-trainees-json-index-production-algorithm",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "all-trainees-json-index-production-azureOpenAi-text-profile",
        "algorithm": "all-trainees-json-index-production-algorithm",
        "vectorizer": "all-trainees-json-index-production-azureOpenAi-text-vectorizer"
      }
    ],
    "vectorizers": [
      {
        "name": "all-trainees-json-index-production-azureOpenAi-text-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "https://custom-test-sample-openai.openai.azure.com",
          "deploymentId": "text-embedding-3-large",
          "apiKey": "<redacted>",
          "modelName": "text-embedding-3-large"
        }
      }
    ],
    "compressions": []
  },
  "@odata.etag": "\"0x8DD28A373EA1D3C\""
}
Import and vectorize data settings:
- Parsing mode: JSON array.
- Column to vectorize: overview.
Any insights or suggestions are highly appreciated!
2 Answers
RAG is not perfect, especially for certain prompts. For your two queries below:
Query: "Count the number of users subscribed to the Free plan."
Actual: Incorrect count is returned.
AI Search typically returns only the top-matched results. You can increase that setting, but even then not all matching records are guaranteed to be returned, so the model can only count the records it was actually given.
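To make the effect concrete, here is a minimal, self-contained sketch of why a top-N cap breaks counting. It does not call Azure at all: the class name TopNCapDemo and the retrieve method are hypothetical stand-ins, and topN only plays the role of the topNDocuments value set in the question's code.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class TopNCapDemo {
    // Hypothetical stand-in for retrieval: hands back at most topN of the matching records.
    static List<String> retrieve(List<String> matches, int topN) {
        return matches.stream().limit(topN).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Pretend 12 users are subscribed to the Free plan.
        List<String> freeUsers = IntStream.rangeClosed(1, 12)
                .mapToObj(i -> "user-" + i)
                .collect(Collectors.toList());

        int topN = 5; // plays the role of topNDocuments
        List<String> seenByModel = retrieve(freeUsers, topN);

        // The model can only count what it was shown.
        System.out.println("actual=" + freeUsers.size()
                + ", visible to model=" + seenByModel.size());
    }
}
```

With 12 matching users and topN set to 5, this prints actual=12, visible to model=5, which is the kind of undercount described in the question.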
Query: "List users in the Starter plan without any tracks or projects"
Actual: Incorrect users or an incomplete list.
The list is incomplete for the same reason as the first query.
As for the incorrect users, AI Search tries to find records by semantic similarity rather than by exact matching. The best way to investigate is to inspect which JSON records AI Search returns, or to test the query in the AI Search UI in the portal.
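When inspecting results this way, it may help to compare the semantic results against an exact filter query. The sketch below only builds the filter string and a JSON body for the search REST endpoint (POST to {searchEndpoint}/indexes/{indexName}/docs/search); it sends nothing. It rests on several assumptions: plan_name would have to be marked filterable (the posted index has it as "filterable": false), the service would have to accept the no-lambda any() form on the tracks collection, and each user would have to map to exactly one indexed document, otherwise the count reflects chunks rather than users. The class and method names are hypothetical.

```java
class SearchDebugQuery {
    // OData string literals escape a single quote by doubling it.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Filter for "users on the given plan with no tracks".
    // Assumes plan_name is filterable and tracks/any() is accepted for this collection.
    static String filterForPlanWithoutTracks(String planName) {
        return "plan_name eq " + quote(planName) + " and not tracks/any()";
    }

    // Request body using the search, filter, count, and select parameters.
    static String requestBody(String planName) {
        // Escape backslashes and double quotes so the filter is a valid JSON string.
        String filter = filterForPlanWithoutTracks(planName)
                .replace("\\", "\\\\")
                .replace("\"", "\\\"");
        return "{\"search\":\"*\",\"filter\":\"" + filter + "\",\"count\":true,"
                + "\"select\":\"user_name,user_uuid,plan_name\"}";
    }

    public static void main(String[] args) {
        System.out.println(requestBody("Free"));
    }
}
```

If the exact filter returns the expected users while the semantic query does not, the problem lies in retrieval rather than in the JSON structure.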
For these kinds of questions, my best suggestion is to look into function calling. The LLM will invoke the function (if it finds it relevant) and pass the relevant parameters; you then implement the function's logic to query the data (e.g., pull back the full list of users based on plan name).
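As a rough illustration of the function side only, here is a minimal sketch assuming the records have already been loaded into memory. The function names count_users_by_plan and list_users_by_plan, the argument names plan_name and without_tracks, and the TraineeFunctions class are all hypothetical. Registering the tools with the model and parsing its tool-call arguments are omitted, since those depend on the SDK version in use.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TraineeFunctions {
    // Minimal view of one record from the JSON files; only the fields used here.
    static final class Trainee {
        final String userName;
        final String planName;
        final int trackCount;

        Trainee(String userName, String planName, int trackCount) {
            this.userName = userName;
            this.planName = planName;
            this.trackCount = trackCount;
        }
    }

    // Exact, exhaustive lookup over all records instead of top-N retrieval.
    static List<String> listUsersByPlan(List<Trainee> all, String planName, boolean withoutTracksOnly) {
        return all.stream()
                .filter(t -> t.planName.equalsIgnoreCase(planName))
                .filter(t -> !withoutTracksOnly || t.trackCount == 0)
                .map(t -> t.userName)
                .collect(Collectors.toList());
    }

    // Dispatches a tool call (function name plus already-parsed arguments) to local code.
    static String handleToolCall(List<Trainee> all, String name, Map<String, Object> args) {
        String plan = (String) args.get("plan_name");
        boolean withoutTracks = Boolean.TRUE.equals(args.get("without_tracks"));
        List<String> users = listUsersByPlan(all, plan, withoutTracks);
        switch (name) {
            case "count_users_by_plan":
                return String.valueOf(users.size());
            case "list_users_by_plan":
                return String.join(", ", users);
            default:
                throw new IllegalArgumentException("Unknown function: " + name);
        }
    }

    public static void main(String[] args) {
        List<Trainee> data = List.of(
                new Trainee("John Doe", "Free", 0),
                new Trainee("Jane Smith", "Free", 1));
        System.out.println(handleToolCall(data, "count_users_by_plan", Map.of("plan_name", "Free")));
        System.out.println(handleToolCall(data, "list_users_by_plan",
                Map.of("plan_name", "Free", "without_tracks", true)));
    }
}
```

Run against the two sample users from the question, this prints 2 and then John Doe. The string returned by handleToolCall would be sent back to the model as the function result, so the count comes from your own code rather than from whatever documents retrieval happened to surface.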
@JayashankarGS
Is there any way to contact you? I have some questions about Azure Functions combined with Azure AI Search, Blob Storage, etc.