I have an indexer that reads through blob storage, then chunks and vectorizes the data into an index. This is working great. I also have a key field, let's call it `fileID`, that is stored in the metadata of the document and is also in the index. It is unique to the document; however, it is not unique after chunking, because a document is split into multiple documents that each carry the same `fileID`.

I want to have a second indexer that can add data from a SQL query into the index, joined on that `fileID`. However, since I can no longer use `fileID` as the key (the chunking process means it is not unique, and a key must be unique), how can I merge the data from the SQL query indexer into the index?

I'm guessing this is not possible right now, but if anyone has any suggestions, that would be amazing!
2 Answers
I ended up doing this with a custom web API skill using an Azure Function, which takes a record ID and returns additional fields from a SQL Server database.
Custom Skill Web API Documentation
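A rough sketch of such a function in Python, following the custom skill request/response contract; the `Documents` table and the `author`/`department` columns are hypothetical placeholders for your own join query:

```python
import json

import azure.functions as func
import pyodbc

CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=<db>;..."  # placeholder

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        for record in body["values"]:
            file_id = record["data"]["fileID"]
            # Hypothetical table and columns; substitute your own join query.
            row = cursor.execute(
                "SELECT author, department FROM Documents WHERE fileID = ?", file_id
            ).fetchone()
            results.append({
                # recordId must be echoed back unchanged per the skill contract.
                "recordId": record["recordId"],
                "data": ({"author": row.author, "department": row.department}
                         if row else {"author": None, "department": None}),
            })
    return func.HttpResponse(json.dumps({"values": results}), mimetype="application/json")
```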
Here is my skillset definition when all is said and done.
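In outline it looks like the following; the function URL, key, and the `author`/`department` output fields are placeholders for whatever your SQL query returns:

```json
{
  "name": "sql-join-skillset",
  "description": "Enriches each document with extra fields looked up in SQL Server",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "uri": "https://<function-app>.azurewebsites.net/api/lookup?code=<function-key>",
      "httpMethod": "POST",
      "timeout": "PT30S",
      "batchSize": 100,
      "context": "/document",
      "inputs": [
        { "name": "fileID", "source": "/document/fileID" }
      ],
      "outputs": [
        { "name": "author", "targetName": "author" },
        { "name": "department", "targetName": "department" }
      ]
    }
  ]
}
```

Note that the skill outputs still need `outputFieldMappings` on the indexer to land in the index fields.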
The key field is unique per source document, but it stays the same for all chunks created from that document. So it is not possible to get a unique ID for each chunk unless you create two separate indexes: one with the basic fields plus a custom web API skillset that does the chunking, and one for loading the chunked data with its own unique key field.
Here, the skillset takes the input, creates the chunked data for each document, and writes it to blob storage. Then, with that storage as a data source, it reads the key field as `uniqueid`. Next, the indexer is used for this index with the skillset. `tmp-index` below shows the fields.
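A minimal sketch of what `tmp-index` could look like; apart from the `uniqueid` key, the fields are illustrative:

```json
{
  "name": "tmp-index",
  "fields": [
    { "name": "uniqueid", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "metadata_storage_name", "type": "Edm.String", "filterable": true }
  ]
}
```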
Indexer definition.
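Roughly like this, with placeholder names:

```json
{
  "name": "tmp-indexer",
  "dataSourceName": "<your-original-datasource>",
  "targetIndexName": "tmp-index",
  "skillsetName": "chunking-skillset",
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  }
}
```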
Here, the `dataSourceName` is configured as your original DataSource, and the indexer is given the skillset, which takes the input and writes the chunked data in JSON format to the storage account.

Skillset definition.
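A sketch of such a skillset; the function URL and the input sources shown are assumptions:

```json
{
  "name": "chunking-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "Sends each document's text and id to the function, which writes chunk files to blob storage",
      "uri": "https://<function-app>.azurewebsites.net/api/chunk?code=<function-key>",
      "httpMethod": "POST",
      "timeout": "PT60S",
      "context": "/document",
      "inputs": [
        { "name": "text", "source": "/document/content" },
        { "name": "document_id", "source": "/document/uniqueid" }
      ],
      "outputs": [
        { "name": "status", "targetName": "chunk_status" }
      ]
    }
  ]
}
```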
Here, a POST request is made to the web API, passing the inputs.
You can configure the skillset according to your requirements. In the web API, you can use these inputs: from `text`, create chunks of size 1024, and from `document_id` plus a `chunk_id`, create a uniquely named file for each chunk. You can create a new container and write to it according to your requirements, but make sure you provide this container as the data source in the next step.
So, the script in your web API should create unique folders and files, then write the content.
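A rough sketch of that script as a Python Azure Function; the container name, the connection-string app setting, and the JSON payload shape are assumptions chosen to match the index below:

```python
import json
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient

CHUNK_SIZE = 1024
CONTAINER = "chunked-docs"  # placeholder: the container tmp-datasource-chunk points at

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    container = BlobServiceClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"]  # placeholder app setting
    ).get_container_client(CONTAINER)

    results = []
    for record in body["values"]:
        text = record["data"]["text"]
        document_id = record["data"]["document_id"]
        # One JSON file per chunk, under a folder per document, so every
        # blob path (and every uniqueid) is unique.
        for chunk_id, start in enumerate(range(0, len(text), CHUNK_SIZE)):
            payload = {
                "uniqueid": f"{document_id}_{chunk_id}",
                "document_id": document_id,
                "chunk_id": chunk_id,
                "text": text[start:start + CHUNK_SIZE],
            }
            container.upload_blob(
                f"{document_id}/{chunk_id}.json", json.dumps(payload), overwrite=True
            )
        # The custom skill contract expects one response value per input record.
        results.append({"recordId": record["recordId"], "data": {"status": "ok"}})

    return func.HttpResponse(json.dumps({"values": results}), mimetype="application/json")
```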
Next, create a new index and indexer with the definitions below.
Chunked-index definition.
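For example (the `author` and `department` fields stand in for whatever your join query returns):

```json
{
  "name": "chunked-index",
  "fields": [
    { "name": "uniqueid", "type": "Edm.String", "key": true },
    { "name": "text", "type": "Edm.String", "searchable": true },
    { "name": "document_id", "type": "Edm.String", "filterable": true },
    { "name": "chunk_id", "type": "Edm.Int32", "filterable": true },
    { "name": "author", "type": "Edm.String", "filterable": true },
    { "name": "department", "type": "Edm.String", "filterable": true }
  ]
}
```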
Here, you can also add your extra fields, which are the results of the join query.
Indexer definition.
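A minimal sketch, assuming the chunk files were written as JSON:

```json
{
  "name": "chunked-indexer",
  "dataSourceName": "tmp-datasource-chunk",
  "targetIndexName": "chunked-index",
  "parameters": {
    "configuration": { "parsingMode": "json" }
  }
}
```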
`tmp-datasource-chunk` is the DataSource created using blob storage, pointing to the container where you wrote the chunked data earlier. Now, you can also add a custom web API skill here, which does the join on the unique ID.