We have a cosmos-db container which has about 1M records containing information about customers. The partition key for the documentDb is customerId which holds a unique GUID reference for the customer. I have read the partitioning and scaling document which would suggest that our choice of key appears appropriate, however if we want to query this data using a field such as DOB or Address, the query will be considered as a cross-partition query and will essentially send the same query to every record in the documentDb before returning.
The query stats in Data Explorer suggests that a query on customer address will return the first 200 documents at a cost of 36.9 RU’s but I was under the impression that this would be far higher given the amount of records that this query would be sent to? Are these query stats accurate?
It is likely that we will want to extend our app to be able to query on multiple non-partition data elements so are we best replicating the customer identity and searchable data element within another documentDb using the desired searchable data element as the partition key. We can then return the identities of all customers who match the query. This essentially changes the query to be an in-partition query and should prevent additional expenditure?
Our current production database has a 4000 (Max Throughput)(Shared) so there appears to be adequate provision for cross-partition queries so would I be wasting my time building out a change-feed to maintain a partitioned representation of the data to support in-partition queries over cross-partition queries?
2
Answers
To get accurate estimate of query cost you need to do the measurement on a container that has a realistic amount of data within it. For example, if I have a container with 5000 RU/s and 5GB of data my cross-partition query will be fairly inexpensive because it only ran on a single physical partition.
If I ran that same query on a container with 100,000 RU/s I would have > 10 physical partitions and the query would show much greater RU/s reported due to the query having to execute across all 10 physical partitions. (Note: 1 physical partition has maximum 10,000 RU/s or 50GB of storage).
It is impossible to say at what amount of RU/s and storage you will begin to get a more realistic number for RU charges. I also don’t know how much throughput or storage you need. If the workload is small then maybe you only need 10K RU and < 50GB of storage. It’s only when you need to scale out that is where you need to first scale out, then measure your query’s RU charge.
To get accurate query measurements, you need to have a container with the throughput and amount of data you would expect to have in production.
You don’t necessarily need to be afraid of cross-partition queries in CosmosDB. Yes, single-partition queries are faster, but if you need to query "find any customers matching X" then cross-partition query is naturally required (unless you really need the hassle of duplicating the info elsewhere in optimized form).
The cross-partition query will not be sent to "each document" as long as you have good indexes in partitions. Just make sure every query has a predicate on a field that is:
.. and the returned number of docs should be limited by business model or forced (top N). This way your RU should be more-or-less top-bound.
36RU per 200 returned docs does not sound too bad as long as it’s not done too many times per sec. But if in doubt, test with predicted data volume and fire up some realistic queries..