
I have a Cosmos DB (using the MongoDB API) running at an average load of about 30% of its RUs. Very sporadically (maybe twice a week) there’s a huge spike that takes it to 100% for about 1-2 minutes. I usually only get word of this after the fact, so using db.currentOp() is not an option…

In Sql Server I could run

SELECT TOP 10
    SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(qt.TEXT)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset)/2)+1) AS statement_text,
    qs.execution_count,
    qs.total_logical_reads, qs.last_logical_reads,
    qs.total_logical_writes, qs.last_logical_writes,
    qs.total_worker_time,
    qs.last_worker_time,
    qs.total_elapsed_time/1000000 AS total_elapsed_time_in_S,
    qs.last_elapsed_time/1000000 AS last_elapsed_time_in_S,
    qs.last_execution_time,
    qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_logical_reads DESC -- logical reads
-- ORDER BY qs.total_logical_writes DESC -- logical writes
-- ORDER BY qs.total_worker_time DESC -- CPU time

and get everything about that damn query…

Is there a similar query for Cosmos/Mongo?

2 Answers


  1. There are no DMV equivalents that collect this sort of information by default.

    You can get some clues by looking at the "Mongo Request Charge" metric filtered to the collection and split by command name for the minutes of interest, but this may well not be sufficient. (Sometimes it is, though. If you see a spike in Request Charge for the "insert" command, you can then use the "Mongo Requests" metric to establish whether there was a high volume of these or just a few expensive individual ones. If you see a spike in Request Charge for "aggregate" commands and your code base does not issue any, you might suspect an expensive manual ad hoc query.)

    Otherwise you need to collect the MongoRequests in diagnostic logging.

    I would probably collect this in Blob Storage, as it is much cheaper than App Insights; you can then just download the files for the minutes of interest after the fact and see what was causing the spike. You should also consider whether or not to enable the "Diagnostics full-text query" feature, which determines whether the plain text of the query is collected or only an anonymised version.
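
    If you route the MongoRequests category to a Log Analytics workspace instead of (or as well as) Blob Storage, a rough KQL sketch for breaking a spike window down by command might look like the following. It assumes the resource-specific CDBMongoRequests table mentioned in the other answer; the OperationName column name and the example time window are assumptions to verify against your own workspace schema.

      // Sketch: total and max request charge per Mongo command per minute
      // during the spike window (timestamps are placeholders).
      CDBMongoRequests
      | where TimeGenerated between (datetime(2024-01-01 10:00:00) .. datetime(2024-01-01 10:05:00))
      | summarize TotalRU = sum(todouble(RequestCharge)),
                  Requests = count(),
                  MaxRU = max(todouble(RequestCharge))
          by OperationName, bin(TimeGenerated, 1m)
      | order by TotalRU desc

    A high TotalRU with a low Requests count points at a few expensive operations, while the reverse points at a volume spike, which is the same distinction the metrics give you.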

    How are you measuring "takes it to 100% for about 1-2 minutes"?

    The normalized RU consumption metric is not very granular or informative when it comes to seeing whether the collection is truly getting hammered. If it shows 100% for a given minute, all that means is that at least one partition used up its per-second request unit budget for at least one second in that minute.

    I have some collections that show up as constantly at 100% Normalized RU Consumption but in reality are using a fairly small percentage of the provisioned request units. If your collection is provisioned at 100,000 RU/s across 200 physical partitions, the "per partition, per second" budget is just 500 request units. If the documents are large and wildcard indexing is in use, a single document insert can consume that.

    This is a lot different from being hammered constantly for minutes. You should at least split the metric by partition key range to see whether multiple physical partitions are hot or just one of them, and (where multiple consecutive minutes show 100%) whether it is consistently the same partition at 100%. Also correlate this with the metric for throttled requests.
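
    If the diagnostic logs are going to a workspace, the per-partition view can also be sketched in KQL. This assumes the PartitionKeyRUConsumption category is enabled and lands in the resource-specific CDBPartitionKeyRUConsumption table; the column names and time window are assumptions to check against your own schema.

      // Sketch: which physical partitions consumed the RUs during the spike window
      // (timestamps are placeholders).
      CDBPartitionKeyRUConsumption
      | where TimeGenerated between (datetime(2024-01-01 10:00:00) .. datetime(2024-01-01 10:05:00))
      | summarize TotalRU = sum(todouble(RequestCharge))
          by PartitionKeyRangeId, bin(TimeGenerated, 1m)
      | order by TotalRU desc

    If one PartitionKeyRangeId dominates, you have a hot partition rather than a collection-wide overload.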

  2. You can find the most expensive queries using diagnostic logging. Make sure you use resource-specific logging; don’t use AzureDiagnostics, which is more expensive.

    Make sure you enable full-text query so the actual query text shows up in the logs.

    Here are the docs on diagnostic log queries.

    This one finds the most expensive queries.

      // Resource-specific diagnostic tables (requires the "resource specific" destination mode)
      CDBMongoRequests
      | where todouble(RequestCharge) > 10.0
      // Convert to a number so the final sort is numeric (RequestCharge may be stored as a string)
      | project ActivityId, RequestCharge = todouble(RequestCharge)
      | join kind=inner (
          CDBQueryRuntimeStatistics
          | project ActivityId, QueryText
      ) on $left.ActivityId == $right.ActivityId
      | order by RequestCharge desc
      | limit 100
    