We are experiencing issues in our .NET Core 3.1 integration with Azure Redis Cache.
The exception thrown is
An unhandled exception has occurred while executing the request.
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=1403KiB, inbound=5657KiB, 15000ms elapsed, timeout is 15000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 709, aw: True, rs: ReadAsync, ws: Writing, in: 0, serverEndpoint: redis-scr-mns-dev.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxxxxxxxxxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=7,Free=32760,Min=4,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Yes, I have already read that article, and we are using the latest available version of the StackExchange.Redis NuGet package. Steps we have already taken:
- Set the minimum thread pool count to several different values (ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);)
- Increased the Redis timeout from the default 5 seconds to 15 seconds (going any higher will not solve it, I think, as you will read a bit further :)). A rough sketch of both steps follows this list.
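Roughly, the two steps look like this (a sketch only; the timeout here is raised via the syncTimeout/asyncTimeout options of the StackExchange.Redis configuration string, values in milliseconds, and the host/access key are placeholders):

using System.Threading;

public static class StartupTweaks
{
    // Sketch of the two mitigations above; host and access key are placeholders.
    public static string ConfigureRedis()
    {
        // 1) Raise the minimum thread pool size so bursts of completions are not
        //    throttled by thread-injection delays (we tried several values).
        ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);

        // 2) Raise the Redis timeouts from the default 5s to 15s; syncTimeout and
        //    asyncTimeout are StackExchange.Redis configuration-string options (ms).
        return "redis-scr-mns-dev.redis.cache.windows.net:6380," +
               "password=<access-key>,ssl=True,abortConnect=False," +
               "syncTimeout=15000,asyncTimeout=15000";
    }
}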
What is the setup, you ask?
- A .NET Core 3.1 REST API running on the latest IIS with a 3 worker threads setting, on a 4-core Windows server with 16 GB of RAM (we don't see anything extreme in the monitoring for CPU or memory)
- Connected to Azure Redis Cache, currently a Basic C5 with high network bandwidth and 23 GB of memory (it was a lower tier before, so we already tried scaling up)
- Pushing requests to an Azure Service Bus at the end (no problems there)
A batch process is running that makes a couple of tens of thousands of API calls (across several APIs), of which the one mentioned above keeps failing against the Redis cache with the timeout exception. The other APIs run correctly and do not time out, but they are currently connected to a different Redis cache (just to isolate this API's behavior).
All api’s and/or batch programs are using a custom NuGet package that has the cache implementation, so we are sure it can’t be an implementation issue in that 1 api, all shared code.
How do we use the cache? Via dependency injection we inject ISharedCacheStore, which is our own interface on top of IDistributedCache to make sure only asynchronous calls are available, together with RedisCache, which is the implementation using Redis (ISharedCacheStore exists for future use of other caching mechanisms).
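The actual interface is not shown here, but the shape is roughly this (an illustrative sketch only; the member signatures are assumptions, not our exact code, and in reality the wiring goes through IRedisCache/RedisCache rather than IDistributedCache directly):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Illustrative shape only: an async-only abstraction over IDistributedCache.
public interface ISharedCacheStore
{
    Task<byte[]> GetAsync(string key, CancellationToken token = default);
    Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default);
    Task RemoveAsync(string key, CancellationToken token = default);
}

public class SharedCacheStore : ISharedCacheStore
{
    private readonly IDistributedCache _cache;

    public SharedCacheStore(IDistributedCache cache) => _cache = cache;

    public Task<byte[]> GetAsync(string key, CancellationToken token = default)
        => _cache.GetAsync(key, token);

    public Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default)
        => _cache.SetAsync(key, value, options, token);

    public Task RemoveAsync(string key, CancellationToken token = default)
        => _cache.RemoveAsync(key, token);
}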
We use Microsoft.Extensions.Caching.StackExchangeRedis, version 3.1.5, and the registration in Startup is:
services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
    .AddStackExchangeRedisCache(s =>
    {
        s.Configuration = connectionString;
    })
    .AddTransient<IRedisCache, RedisCache>()
    .AddTransient<ISharedCacheStore, SharedCacheStore>();
We are out of ideas, to be honest. We don't see an issue with the Redis Cache instance in Azure, as it is not even near its limits when we get the timeouts. Server load hit about 80% on the lower pricing tier, and didn't even reach 10% on the current, higher one.
According to Insights, we see about 4,000 cache hits per minute on the run we did, causing roughly 10% server load.
UPDATE: It is worth mentioning that the batch and the API are currently running in an on-premises environment, not in the cloud. The move to the cloud is planned in the upcoming months.
This also applies to the other APIs connecting to a Redis cache that are NOT having issues.
Comparison
- Another Azure Redis cache is getting 45K hits a minute without any issue whatsoever (from on-premises)
- This one hits the timeout mark without even reaching 10K hits per minute
2 Answers
So, we found the issue. It sits in the registration of our classes, which is AddTransient, as shown in the original code above. Changing this to AddScoped makes performance a lot faster; we are even wondering whether it could be a singleton. The weird thing is that AddTransient should increase 'connected clients', which it does as a matter of fact, but it apparently also has a bigger impact on the number of requests that can be handled, since we never reached the max connections limit during processing.
With this code instead of AddTransient, we did 220,000 operations in a 4-5 minute period without any issues, whereas with the old code we didn't even reach 40,000 operations because of timeout exceptions.
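For reference, the only change is the service lifetime in the registration shown in the question; everything else stays the same:

services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
    .AddStackExchangeRedisCache(s =>
    {
        s.Configuration = connectionString;
    })
    // Scoped instead of transient (possibly even a singleton would work),
    // so a new cache wrapper is not created for every resolution.
    .AddScoped<IRedisCache, RedisCache>()
    .AddScoped<ISharedCacheStore, SharedCacheStore>();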
There's a couple of possible things here:
- It isn't clear what that EVAL is doing; it could be that the Lua being executed is causing a blockage. The only way to know for sure would be to look at SLOWLOG, but I don't know whether this is exposed on Azure Redis.
I don't think there's a simple "add this line and everything becomes great" answer here. These are non-trivial at-scale remote scenarios that take a lot of investigation. And simply: the Azure folks don't pay for our time.
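If SLOWLOG is exposed, one way to check it from the client side is via IServer.SlowlogGet in StackExchange.Redis (a sketch only; endpoint and credentials are placeholders):

using System;
using StackExchange.Redis;

class SlowlogCheck
{
    static void Main()
    {
        // Connect with the same options the application uses; access key omitted here.
        var muxer = ConnectionMultiplexer.Connect(
            "redis-scr-mns-dev.redis.cache.windows.net:6380,password=<access-key>,ssl=True,abortConnect=False");
        var server = muxer.GetServer("redis-scr-mns-dev.redis.cache.windows.net", 6380);

        // Dump the last 10 slow log entries; long-running EVAL entries here would
        // point at the Lua script as the blockage.
        foreach (var entry in server.SlowlogGet(10))
        {
            Console.WriteLine($"{entry.Time:u}  {entry.Duration.TotalMilliseconds} ms  {string.Join(" ", entry.Arguments)}");
        }
    }
}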