
We are experiencing issues in our .NET Core 3.1 integration with the Azure Redis Cache.
The exception thrown is:

    An unhandled exception has occurred while executing the request.
    StackExchange.Redis.RedisTimeoutException: Timeout awaiting response
    (outbound=1403KiB, inbound=5657KiB, 15000ms elapsed, timeout is 15000ms),
    command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 709, aw: True, rs: ReadAsync,
    ws: Writing, in: 0, serverEndpoint: redis-scr-mns-dev.redis.cache.windows.net:6380,
    mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxxxxxxxxxx,
    IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=7,Free=32760,Min=4,Max=32767),
    v: 2.1.58.34321 (Please take a look at this article for some common client-side
    issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Yes, I have already read that article, and we are using the latest available version of the StackExchange.Redis NuGet package. Steps we have already taken:

  • Set the minimum thread pool count, trying several values (ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);)
  • Increased the Redis timeout from the default 5 seconds to 15 seconds (to be honest, I don’t think going any higher will solve it, as you will read a bit further down :)); see the sketch after this list for how both settings are applied
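
For reference, a minimal sketch of how we apply both of those settings at startup. The endpoint, SSL flags and timeout values below are placeholders, not our real configuration; the timeouts are passed through the StackExchange.Redis configuration string.

    // Sketch only: endpoint and values are placeholders, not our real config.
    using System.Threading;
    using Microsoft.Extensions.DependencyInjection;

    public static class RedisSetupSketch
    {
        public static void Configure(IServiceCollection services)
        {
            // Raise the floor for worker/IOCP threads once, at process start.
            ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);

            // Timeouts raised from the 5000 ms default to 15000 ms via the
            // StackExchange.Redis configuration string.
            var connectionString =
                "contoso.redis.cache.windows.net:6380,ssl=true,abortConnect=false," +
                "syncTimeout=15000,asyncTimeout=15000";

            services.AddStackExchangeRedisCache(s => s.Configuration = connectionString);
        }
    }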

What is the setup you ask?

  • .NET Core 3.1 REST API running on the latest IIS with a 3 worker threads setting, on a 4-core Windows server with 16 GB of RAM (we don’t see any extremes in the monitoring regarding CPU or memory)
  • Connected to Azure Redis Cache, currently a Basic C5 with high network bandwidth and 23 GB of memory (it was a lower tier before, so we have already tried scaling up)
  • Pushing requests to an Azure Service Bus at the end (no problems there)

A batch process is running and processing tens of thousands of API calls (across several APIs), and the API mentioned above is the one crashing against the Redis Cache with the timeout exception. The other APIs are running correctly and not timing out, but they are currently connected to a different Redis cache (just to isolate this API's behavior).
All APIs and batch programs use a custom NuGet package that contains the cache implementation, so we are sure it can't be an implementation issue in that one API; it is all shared code.

How do we use the cache? Via dependency injection we inject ISharedCacheStore, which is just our own interface on top of IDistributedCache to make sure only asynchronous calls are available, together with RedisCache, the implementation that uses Redis (ISharedCacheStore exists for future use of other caching mechanisms).
We use Microsoft.Extensions.Caching.StackExchangeRedis, Version 3.1.5, and the registration in startup is:

 services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
            .AddStackExchangeRedisCache(s =>
                {
                    s.Configuration = connectionString;
                })
            .AddTransient<IRedisCache, RedisCache>()
            .AddTransient<ISharedCacheStore, SharedCacheStore>();
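
To give an idea of the shape of that wrapper, here is a simplified sketch. The member names are illustrative rather than the real code from our shared package, and the layering between SharedCacheStore and RedisCache is flattened here for brevity; the real implementation simply forwards to IDistributedCache's async methods.

    // Simplified sketch of the wrapper; names are illustrative, not the real package code.
    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Caching.Distributed;

    public interface ISharedCacheStore
    {
        Task<byte[]> GetAsync(string key, CancellationToken token = default);
        Task SetAsync(string key, byte[] value, TimeSpan? expiry = null, CancellationToken token = default);
    }

    public class SharedCacheStore : ISharedCacheStore
    {
        private readonly IDistributedCache _cache;

        public SharedCacheStore(IDistributedCache cache) => _cache = cache;

        public Task<byte[]> GetAsync(string key, CancellationToken token = default)
            => _cache.GetAsync(key, token);

        public Task SetAsync(string key, byte[] value, TimeSpan? expiry = null, CancellationToken token = default)
            => _cache.SetAsync(
                key,
                value,
                new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = expiry },
                token);
    }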

We are out of ideas, to be honest. We don’t see an issue with the Redis Cache instance in Azure, as it is not even near its peak when we get the timeouts. Server load hit about 80% on the lower pricing tier and didn’t even reach 10% on the current, higher one.

According to Insights, we had about 4,000 cache hits per minute on the run we did, which caused the roughly 10% server load.

UPDATE: It is worth mentioning that the batch and the API are currently running in an on-premise environment, not in the cloud; the move to the cloud is planned in the upcoming months.
This also applies to the other APIs connecting to a Redis Cache that are NOT giving an issue.

Comparison

  • Another Azure Redis cache is handling 45K hits a minute without any issue whatsoever (from on-premise)
  • This one hits the timeout mark without even reaching 10K hits per minute

2 Answers


  1. Chosen as BEST ANSWER

    So, we found the issue. It sits within the registration of our classes, which was AddTransient, as shown in the original code above. After changing this to AddScoped, performance is a lot faster; we are even wondering whether it could be a singleton. The weird thing is that AddTransient should only increase 'connected clients' (which it does, as a matter of fact), yet it also had a bigger impact on the number of requests that could be handled, since we never reached the max connections limit during processing.

    .AddScoped<IRedisCache, RedisCache>()
            .AddScoped<ISharedCacheStore, SharedCacheStore>();
    

    With this code instead of AddTransient, we did 220,000 operations in a 4-5 minute period without an issue, whereas with the old code we didn't even reach 40,000 operations because of timeout exceptions.
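
    For completeness, the full registration now looks roughly like this (sketch only); if the classes turn out to hold no per-request state, the singleton variant in the comments seems to be the logical next step, since, as far as we understand, AddStackExchangeRedisCache registers the underlying IDistributedCache as a singleton anyway.

        services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
            .AddStackExchangeRedisCache(s =>
                {
                    s.Configuration = connectionString;
                })
            // Scoped: one instance per request instead of one per resolution.
            .AddScoped<IRedisCache, RedisCache>()
            .AddScoped<ISharedCacheStore, SharedCacheStore>();

        // If the wrapper classes are stateless, singletons should be safe too:
        // .AddSingleton<IRedisCache, RedisCache>()
        // .AddSingleton<ISharedCacheStore, SharedCacheStore>();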


  2. There’s a couple of possible things here:

    1. I don’t know what that EVAL is doing; it could be that the Lua being executed is causing a blockage; the only way to know for sure would be to look at SLOWLOG (see the sketch after this list for querying it from the client), but I don’t know whether this is exposed on Azure Redis
    2. It could be that your payloads are saturating the available bandwidth – I don’t know what you are transferring
    3. It could simply be a network/socket stall/break; they happen, especially with cloud – and the (relatively) high latency makes this especially painful
    4. We want to enable a new optional pooled (rather than multiplexed) model; this would in theory (the proof-of-concept worked great) avoid large backlogs, which means even if a socket fails: only that one call is impacted, rather than causing a cascade of failure; the limiting factor on this one is our time (and also, this needs to be balanced with any licensing implications from the redis provider; is there an upper bound on concurrent connections, for example)
    5. It could simply be a bug in the library code; if so, we’re not seeing it here, but we don’t use the same setup as you; we do what we can, but it is very hard to diagnose problems that we don’t see, that only arise in someone else’s at-cost setup that we can’t readily replicate; plus ultimately: this isn’t our day job 🙁
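
    As a rough illustration of the SLOWLOG route, assuming you already have a ConnectionMultiplexer and that the server exposes SLOWLOG at all (which is the open question above); the endpoint and credentials below are placeholders:

        // Sketch: dump the server-side slow log via StackExchange.Redis.
        using System;
        using StackExchange.Redis;

        var muxer = ConnectionMultiplexer.Connect(
            "contoso.redis.cache.windows.net:6380,ssl=true,password=<secret>,allowAdmin=true");
        var server = muxer.GetServer("contoso.redis.cache.windows.net", 6380);

        // Most recent slow entries: when they ran, how long they took, and the command.
        foreach (var entry in server.SlowlogGet(10))
        {
            Console.WriteLine($"{entry.Time:u}  {entry.Duration.TotalMilliseconds} ms  " +
                              string.Join(" ", entry.Arguments));
        }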

    I don’t think there’s a simple "add this line and everything becomes great" answer here. These are non-trivial at-scale remote scenarios, that take a lot of investigation. And simply: the Azure folks don’t pay for our time.
