I have the following infrastructure:
.NET Core 3.1 API hosted in a VNet. Inside the VNet we have 8 servers behind a load balancer, plus SQL Server and Redis Cache.
We are running an API load test at 1,200 operations per second against the login operation (which is not a lightweight operation). The load on all servers sits at 5-10%, but we are getting API timeouts and Redis timeouts.
It seems like something is blocking our threads.
This is from my Startup.cs (we have tried varying the value, with no success):
// Raise the pool minimums; note that SetMinThreads returns false (and
// changes nothing) if either value is out of range, so check the result.
var threadCount = 2000;
ThreadPool.GetMaxThreads(out _, out var completionThreads);
var applied = ThreadPool.SetMinThreads(threadCount, completionThreads);
This is from *.csproj file:
<PropertyGroup>
  <ThreadPoolMinThreads>315</ThreadPoolMinThreads>
</PropertyGroup>
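Note that the exception below reports WORKER Min=315, which suggests the csproj value is the one that actually took effect, not the 2000 set in Startup.cs. A minimal sketch (the class and method names are mine, not from the original post) that logs the effective thread-pool values at startup removes the guesswork:

```csharp
using System;
using System.Threading;

static class ThreadPoolCheck
{
    // Attempts to raise the worker-thread minimum, keeping the current IOCP
    // minimum, then reads back what actually took effect. SetMinThreads
    // returns false and changes nothing when a value is out of range,
    // which otherwise fails silently.
    public static (int Worker, int Iocp) RaiseWorkerMin(int workerMin)
    {
        ThreadPool.GetMinThreads(out _, out int iocpMin);
        if (!ThreadPool.SetMinThreads(workerMin, iocpMin))
            Console.WriteLine($"SetMinThreads({workerMin}, {iocpMin}) was rejected");
        ThreadPool.GetMinThreads(out int worker, out int iocp);
        return (worker, iocp);
    }
}
```

Calling this early in Program.Main and logging the returned pair shows whether the csproj setting, the Startup.cs call, or neither is winning.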
Update 1: Redis error information added.
Redis error:
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 10008ms elapsed, timeout is 10000ms), command=GET, next: SET key_digievents____freeevent_4072, inst: 0, qu: 0, qs: 684, aw: False, rs: ReadAsync, ws: Idle, in: 2197285, in-pipe: 0, out-pipe: 0, serverEndpoint: 10.0.0.34:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: akssocial27apiapp-xkkb4, IOCP: (Busy=0,Free=1000,Min=1000,Max=1000), WORKER: (Busy=430,Free=32337,Min=315,Max=32767), v: 2.1.58.34321
at Datadog.Trace.ClrProfiler.Integrations.StackExchange.Redis.ConnectionMultiplexer.ExecuteAsyncImplInternal[T](Object multiplexer, Object message, Object processor, Object state, Object server, Func`6 originalMethod)
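The message itself is worth decoding: qs: 684 means 684 commands had been sent to Redis and were still awaiting replies, and WORKER Busy=430 against Min=315 means the pool had to grow past its minimum, which .NET throttles, so this is consistent with thread-pool starvation delaying reads. As an illustration only (this helper is mine, not part of StackExchange.Redis), the numeric fields can be pulled out of the message like this:

```csharp
using System;
using System.Text.RegularExpressions;

static class RedisTimeoutFields
{
    // Extracts a named numeric field (e.g. "qs", "Busy", "Min") from a
    // RedisTimeoutException message. The field names follow the
    // StackExchange.Redis timeout message format; returns -1 if absent.
    public static int Field(string message, string name)
    {
        var m = Regex.Match(message, Regex.Escape(name) + @"[:=]\s*(\d+)");
        return m.Success ? int.Parse(m.Groups[1].Value) : -1;
    }
}
```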
I will be glad for any advice.
Thanks in advance.
2 Answers
Below I enclose my notes on the question I asked. I hope this helps someone and saves them a lot of time.
First of all, we had no exceptions, errors, or reports showing that bandwidth was a bottleneck in the Azure infrastructure; that was just our assumption. To rule it out, we increased capacity so much that even the MS Azure team said we were over-provisioned relative to our usage. So bandwidth was never the issue. The real limitations were:
The StackExchange.Redis NuGet package, especially when it handles larger data payloads.
Our analysis showed we were calling lots of unnecessary endpoints, or fetching data that the pages in question did not need.
So from my point of view, we needed to figure out how to optimize our use of the StackExchange.Redis package so it handles only the minimum necessary data, and how to reduce calls to unwanted endpoints from the front end.
My teammate got in touch with the developer of StackExchange.Redis. He admitted that the library has limitations with huge data chunks, and told us we were not the only ones experiencing this issue.
After all the discussions, we optimized our GET/SET operations to handle data in a more precise and effective way for the necessary endpoints, cutting off all unwanted endpoints and calls. We also implemented compression on the backend side.
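The original post does not say which compression scheme was used; a minimal sketch of the general idea, assuming GZip over UTF-8 strings, is to shrink the payload before SET and inflate it after GET so large cached objects cost fewer bytes on the wire:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class CachePayload
{
    // Compress a string before storing it in the cache.
    public static byte[] Compress(string value)
    {
        var raw = Encoding.UTF8.GetBytes(value);
        using var ms = new MemoryStream();
        using (var gz = new GZipStream(ms, CompressionLevel.Fastest))
            gz.Write(raw, 0, raw.Length);
        return ms.ToArray();
    }

    // Decompress a payload read back from the cache.
    public static string Decompress(byte[] payload)
    {
        using var ms = new MemoryStream(payload);
        using var gz = new GZipStream(ms, CompressionMode.Decompress);
        using var reader = new StreamReader(gz, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

The trade-off is CPU on the API servers for fewer bytes through the multiplexed Redis connection, which is usually a win when large values are causing timeouts.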
On top of this, we added additional regions, which improved the situation further.
Nevertheless, we are considering dropping Redis in favor of a NoSQL solution, which would resolve several issues at once: cache issues, real-time data, and SQL limitations on big data. (So far this is only an idea.)
P.S. Factors impacting Redis performance:
Many people hit TimeoutException when they upgrade to 2.x:
https://github.com/StackExchange/StackExchange.Redis/issues/1226
This suggestion from that thread might help you:
"Are you seeing a high number of busyio or busyworker threads in the timeout exception?"
At the end of the post, it says: