I have a virtual machine with 64 vCPUs and 256GB of memory. Recently, I decided to perform some stress tests on the website that is running on this virtual machine. The entire VM is only for this website.
The first test I ran was with 20,000 users per second, and the average response time was around 1400ms. During the test, the site was not usable.
After that, I decided to inspect the top processes to identify the source of the problem. These are the processes and their CPU utilization during the test:
top - 10:30:19 up 1 day, 34 min, 0 users, load average: 8.39, 3.04, 1.46
Tasks: 711 total, 2 running, 709 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.0 us, 9.8 sy, 3.8 ni, 79.2 id, 0.2 wa, 0.0 hi, 0.9 si, 0.0 st
MiB Mem : 257925.6 total, 219425.1 free, 3658.2 used, 34842.3 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 252346.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
218159 mysql 20 0 6911232 96204 19792 S 491.4 0.0 4:24.99 mysqld
139405 nobody 20 0 54948 34196 6128 D 44.9 0.0 0:52.17 litespeed
218251 obl74+ 21 1 347708 29228 19328 S 40.9 0.0 0:20.83 lsphp
218402 obl74+ 21 1 347708 29152 19264 S 40.9 0.0 0:22.35 lsphp
218955 obl74+ 21 1 273004 21336 12472 D 40.9 0.0 0:22.39 lsphp
218957 obl74+ 21 1 273004 21336 12472 D 40.9 0.0 0:22.22 lsphp
218961 obl74+ 21 1 273004 21336 12472 S 40.9 0.0 0:22.37 lsphp
218963 obl74+ 21 1 273004 21328 12468 S 40.9 0.0 0:22.31 lsphp
218252 obl74+ 21 1 347708 29228 19328 D 40.5 0.0 0:22.42 lsphp
218407 obl74+ 21 1 347708 29152 19264 D 40.5 0.0 0:22.30 lsphp
218956 obl74+ 21 1 273004 21332 12472 S 40.5 0.0 0:20.73 lsphp
218959 obl74+ 21 1 273004 21336 12472 S 40.5 0.0 0:22.13 lsphp
Interestingly, despite the website’s poor performance during the test, neither the CPU nor memory usage was particularly high. Also, during the test, CyberPanel indicated a CPU usage of 19% and a memory usage of 2%. Therefore, I conclude that the server is not experiencing any resource constraints, as it is not utilizing all its CPU and memory. However, it is still lagging for some reason.
Then, I decided to remove the components related to MySQL from the page on which I performed the stress test. The outcome was much more stable.
top - 10:43:54 up 1 day, 47 min, 0 users, load average: 0.87, 1.23, 1.41
Tasks: 705 total, 5 running, 699 sleeping, 0 stopped, 1 zombie
%Cpu(s): 2.8 us, 1.0 sy, 0.4 ni, 95.2 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
MiB Mem : 257925.6 total, 218249.7 free, 3910.0 used, 35765.9 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 252098.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
139416 nobody 20 0 53200 32480 6128 S 18.3 0.0 0:47.00 litespeed
139402 nobody 20 0 52928 33308 7204 S 16.6 0.0 0:44.40 litespeed
139409 nobody 20 0 54900 34136 6188 S 16.6 0.0 0:46.38 litespeed
139410 nobody 20 0 49904 29156 6128 S 16.6 0.0 0:35.43 litespeed
139414 nobody 20 0 51688 30936 6128 R 16.6 0.0 0:45.46 litespeed
139415 nobody 20 0 55492 35280 6680 R 15.9 0.0 0:46.24 litespeed
139412 nobody 20 0 52112 31420 6188 S 15.6 0.0 0:45.05 litespeed
139404 nobody 20 0 50396 29644 6128 S 15.3 0.0 0:44.83 litespeed
139413 nobody 20 0 44700 23816 6128 S 15.3 0.0 0:21.83 litespeed
139406 nobody 20 0 50752 30004 6128 S 15.0 0.0 1:05.25 litespeed
According to CyberPanel, during the new test, the CPU usage was 4% and the memory usage was 2%.
Therefore, it is obvious that there is an issue with MySQL. I am currently using the default my.cnf configuration provided by CyberPanel, but I have attempted various other configurations found on the internet, yet nothing has improved the performance even a little bit. I’ve also tried stuff like MySQL Tuner but it didn’t change the performance.
The MySQL part that I removed for the second test was a basic query of a table that contained 7 rows. It verified the user’s IP address to determine if they were on the IP whitelist. This operation should not have posed a significant problem.
As observed in both tests, I detect a threshold or a bottleneck at the start, beyond which the site experiences a sharp increase in lag. Despite having ample free memory and CPU, there seems to be some limiting factor.
Some might argue that a rate of 20,000 users per second is excessive and unrealistic. However, even when I conducted the test with only 250 users per second, the outcome was the same: the website was extremely slow and not usable.
At this point I am totally lost. I am uncertain as to where to focus my efforts and what steps to take next to decrease the average response time. I would greatly appreciate any insightful comments or suggestions you may have and I thank you in advance for your time and consideration.
UPDATE
I have reinstalled the operating system and CyberPanel, and it appears that the problem has been resolved. Although I am uncertain about what went wrong previously, I suspect that an incorrect setting was responsible.
2
Answers
For 20K users per second, you need multiple servers and switches in front of them. Period. End of discussion.
Well, OK, I’ll discuss it further.
When MySQL is presented with lots of ‘simultaneous’ users, it plays fair with them — each is given equal access to all resources. This is fine until it falls off the cliff. This is when most of the processing is dealing with sharing of the resources. All the threads will eventually finish, but each will take a long time and you (the DBA) will think it crashed and pull the plug.
A simple cure is to lower (YES, lower) the value of
max_connections
. It turns out that the "cliff" is at a few dozen connections.Is benchmarking, one throws as much stuff at the server until it croaks. That’s usually a few dozen.
In real life web pages are not doing 100% database operations, they are letting the user react, building pages, etc. So, a
max_connections
of a few hundred is realistic.Once it reaches the cliff, latency goes through the roof. You would expect throughput to increase, too, but it decreases slightly. I believe that this is because the threads are stumbling over each other too much. Think about any "cache" (buffer_pool, open_tables, table_definitions, etc) — if "too many" threads are running, the caches may become ineffective.
Think about a market with so many shoppers that they spend most of their time juggling around other people. More shoppers per hour can get through the market if they keep shoppers from entering when it is "full".
max_connections
is that limiter.INDEXes needed
(Then take a crash course in the benefits of Indexes (aka "KEYs").)
Suggestions to consider for your CloudSQL configuration
Please view profile for contact info. Other performance enhancements available.