We have an issue fine-tuning Varnish memory usage.
This happens on our two EC2 instances of type t4g.medium (4 GB RAM).
Varnish memory usage keeps increasing until the instance crashes.
We tried limiting the malloc storage to 2048 MB (even though we found out that this doesn't really cap varnishd's memory usage) and lowering the number of minimum threads from 200 to 100.
These are the settings we currently have:
[Unit]
Description=Varnish Cache, a high-performance HTTP accelerator
After=network-online.target
[Service]
Type=simple
Environment="MALLOC_CONF=thp:never,narenas:2"
# Maximum number of open files (for ulimit -n)
LimitNOFILE=131072
# Locked shared memory - should suffice to lock the shared memory log
# (varnishd -l argument)
# Default log size is 80MB vsl + 1M vsm + header -> 82MB
# unit is bytes
LimitMEMLOCK=85983232
# Enable this to avoid "fork failed" on reload.
#TasksMax=infinity
# Maximum size of the corefile.
#LimitCORE=infinity
ExecStart=/usr/sbin/varnishd -j unix,user=vcache -F -a :80 -T :6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -p vcc_allow_inline_c=on -p feature=+esi_ignore_other_elements -p feature=+esi_disable_xml_check -p http_max_hdr=128 -p http_resp_hdr_len=42000 -p http_resp_size=74768 -p workspace_client=256k -p workspace_backend=256k -p feature=+esi_ignore_https -p thread_pool_min=50 -s malloc,2048m
ExecReload=/usr/sbin/varnishreload
ProtectSystem=full
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
[Install]
WantedBy=multi-user.target
And this is the varnishstat output:
MGT.uptime 0+03:50:37
MAIN.uptime 0+03:50:38
MAIN.sess_conn 11765
MAIN.client_req 93915
MAIN.cache_hit 11321
MAIN.cache_hitmiss 9224
MAIN.cache_miss 44954
MAIN.backend_conn 5675
MAIN.backend_reuse 74129
MAIN.backend_recycle 79616
MAIN.fetch_head 118
MAIN.fetch_length 10725
MAIN.fetch_chunked 50532
MAIN.fetch_none 5310
MAIN.fetch_304 13083
MAIN.fetch_failed 1
MAIN.pools 2
MAIN.threads 100
MAIN.threads_created 100
MAIN.busy_sleep 305
MAIN.busy_wakeup 305
MAIN.n_object 23839
MAIN.n_objectcore 23863
MAIN.n_objecthead 20855
MAIN.n_backend 3
MAIN.n_lru_nuked 602
MAIN.s_sess 11765
MAIN.s_pipe 33
MAIN.s_pass 34815
MAIN.s_fetch 79769
MAIN.s_synth 2856
MAIN.s_req_hdrbytes 219.91M
MAIN.s_req_bodybytes 9.47M
MAIN.s_resp_hdrbytes 49.07M
MAIN.s_resp_bodybytes 8.14G
MAIN.s_pipe_hdrbytes 24.02K
MAIN.s_pipe_out 4.25M
MAIN.sess_closed 1564
MAIN.sess_closed_err 9501
MAIN.backend_req 82081
MAIN.n_vcl 1
MAIN.bans 1
MAIN.vmods 2
MAIN.n_gzip 33598
MAIN.n_gunzip 29012
SMA.s0.c_req 352442
SMA.s0.c_fail 917
SMA.s0.c_bytes 4.79G
SMA.s0.c_freed 2.79G
SMA.s0.g_alloc 147586
SMA.s0.g_bytes 2.00G
SMA.s0.g_space 124.88K
SMA.Transient.c_req 262975
SMA.Transient.c_bytes 3.03G
SMA.Transient.c_freed 3.02G
SMA.Transient.g_alloc 13436
SMA.Transient.g_bytes 11.20M
VBE.boot.web_asg_10_0_2_23.happy ffffffffff
VBE.boot.web_asg_10_0_2_23.bereq_hdrbytes 64.35M
VBE.boot.web_asg_10_0_2_23.bereq_bodybytes 1.38M
VBE.boot.web_asg_10_0_2_23.beresp_hdrbytes 23.24M
VBE.boot.web_asg_10_0_2_23.beresp_bodybytes 1.08G
VBE.boot.web_asg_10_0_2_23.pipe_hdrbytes 9.65K
VBE.boot.web_asg_10_0_2_23.pipe_in 1.61M
VBE.boot.web_asg_10_0_2_23.conn 2
VBE.boot.web_asg_10_0_2_23.req 27608
VBE.boot.web_asg_10_0_1_174.happy ffffffffff
VBE.boot.web_asg_10_0_1_174.bereq_hdrbytes 65.10M
VBE.boot.web_asg_10_0_1_174.bereq_bodybytes 6.66M
VBE.boot.web_asg_10_0_1_174.beresp_hdrbytes 23.25M
VBE.boot.web_asg_10_0_1_174.beresp_bodybytes 1.13G
VBE.boot.web_asg_10_0_1_174.pipe_hdrbytes 5.54K
VBE.boot.web_asg_10_0_1_174.pipe_in 973.57K
VBE.boot.web_asg_10_0_1_174.conn 4
VBE.boot.web_asg_10_0_1_174.req 27608
VBE.boot.web_asg_10_0_3_248.happy ffffffffff
VBE.boot.web_asg_10_0_3_248.bereq_hdrbytes 64.92M
VBE.boot.web_asg_10_0_3_248.bereq_bodybytes 1.47M
VBE.boot.web_asg_10_0_3_248.beresp_hdrbytes 23.37M
VBE.boot.web_asg_10_0_3_248.beresp_bodybytes 1.12G
VBE.boot.web_asg_10_0_3_248.pipe_hdrbytes 10.33K
VBE.boot.web_asg_10_0_3_248.pipe_in 1.68M
VBE.boot.web_asg_10_0_3_248.conn 3
VBE.boot.web_asg_10_0_3_248.req 27609
Any idea how to solve this? Thanks a lot in advance.
Answers
Threads
Please watch out when lowering the thread settings. Varnish may use more than one thread per incoming request, depending on hits or misses.
My recommendation is to keep the default values and increase them if needed. Don't decrease them: if you get a traffic spike, you don't want to run out of threads, because that will slow down your Varnish massively.
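As a sketch of how you could watch thread usage and put the minimum back at its default at runtime (the persistent fix would be removing the -p thread_pool_min=50 flag from ExecStart; the counter and parameter names below are the standard ones):
# current/created thread counts, and how often more threads were needed but the pool limit was reached
varnishstat -1 -f MAIN.threads -f MAIN.threads_created -f MAIN.threads_limited
# restore the per-pool minimum to its default of 100 without a restart
varnishadm param.set thread_pool_min 100
varnishadm param.show thread_pool_min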
Workspace memory
I also noticed that you increased your workspace parameters. This also causes more memory consumption per thread; multiply that by the number of active threads and it can have an impact.
Since you lowered your thread pool settings, it won't cause many problems right now, but keeping such low thread_pool_min and thread_pool_max settings is tricky, as I explained earlier. Just prepare for the fact that your thread limits may eventually have to be increased.
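As a rough upper bound to put numbers on this: 100 threads x (256 KB client workspace + 256 KB backend workspace) is only about 50 MB, but at the default maximum of 2 pools x 5000 threads the same workspace sizes could account for roughly 5 GB, which is more than this instance has.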
Transient storage
Every now and then, there seems to be a lot of content in your transient storage.
The varnishstat output doesn't indicate a lot of transient usage right now (SMA.Transient.g_bytes is only 11.20M), but SMA.Transient.c_bytes shows that at some point a lot of bytes were allocated in that storage engine.
The transient storage is an unbounded memory space that is used to temporarily store content. Short-lived content with a TTL of less than 10 seconds is stored there, but uncacheable content is also kept in transient storage while it is being fetched.
Because Varnish supports both content buffering and content streaming, bytes transferred from the backend need to be stored temporarily while the client fetches them from Varnish.
The slower the client, the longer it takes for content to be freed from the transient storage.
Keep an eye on the SMA.Transient.g_bytes counter and see if it increases. If that coincides with Varnish running out of memory, you've spotted the root cause.
While it is possible to limit the size of the transient storage, the trade-off isn't any better: the server will stay online, but individual transactions will fail.
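To watch it, a sketch using varnishstat field selection (the glob pattern is my assumption; list the fields explicitly if your version doesn't accept it):
# print the Transient counters once; repeat, or wrap in watch, to see whether g_bytes keeps growing
varnishstat -1 -f 'SMA.Transient.*'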
A varnishlog command can help you find the requests that trigger content to be stored in transient storage. That might help you figure out why so much of it is used.
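A sketch of such a command, assuming the standard ReqURL, ReqMethod, TTL and Storage log tags and VSL query syntax:
# group log records per request and keep only transactions whose object ended up in Transient storage
varnishlog -g request -q 'Storage ~ "Transient"' -i ReqURL -i ReqMethod -i TTL -i Storage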
Why is content uncacheable?
The built-in VCL will prevent content from being served from the cache if:
- the request method is not GET or HEAD
- the request contains an Authorization header
- the request contains a Cookie header
Your own VCL code may also contain return(pass) calls that cause the bypass. The MAIN.s_pass counter indicates that quite a few passes have taken place.
Additionally, the backend responses coming from your origin server may become Hit-For-Miss. This results in caching the decision not to cache for about 2 minutes.
This happens when:
- the Cache-Control header contains values like private, no-cache or no-store
- a Set-Cookie header is part of the response
- a Vary header is used that has * as its value
See https://www.varnish-software.com/developers/tutorials/varnish-builtin-vcl/#hit-for-miss for more details on Hit-For-Miss in the built-in VCL.
The value of your MAIN.cache_hitmiss counter indicates that there is quite some uncacheable content, especially if you compare that value to MAIN.cache_hit.
If you spotted issues in your VCL that cause backend responses to become Hit-For-Miss, you might want to fix these.
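If you want to see which requests end up as Hit-For-Miss and which backend response headers trigger it, a sketch along these lines might help (it assumes the HitMiss log tag, available in recent Varnish versions, and the standard BerespHeader tag):
# show the URL and the relevant backend response headers for transactions that hit a Hit-For-Miss object
varnishlog -g request -q 'HitMiss' -i ReqURL -I 'BerespHeader:(Cache-Control|Set-Cookie|Vary)'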
More headroom
My main advice is to assign more memory to Varnish so it has more headroom.
Investigating the behavior of the transient storage and the requests responsible for its growth is your main priority.
If the conclusion is that all the uncacheable content is supposed to be there, you'll need to assign more memory on a permanent basis.
If your VCL contains issues that cause too much content to become uncacheable, fixing those might reduce the memory consumption on your Varnish server.
Assuring a constant memory footprint
In Varnish Cache, the open source version of Varnish, the object storage is constant and the rest of the memory consumption is variable.
In Varnish Enterprise, the commercial version of Varnish, the Massive Storage Engine has a feature called "the memory governor" that allows you to maintain a constant memory footprint on the server.
See https://docs.varnish-software.com/varnish-enterprise/features/mse/memory_governor/ to learn more about the memory governor.
See https://docs.varnish-software.com/varnish-enterprise/features/mse/ to learn more about the Massive Storage Engine in general.
The problem seems to be due to something weird with libjemalloc 5.2.1 on Linux.
As reported by other users, it is not caused by any of the parameters nor by the VCL: libjemalloc 5.2.1 simply has problems on Linux.
I don't know if it is a memory leak or fragmentation, but from preliminary tests, libjemalloc 5.3.0 seems to fix the memory usage over time.
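A quick way to check which jemalloc build varnishd is actually using, sketched for a Debian/Ubuntu-style system (adapt the package query to your distribution, and note that some varnishd builds bundle jemalloc statically):
# show the jemalloc shared object varnishd is dynamically linked against, if any
ldd /usr/sbin/varnishd | grep -i jemalloc
# show the installed jemalloc package version (Debian/Ubuntu)
dpkg -l | grep -i jemalloc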