We have an issue fine-tuning Varnish memory usage.
This happens on our two EC2 instances of type t4g.medium (4 GB RAM).
Varnish memory usage keeps increasing until the instance crashes.
We tried limiting the malloc storage to 2048 MB (even though we found out that this doesn't really cap varnishd's memory usage) and lowering the number of minimum threads from 200 to 100.
These are the settings we currently have:
[Unit]
Description=Varnish Cache, a high-performance HTTP accelerator
After=network-online.target
[Service]
Type=simple
Environment="MALLOC_CONF=thp:never,narenas:2"
# Maximum number of open files (for ulimit -n)
LimitNOFILE=131072
# Locked shared memory - should suffice to lock the shared memory log
# (varnishd -l argument)
# Default log size is 80MB vsl + 1M vsm + header -> 82MB
# unit is bytes
LimitMEMLOCK=85983232
# Enable this to avoid "fork failed" on reload.
#TasksMax=infinity
# Maximum size of the corefile.
#LimitCORE=infinity
ExecStart=/usr/sbin/varnishd -j unix,user=vcache -F -a :80 -T :6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -p vcc_allow_inline_c=on -p feature=+esi_ignore_other_elements -p feature=+esi_disable_xml_check -p http_max_hdr=128 -p http_resp_hdr_len=42000 -p http_resp_size=74768 -p workspace_client=256k -p workspace_backend=256k -p feature=+esi_ignore_https -p thread_pool_min=50 -s malloc,2048m
ExecReload=/usr/sbin/varnishreload
ProtectSystem=full
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
[Install]
WantedBy=multi-user.target
And this is the varnishstat output:
MGT.uptime 0+03:50:37
MAIN.uptime 0+03:50:38
MAIN.sess_conn 11765
MAIN.client_req 93915
MAIN.cache_hit 11321
MAIN.cache_hitmiss 9224
MAIN.cache_miss 44954
MAIN.backend_conn 5675
MAIN.backend_reuse 74129
MAIN.backend_recycle 79616
MAIN.fetch_head 118
MAIN.fetch_length 10725
MAIN.fetch_chunked 50532
MAIN.fetch_none 5310
MAIN.fetch_304 13083
MAIN.fetch_failed 1
MAIN.pools 2
MAIN.threads 100
MAIN.threads_created 100
MAIN.busy_sleep 305
MAIN.busy_wakeup 305
MAIN.n_object 23839
MAIN.n_objectcore 23863
MAIN.n_objecthead 20855
MAIN.n_backend 3
MAIN.n_lru_nuked 602
MAIN.s_sess 11765
MAIN.s_pipe 33
MAIN.s_pass 34815
MAIN.s_fetch 79769
MAIN.s_synth 2856
MAIN.s_req_hdrbytes 219.91M
MAIN.s_req_bodybytes 9.47M
MAIN.s_resp_hdrbytes 49.07M
MAIN.s_resp_bodybytes 8.14G
MAIN.s_pipe_hdrbytes 24.02K
MAIN.s_pipe_out 4.25M
MAIN.sess_closed 1564
MAIN.sess_closed_err 9501
MAIN.backend_req 82081
MAIN.n_vcl 1
MAIN.bans 1
MAIN.vmods 2
MAIN.n_gzip 33598
MAIN.n_gunzip 29012
SMA.s0.c_req 352442
SMA.s0.c_fail 917
SMA.s0.c_bytes 4.79G
SMA.s0.c_freed 2.79G
SMA.s0.g_alloc 147586
SMA.s0.g_bytes 2.00G
SMA.s0.g_space 124.88K
SMA.Transient.c_req 262975
SMA.Transient.c_bytes 3.03G
SMA.Transient.c_freed 3.02G
SMA.Transient.g_alloc 13436
SMA.Transient.g_bytes 11.20M
VBE.boot.web_asg_10_0_2_23.happy ffffffffff
VBE.boot.web_asg_10_0_2_23.bereq_hdrbytes 64.35M
VBE.boot.web_asg_10_0_2_23.bereq_bodybytes 1.38M
VBE.boot.web_asg_10_0_2_23.beresp_hdrbytes 23.24M
VBE.boot.web_asg_10_0_2_23.beresp_bodybytes 1.08G
VBE.boot.web_asg_10_0_2_23.pipe_hdrbytes 9.65K
VBE.boot.web_asg_10_0_2_23.pipe_in 1.61M
VBE.boot.web_asg_10_0_2_23.conn 2
VBE.boot.web_asg_10_0_2_23.req 27608
VBE.boot.web_asg_10_0_1_174.happy ffffffffff
VBE.boot.web_asg_10_0_1_174.bereq_hdrbytes 65.10M
VBE.boot.web_asg_10_0_1_174.bereq_bodybytes 6.66M
VBE.boot.web_asg_10_0_1_174.beresp_hdrbytes 23.25M
VBE.boot.web_asg_10_0_1_174.beresp_bodybytes 1.13G
VBE.boot.web_asg_10_0_1_174.pipe_hdrbytes 5.54K
VBE.boot.web_asg_10_0_1_174.pipe_in 973.57K
VBE.boot.web_asg_10_0_1_174.conn 4
VBE.boot.web_asg_10_0_1_174.req 27608
VBE.boot.web_asg_10_0_3_248.happy ffffffffff
VBE.boot.web_asg_10_0_3_248.bereq_hdrbytes 64.92M
VBE.boot.web_asg_10_0_3_248.bereq_bodybytes 1.47M
VBE.boot.web_asg_10_0_3_248.beresp_hdrbytes 23.37M
VBE.boot.web_asg_10_0_3_248.beresp_bodybytes 1.12G
VBE.boot.web_asg_10_0_3_248.pipe_hdrbytes 10.33K
VBE.boot.web_asg_10_0_3_248.pipe_in 1.68M
VBE.boot.web_asg_10_0_3_248.conn 3
VBE.boot.web_asg_10_0_3_248.req 27609
Any idea how to solve this? Thanks a lot in advance.
Answers
Threads
Please watch out when lowering the thread settings. Varnish may use more than one thread per incoming request, depending on hits or misses.
My recommendation is to keep the default values and increase them if needed. Don't decrease them: if you get a traffic spike, you don't want to run out of threads, because that will slow down your Varnish massively.
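As a sketch of how you could watch thread usage and put the minimum back at its default at runtime (the persistent fix would be removing the -p thread_pool_min=50 flag from ExecStart; the counter and parameter names below are the standard ones):
# current/created thread counts, and how often more threads were needed but the pool limit was reached
varnishstat -1 -f MAIN.threads -f MAIN.threads_created -f MAIN.threads_limited
# restore the per-pool minimum to its default of 100 without a restart
varnishadm param.set thread_pool_min 100
varnishadm param.show thread_pool_min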
Workspace memory
I also noticed that you increased your workspace parameters. This also causes more memory consumption per thread; multiply that by the number of active threads and it can have an impact.
Since you lowered your thread pool settings, it won't cause many problems right now, but keeping such low thread_pool_min and thread_pool_max settings is tricky, as I explained earlier. Just prepare for the fact that your thread limits may eventually have to be increased.
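As a rough upper bound to put numbers on this: 100 threads x (256 KB client workspace + 256 KB backend workspace) is only about 50 MB, but at the default maximum of 2 pools x 5000 threads the same workspace sizes could account for roughly 5 GB, which is more than this instance has.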
Transient storage
Every now and then, there seems to be a lot of content in your transient storage.
The varnishstat output doesn't indicate a lot of transient usage right now (SMA.Transient.g_bytes is only 11.20M), but SMA.Transient.c_bytes shows that at some point a lot of bytes were allocated in that storage engine.
The transient storage is an unbounded memory space that is used to temporarily store content. Short-lived content with a TTL of less than 10 seconds is stored there, but uncacheable content is also kept in transient storage while it is being fetched.
Because Varnish supports both content buffering and content streaming, bytes transferred from the backend need to be stored temporarily while the client fetches them from Varnish.
The slower the client, the longer it takes for content to be freed from the transient storage.
Keep an eye on the SMA.Transient.g_bytes counter and see if it increases. If that coincides with Varnish running out of memory, you've spotted the root cause.
While it is possible to limit the size of the transient storage, the trade-off isn't any better: the server will stay online, but individual transactions will fail.
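To watch it, a sketch using varnishstat field selection (the glob pattern is my assumption; list the fields explicitly if your version doesn't accept it):
# print the Transient counters once; repeat, or wrap in watch, to see whether g_bytes keeps growing
varnishstat -1 -f 'SMA.Transient.*'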
A varnishlog command can help you find the requests that trigger content to be stored in transient storage. That might help you figure out why so much of it is used.
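A sketch of such a command, assuming the standard ReqURL, ReqMethod, TTL and Storage log tags and VSL query syntax:
# group log records per request and keep only transactions whose object ended up in Transient storage
varnishlog -g request -q 'Storage ~ "Transient"' -i ReqURL -i ReqMethod -i TTL -i Storage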
Why is content uncacheable?
The built-in VCL will prevent content from being served from the cache if:
- the request method is not GET or HEAD
- the request contains an Authorization header
- the request contains a Cookie header
Your own VCL code may also contain return(pass) calls that cause the bypass. The MAIN.s_pass counter indicates that quite a few passes have taken place.
Additionally, the backend responses coming from your origin server may become Hit-For-Miss. This results in caching the decision not to cache for about 2 minutes.
This happens when:
- the Cache-Control header contains values like private, no-cache or no-store
- a Set-Cookie header is part of the response
- a Vary header is used that has * as its value
See https://www.varnish-software.com/developers/tutorials/varnish-builtin-vcl/#hit-for-miss for more details on Hit-For-Miss in the built-in VCL.
The value of your MAIN.cache_hitmiss counter indicates that there is quite some uncacheable content, especially if you compare that value to MAIN.cache_hit.
If you spotted issues in your VCL that cause backend responses to become Hit-For-Miss, you might want to fix these.
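If you want to see which requests end up as Hit-For-Miss and which backend response headers trigger it, a sketch along these lines might help (it assumes the HitMiss log tag, available in recent Varnish versions, and the standard BerespHeader tag):
# show the URL and the relevant backend response headers for transactions that hit a Hit-For-Miss object
varnishlog -g request -q 'HitMiss' -i ReqURL -I 'BerespHeader:(Cache-Control|Set-Cookie|Vary)'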
More headroom
My main advice is to assign more memory to Varnish so it has more headroom.
Investigating the behavior of the transient storage and the requests responsible for its growth is your main priority.
If the conclusion is that all the uncacheable content is supposed to be there, you'll need to assign more memory on a permanent basis.
If your VCL contains issues that cause too much content to become uncacheable, fixing those might reduce the memory consumption on your Varnish server.
Assuring a constant memory footprint
In Varnish Cache, the open source version of Varnish, the object storage is constant and the rest of the memory consumption is variable.
In Varnish Enterprise, the commercial version of Varnish, the Massive Storage Engine has a feature called "the memory governor" that allows you to maintain a constant memory footprint on the server.
See https://docs.varnish-software.com/varnish-enterprise/features/mse/memory_governor/ to learn more about the memory governor.
See https://docs.varnish-software.com/varnish-enterprise/features/mse/ to learn more about the Massive Storage Engine in general.
The problem seems to be due to something weird with libjemalloc 5.2.1 on Linux.
As reported by other users, it is not caused by any of the parameters nor by the VCL: libjemalloc 5.2.1 simply has problems on Linux.
I don't know if it is a memory leak or fragmentation, but from preliminary tests, libjemalloc 5.3.0 seems to fix the memory usage over time.
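A quick way to check which jemalloc build varnishd is actually using, sketched for a Debian/Ubuntu-style system (adapt the package query to your distribution, and note that some varnishd builds bundle jemalloc statically):
# show the jemalloc shared object varnishd is dynamically linked against, if any
ldd /usr/sbin/varnishd | grep -i jemalloc
# show the installed jemalloc package version (Debian/Ubuntu)
dpkg -l | grep -i jemalloc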