
We’re having issues with Elasticsearch crashing from time to time. It also sometimes spikes RAM and CPU usage, and the server becomes unresponsive.

We have left most of the settings at their defaults, but had to raise the JVM heap to 48GB to stop it from crashing frequently.

I started digging, and apparently the heap should stay below 32GB. We’ll tweak that.
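
As a rough sketch of the usual sizing guidance (no more than half of physical RAM, and below the roughly 32GB point where the JVM stops using compressed object pointers), a small Python helper can compute a starting heap value. The 31GB cap and the function names are assumptions chosen for illustration; the exact cutoff depends on the JVM.

```python
from typing import List


def suggested_heap_gb(total_ram_gb: int, cap_gb: int = 31) -> int:
    """Return a starting heap size in GB: half of RAM, capped below ~32 GB.

    cap_gb=31 is an assumed safety margin under the compressed-pointer
    threshold, not an exact limit.
    """
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM")
    return min(total_ram_gb // 2, cap_gb)


def jvm_heap_flags(heap_gb: int) -> List[str]:
    # Min and max are set to the same value, as the jvm.options comments advise.
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


if __name__ == "__main__":
    heap = suggested_heap_gb(125)  # RAM of the server described in this post
    print("\n".join(jvm_heap_flags(heap)))
```

The rest of the RAM is not wasted: the operating system uses it as file system cache for the index files.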

The server is:

CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x Raid 1 NVME

There is more than enough hardware to handle something like this, but something tells me more configuration is needed to handle this much data.

We’re running a Magento 2.4.3 CE store with about 400,000 products.

Here are all of our config files:

jvm.options file

    ## JVM configuration
    
    ################################################################
    ## IMPORTANT: JVM heap size
    ################################################################
    ##
    ## You should always set the min and max JVM heap
    ## size to the same value. For example, to set
    ## the heap to 4 GB, set:
    ##
    ## -Xms4g
    ## -Xmx4g
    ##
    ## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
    ## for more information
    ##
    ################################################################
    
    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space
    
    -Xms48g
    -Xmx48g
    
    ################################################################
    ## Expert settings
    ################################################################
    ##
    ## All settings below this section are considered
    ## expert settings. Don't tamper with them unless
    ## you understand what you are doing
    ##
    ################################################################
    
    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    # 10-13:-XX:-UseConcMarkSweepGC
    # 10-13:-XX:-UseCMSInitiatingOccupancyOnly
    14-:-XX:+UseG1GC
    14-:-XX:G1ReservePercent=25
    14-:-XX:InitiatingHeapOccupancyPercent=30
    
    ## DNS cache policy
    # cache ttl in seconds for positive DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.ttl; set to -1 to cache forever
    -Des.networkaddress.cache.ttl=60
    # cache ttl in seconds for negative DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.negative ttl; set to -1 to cache
    # forever
    -Des.networkaddress.cache.negative.ttl=10
    
    ## optimizations
    
    # pre-touch memory pages used by the JVM during initialization
    -XX:+AlwaysPreTouch
    
    ## basic
    
    # explicitly set the stack size
    -Xss1m
    
    # set to headless, just in case
    -Djava.awt.headless=true
    
    # ensure UTF-8 encoding by default (e.g. filenames)
    -Dfile.encoding=UTF-8
    
    # use our provided JNA always versus the system one
    -Djna.nosys=true
    
    # turn off a JDK optimization that throws away stack traces for common
    # exceptions because stack traces are important for debugging
    -XX:-OmitStackTraceInFastThrow
    
    # enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
    # they are supported
    14-:-XX:+ShowCodeDetailsInExceptionMessages
    
    # flags to configure Netty
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    
    # log4j 2
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    
    -Djava.io.tmpdir=${ES_TMPDIR}
    
    ## heap dumps
    
    # generate a heap dump when an allocation from the Java heap fails
    # heap dumps are created in the working directory of the JVM
    -XX:+HeapDumpOnOutOfMemoryError
    
    # specify an alternative path for heap dumps; ensure the directory exists and
    # has sufficient space
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # specify an alternative path for JVM fatal error logs
    -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
    
    ## JDK 8 GC logging
    
    8:-XX:+PrintGCDetails
    8:-XX:+PrintGCDateStamps
    8:-XX:+PrintTenuringDistribution
    8:-XX:+PrintGCApplicationStoppedTime
    8:-Xloggc:/var/log/elasticsearch/gc.log
    8:-XX:+UseGCLogFileRotation
    8:-XX:NumberOfGCLogFiles=32
    8:-XX:GCLogFileSize=64m
    
    # JDK 9+ GC logging
    9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    # due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
    # time/date parsing will break in an incompatible way for some date patterns and locals
    9-:-Djava.locale.providers=COMPAT
    
    # temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
    10-:-XX:UseAVX=2


**elasticsearch.yml**

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes: 
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true

From my research, the RAM + CPU spike could be due to these settings not being set:

gateway.expected_nodes: 10
gateway.recover_after_time: 5m

Here is some other data from Elasticsearch:

curl -XGET --user username:password http://localhost:9200/

{
  "name" : "web1.example.com",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

curl --user username:password -sS 'http://localhost:9200/_cluster/health?pretty'

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}

curl --user username:password -sS 'http://localhost:9200/_cluster/allocation/explain?pretty'

{
  "index" : "example-amasty_product_1_v156",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-09-14T16:52:28.854Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2THEUTSaQdmOJAAhTTN71g",
      "node_name" : "web1.example.com",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "134622244864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "51539607552"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node"
        }
      ]
    }
  ]
}

The issue is that I don’t know how to set up multiple nodes on one machine.

The misconfiguration, from what I understand, is that we’re running only one node. From my reading, 3 master nodes are required for green status.

How do I set up multiple nodes on a single machine, and do I need to increase the number of data nodes?
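
To make the `same_shard` decision above concrete, here is a simplified Python model of how replica counts and node counts turn into a health status. It assumes only one rule (a node holds at most one copy of a given shard) and ignores disk watermarks and allocation filters. The index layout in the example (four indices with one replica each, one index with none) is a guess that happens to reproduce the numbers in the health output; it is not taken from the cluster.

```python
from typing import Dict, List, Tuple


def cluster_health(indices: List[Tuple[int, int]], data_nodes: int) -> Dict[str, object]:
    """Estimate shard counts and status.

    indices: one (primary_shards, replicas_per_primary) tuple per index.
    """
    total = 0
    active = 0
    for primaries, replicas in indices:
        copies = 1 + replicas
        # A node can hold only one copy of the same shard.
        placeable = min(copies, data_nodes)
        total += primaries * copies
        active += primaries * placeable
    unassigned = total - active
    if data_nodes == 0:
        status = "red"
    elif unassigned == 0:
        status = "green"
    else:
        status = "yellow"  # all primaries active, some replicas unassigned
    percent = 100.0 * active / total if total else 100.0
    return {
        "status": status,
        "active_shards": active,
        "unassigned_shards": unassigned,
        "active_percent": percent,
    }


if __name__ == "__main__":
    assumed_layout = [(1, 1)] * 4 + [(1, 0)]  # hypothetical index layout
    print(cluster_health(assumed_layout, data_nodes=1))
```

In this model the status depends only on whether every shard copy has a node to live on, so a single node goes green once no index asks for replicas; the number of master-eligible nodes does not enter into it.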

My main suspicions:

  • not enough master / data nodes
  • newer Garbage Collector is having issues (G1GC is enabled; I wasn’t sure how to determine which one is currently enabled from the config). ALREADY FIGURED IT OUT: G1 is used.
  • no recovery setup in case of crash (gateway.expected_nodes, gateway.recover_after_time)
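
On the question of which collector the config selects: lines in jvm.options can carry a JDK version prefix (`8:`, `8-13:`, `14-:`), and only the lines matching the running JDK apply. The following Python sketch evaluates those prefixes for a given JDK major version. The helper names are made up for illustration, and the sample lines are copied from the GC section of the file posted above.

```python
import re
from typing import List

# "8:" (exact), "8-13:" (range), "14-:" (open-ended), followed by the flag.
_VERSIONED = re.compile(r"^(\d+)(?:(-)(\d*))?:(-.+)$")

SAMPLE = [
    "8-13:-XX:+UseConcMarkSweepGC",
    "8-13:-XX:CMSInitiatingOccupancyFraction=75",
    "8-13:-XX:+UseCMSInitiatingOccupancyOnly",
    "# 10-13:-XX:-UseConcMarkSweepGC",
    "14-:-XX:+UseG1GC",
    "14-:-XX:G1ReservePercent=25",
    "14-:-XX:InitiatingHeapOccupancyPercent=30",
    "-XX:+AlwaysPreTouch",
]


def active_flags(lines: List[str], jdk_major: int) -> List[str]:
    """Return the flags that apply to the given JDK major version."""
    flags = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("-"):
            flags.append(line)  # unprefixed flags always apply
            continue
        match = _VERSIONED.match(line)
        if not match:
            continue
        low = int(match.group(1))
        if match.group(2) is None:
            high = low  # "8:" means exactly JDK 8
        elif match.group(3) == "":
            high = None  # "14-:" means JDK 14 and later
        else:
            high = int(match.group(3))
        if jdk_major >= low and (high is None or jdk_major <= high):
            flags.append(match.group(4))
    return flags


def which_gc(flags: List[str]) -> str:
    if "-XX:+UseG1GC" in flags:
        return "G1"
    if "-XX:+UseConcMarkSweepGC" in flags:
        return "CMS"
    return "JVM default"


if __name__ == "__main__":
    for jdk in (11, 16):
        print(jdk, which_gc(active_flags(SAMPLE, jdk)))
```

This only reads the file. To confirm what a running node actually uses, the node info API (`GET _nodes/jvm`) should list the collectors and the JVM input arguments.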

UPDATE:

Here is the error log from elasticsearch.log

https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=

Apologies, the log file did not fit into the Stack Overflow post 🙂

Pastebin:

Part #1: https://pastebin.com/86sLM9BD
Part #2: https://pastebin.com/1VEn63TQ

UPDATE:

OUTPUT OF: _cluster/stats?pretty&human

https://pastebin.com/EM8ZMVst

UPDATE:

Figured out how to limit the number of replicas.

This can be done via templates:

PUT _template/all
{
  "template": "*",
  "settings": {
    "number_of_replicas": 0
  }
}

I will test tomorrow whether it has an effect and turns the status green.

I don’t think it will do anything performance-wise, but we’ll see.
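
One caveat: templates are applied when an index is created, so indices that already exist keep their replica count until their settings are updated directly. The Python sketch below builds both requests with the standard library. It assumes the same local node and basic-auth credentials as the curl calls above, and it uses `index_patterns`, the field name the 7.x legacy template API documents for the match pattern; treat the exact bodies as a starting point rather than a verified fix.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: same local node as the curl calls


def _request(method: str, path: str, body: dict, user: str, password: str) -> urllib.request.Request:
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{ES_URL}{path}",
        data=json.dumps(body).encode(),
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def template_request(user: str, password: str) -> urllib.request.Request:
    # Affects indices created after the template exists.
    body = {"index_patterns": ["*"], "settings": {"number_of_replicas": 0}}
    return _request("PUT", "/_template/all", body, user, password)


def existing_indices_request(user: str, password: str) -> urllib.request.Request:
    # Updates the replica count on indices that already exist.
    body = {"index": {"number_of_replicas": 0}}
    return _request("PUT", "/_all/_settings", body, user, password)
```

Nothing is sent until a request object is passed to `urllib.request.urlopen(...)`, so the bodies can be inspected first.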

I’m working through other suggestions:

  • Limited the JVM heap to 31GB
  • File descriptor is already set to 65535
  • Maximum number of threads is already set to 4096
  • Maximum size virtual memory check is already increased and configured
  • Maximum map count bumped to 262144
  • G1GC is disabled (by default)
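
The numeric limits in this list can be checked with a short Python sketch. The thresholds are the ones quoted above; the function and key names are made up for illustration, and the virtual memory check is left out for brevity. Note that `read_linux_limits` reports the limits of the process that runs the script, which may differ from those of a service started by systemd.

```python
from typing import Dict, List

# Minimums from the list above; an observed value of -1 stands for "unlimited".
REQUIRED = {
    "max_file_descriptors": 65535,
    "max_threads": 4096,
    "vm_max_map_count": 262144,
}


def failed_checks(observed: Dict[str, int]) -> List[str]:
    """Return a description of every limit that is below its minimum."""
    failures = []
    for name, minimum in REQUIRED.items():
        value = observed.get(name)
        if value is None:
            failures.append(f"{name}: not measured")
        elif value != -1 and value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def read_linux_limits() -> Dict[str, int]:
    """Read the soft limits of the current process (Linux only)."""
    import resource

    def soft(limit: int) -> int:
        value = resource.getrlimit(limit)[0]
        return -1 if value == resource.RLIM_INFINITY else value

    with open("/proc/sys/vm/max_map_count") as handle:
        max_map_count = int(handle.read().strip())
    return {
        "max_file_descriptors": soft(resource.RLIMIT_NOFILE),
        "max_threads": soft(resource.RLIMIT_NPROC),
        "vm_max_map_count": max_map_count,
    }
```

Running `failed_checks(read_linux_limits())` as the elasticsearch user gives a first approximation; an empty list means the three limits meet the quoted minimums.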

One thing I’m trying is to reduce the:

8-13:-XX:CMSInitiatingOccupancyFraction=75

to

8-13:-XX:CMSInitiatingOccupancyFraction=70

I believe this will make garbage collection start earlier and prevent out-of-memory errors. We’ll try adjusting this up/down to see if it helps.

Switch to G1GC

I realize that this is not really encouraged, but there are articles about similar out-of-memory issues where switching to G1GC helped resolve them: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181

This is going to be the last thing I’m going to try.

UPDATE:

After all these changes, the index is finally green (the template fix worked).

It has also not had any issues overnight. It’s not as snappy as it was with 50GB of RAM, but at least it’s stable.

General advice for future Elasticsearch troubleshooters: go through the bootstrap checks. This will at least give you a performance baseline.

UPDATE: Found an issue with the JVM picking up settings from two locations and using them for different purposes.

Looks like the sysadmin put heap_size.options into

/etc/elasticsearch/jvm.options.d

with a JVM heap setting of 31GB, but the main jvm.options file was showing 8GB. This appeared to affect the GC threads, which were running with only 8GB of RAM (yet all 31GB of RAM was still taken up).

I removed the file and set 31GB in the jvm.options file.
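
To see which heap values win when they are set in more than one place, the following Python sketch merges the option lines in order. It assumes that the main jvm.options file is read first, that files in jvm.options.d ending in `.options` follow in lexicographic order, and that the JVM honors the last occurrence of a repeated flag; the helper name is made up for illustration.

```python
import re
from typing import Dict, List, Optional

_HEAP = re.compile(r"^-(Xms|Xmx)(\d+[kKmMgG]?)$")


def effective_heap(
    main_lines: List[str], extra_files: Dict[str, List[str]]
) -> Dict[str, Optional[str]]:
    """Return the last -Xms and -Xmx values seen across all option files."""
    ordered = list(main_lines)
    for name in sorted(extra_files):
        if name.endswith(".options"):  # other file names are skipped
            ordered.extend(extra_files[name])
    result: Dict[str, Optional[str]] = {"Xms": None, "Xmx": None}
    for raw in ordered:
        match = _HEAP.match(raw.strip())
        if match:
            result[match.group(1)] = match.group(2)  # later values overwrite earlier ones
    return result


if __name__ == "__main__":
    main_file = ["-Xms8g", "-Xmx8g"]
    overrides = {"heap_size.options": ["-Xms31g", "-Xmx31g"]}
    print(effective_heap(main_file, overrides))
```

Keeping the heap setting in exactly one file avoids this ambiguity, whichever file that is.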

This somewhat stabilized the situation, but GC is still collecting at a high rate.

As soon as I added any attributes to the list to index, GC overran the memory again.

The only thing that fixes this is deleting the index and reindexing.

I’m at the point where I’m thinking of nuking the whole Elasticsearch install and then doing it myself.

This should not be this difficult.

2 Answers


  1. Chosen as BEST ANSWER

    We have solved the issue. The problem was a bad install.

    Something was not working correctly (still don't know what the exact issue was).

    Both ES and Java have been re-installed. I have matched ES to the specific version that is working in my development environment.

    [Image not preserved]

    You can see here that GC is finally working correctly.

    We also got ES directly from the source. The previous install was from some random repo.

    I threw in all the attributes that were needed by the company and it didn't even notice - stable and fast.

    Thanks to everybody who helped me go through the steps, as I wouldn't have nuked the ES install without knowing I had done everything possible to stabilize it.

    This also gave me a lesson in configuring ES :)


  2. a few things

    • high cpu or memory use will not be due to not setting those gateway settings, and as a single node cluster they are somewhat irrelevant
    • we recommend keeping heap <32GB, see https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
    • you can never allocate replica shards on the same node as the primary. thus for a single node cluster you either need to remove replicas (risky), or add another (ideally) 2 nodes to the cluster
    • setting up a multiple node cluster on the same host is a little pointless. sure your replicas will be allocated, but if you lose the host you lose all of the data anyway

    I’d suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/bootstrap-checks.html and applying the settings it talks about, because even if you are running a single node those are what we refer to as production-ready settings

    that aside, do you have Monitoring enabled? what do your Elasticsearch logs show? what about hot threads? or slow logs?

    (and to be pedantic, it’s Elasticsearch, the s is not camelcase ;))
