I need some help. Our service uses Lettuce 5.1.6 and is deployed on a total of 22 Docker nodes.
Whenever the service is deployed, several Docker nodes report: ERROR: READONLY You can't write against a read only slave.
After restarting the affected Docker nodes, the error no longer appears.

  • redis server configuration:

8 master 8 slave
stop-writes-on-bgsave-error no
slave-serve-stale-data yes
slave-read-only yes
cluster-enabled yes
cluster-config-file "/data/server/redis-cluster/{port}/conf/node.conf"

  • lettuce configuration:
ClientResources res = DefaultClientResources.builder()
        .commandLatencyPublisherOptions(
                DefaultEventPublisherOptions.builder()
                        .eventEmitInterval(Duration.ofSeconds(5))
                        .build()
        )
        .build();
redisClusterClient = RedisClusterClient.create(res, REDIS_CLUSTER_URI);
redisClusterClient.setOptions(
        ClusterClientOptions.builder()
                .maxRedirects(99)
                .socketOptions(SocketOptions.builder().keepAlive(true).build())
                .topologyRefreshOptions(
                        ClusterTopologyRefreshOptions.builder()
                                .enableAllAdaptiveRefreshTriggers()
                                .build())
                .build());
RedisAdvancedClusterCommands<String, String> command = redisClusterClient.connect().sync();
command.setex("some key", 18000, "some value");
  • The Exception that appears:
io.lettuce.core.RedisCommandExecutionException: READONLY You can't write against a read only slave.
    at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135)
    at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:122)
    at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:123)
    at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
    at com.sun.proxy.$Proxy135.setex(Unknown Source)
    at com.xueqiu.infra.redis4.RedisClusterImpl.lambda$setex$164(RedisClusterImpl.java:1489)
    at com.xueqiu.infra.redis4.RedisClusterImpl$$Lambda$1422/1017847781.apply(Unknown Source)
    at com.xueqiu.infra.redis4.RedisClusterImpl.execute(RedisClusterImpl.java:526)
    at com.xueqiu.infra.redis4.RedisClusterImpl.executeTotal(RedisClusterImpl.java:491)
    at com.xueqiu.infra.redis4.RedisClusterImpl.setex(RedisClusterImpl.java:1489)

3 Answers

  1. Chosen as BEST ANSWER
    1. With distributed middleware, the client usually keeps partitioning and sharding metadata locally and does the routing itself.

    Lettuce does this for the Redis Cluster slot mapping:

    It keeps an array, slotCache, that caches the node responsible for each slot.

    When a key has to be read or written, the client computes its slot with CRC16 (modulo 16384) and looks up the node for that slot in the cache.

    2. On the server side, each Redis Cluster node records the slot-to-node mapping in its local node.conf.

    Nodes exchange this metadata in ping/pong messages over the gossip protocol, so that all nodes eventually converge on the same view.

    However, if the slot mapping on the server side is wrong, the client will use that wrong data.

    That is what happened here. Some server nodes mapped slots to a slave, so the client's cached mapping for those slots pointed at the slave node, and reads and writes were sent to the slave, which rejected the writes.
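
    As an illustration of the client-side slot calculation described above, here is a minimal stand-alone sketch. It does not use Lettuce's own classes; `KeySlot`, `crc16` and `slotOf` are names I made up for this example, and the hash-tag handling is simplified to plain strings.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of Redis Cluster key-to-slot hashing: CRC16 (XMODEM variant) modulo 16384. */
final class KeySlot {

    static final int SLOT_COUNT = 16384;

    /** Bitwise CRC16 with polynomial 0x1021 and initial value 0. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF; // keep only the low 16 bits
            }
        }
        return crc;
    }

    /** Returns the slot for a key; if the key has a non-empty {hash tag}, only the tag is hashed. */
    static int slotOf(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }
}
```

    With this sketch, `KeySlot.slotOf("foo")` gives 12182, the same value `CLUSTER KEYSLOT foo` returns on the server. The client then only needs an array lookup to find the node for that slot, which is why a wrong entry in the array silently misroutes every key in the slot.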


  2. lettuce source code investigation

    1. Lettuce initialization: Partitions.java

        /**
            * Update the partition cache. Updates are necessary after the partition details have changed.
            */
        public void updateCache() {
    
            synchronized (partitions) {
    
                if (partitions.isEmpty()) {
                    this.slotCache = EMPTY;
                    this.nodeReadView = Collections.emptyList();
                    return;
                }
    
                RedisClusterNode[] slotCache = new RedisClusterNode[SlotHash.SLOT_COUNT];
                List<RedisClusterNode> readView = new ArrayList<>(partitions.size());
    
                for (RedisClusterNode partition: partitions) {
    
                    readView.add(partition);
                    for (Integer integer: partition.getSlots()) {
                        slotCache[integer.intValue()] = partition;
                    }
                }
    
                this.slotCache = slotCache;
                this.nodeReadView = Collections.unmodifiableCollection(readView);
            }
        }
    

    2. Lettuce sending a command: PooledClusterConnectionProvider.java

        private CompletableFuture<StatefulRedisConnection<K, V>> getWriteConnection(int slot) {
    
            CompletableFuture<StatefulRedisConnection<K, V>> writer;// avoid races when reconfiguring partitions.
            synchronized (stateLock) {
                writer = writers[slot];
            }
    
            if (writer == null) {
                RedisClusterNode partition = partitions.getPartitionBySlot(slot);
                if (partition == null) {
                    clusterEventListener.onUncoveredSlot(slot);
                    return Futures.failed(new PartitionSelectorException("Cannot determine a partition for slot "+ slot + ".",
                            partitions.clone()));
                }
    
                // Use always host and port for slot-oriented operations. We don't want to get reconnected on a different
                // host because the nodeId can be handled by a different host.
                RedisURI uri = partition.getUri();
                ConnectionKey key = new ConnectionKey(Intent.WRITE, uri.getHost(), uri.getPort());
    
                ConnectionFuture<StatefulRedisConnection<K, V>> future = getConnectionAsync(key);
    
                return future.thenApply(connection -> {
    
                    synchronized (stateLock) {
                        if (writers[slot] == null) {
                            writers[slot] = CompletableFuture.completedFuture(connection);
                        }
                    }
    
                    return connection;
                }).toCompletableFuture();
            }
    
            return writer;
        }
    

    How Lettuce sends a command:

    1. When the client starts, it loads the topology and stores the slot-to-node mapping locally in the slotCache array.
    2. When sending, it computes the CRC16 of the key, uses the resulting slot to look up the node in slotCache, and then obtains a connection to that node.
    3. Note that nearly all middleware with this kind of cluster mode works the same way: the client fetches the server's network topology and then computes the routing on the client side.
      For comparison, see the performance analysis of Kafka across data centers.
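
    To make steps 1 and 2 concrete, here is a simplified stand-in for the slot cache. It is only a sketch: `SlotCacheSketch` and `Node` are hypothetical names, not Lettuce classes. It copies the behavior visible in updateCache() quoted above, where every node that reports slots is written into the array and the node's role is not checked. The sample topology is the view reported by 10.10.28.3:25662 in the `cluster nodes` output further down.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the slot cache built in Partitions.updateCache(). */
final class SlotCacheSketch {

    /** Hypothetical node view: address, role, and the inclusive slot range it claims. */
    static final class Node {
        final String address;
        final boolean master;
        final int firstSlot;
        final int lastSlot;

        Node(String address, boolean master, int firstSlot, int lastSlot) {
            this.address = address;
            this.master = master;
            this.firstSlot = firstSlot;
            this.lastSlot = lastSlot;
        }
    }

    /** Fills a 16384-entry array; as in updateCache(), a later node overwrites an earlier one. */
    static Node[] build(List<Node> topology) {
        Node[] cache = new Node[16384];
        for (Node node : topology) {
            for (int slot = node.firstSlot; slot <= node.lastSlot; slot++) {
                cache[slot] = node; // no check whether the node is a master
            }
        }
        return cache;
    }

    /** Part of the topology as seen by the slave 10.10.28.3:25662, which claims slots 0-1820. */
    static List<Node> viewFromInconsistentSlave() {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("10.10.28.3:25662", false, 0, 1820));   // myself,slave ... 0-1820
        nodes.add(new Node("10.10.25.3:25681", true, 1821, 2047)); // master ... 1821-2047
        nodes.add(new Node("10.10.30.9:25671", true, 2048, 4095)); // master ... 2048-4095
        return nodes;
    }
}
```

    Building the cache from `viewFromInconsistentSlave()` leaves every slot from 0 to 1820 pointing at a slave, so a write for any key hashing into that range would be sent to a node that answers READONLY.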

    Redis Cluster information troubleshooting

    ./bin/redis-cli -h 10.10.28.2 -p 25661 cluster info

    cluster_state:ok
    cluster_slots_assigned:16384
    cluster_slots_ok:16384
    cluster_slots_pfail:0
    cluster_slots_fail:0
    cluster_known_nodes:6
    cluster_size: 3
    cluster_current_epoch:8
    cluster_my_epoch:6
    cluster_stats_messages_ping_sent:615483
    cluster_stats_messages_pong_sent:610194
    cluster_stats_messages_meet_sent:3
    cluster_stats_messages_fail_sent:8
    cluster_stats_messages_auth-req_sent:5
    cluster_stats_messages_auth-ack_sent:2
    cluster_stats_messages_update_sent:4
    cluster_stats_messages_sent:1225699
    cluster_stats_messages_ping_received:610188
    cluster_stats_messages_pong_received:603593
    cluster_stats_messages_meet_received:2
    cluster_stats_messages_fail_received:4
    cluster_stats_messages_auth-req_received:2
    cluster_stats_messages_auth-ack_received:2
    cluster_stats_messages_received:1213791
    

    ./bin/redis-cli -h 10.10.28.2 -p 25661 cluster nodes

    5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662@35662 slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921769000 15 connected
    79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651@35651 master - 0 1595921770000 18 connected 4096-6143
    2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661@35661 myself,master - 0 1595921759000 15 connected 10240-12287
    6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686@35686 master - 0 1595921769000 14 connected 12288-14335
    5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656@35656 master - 0 1595921771000 13 connected 14336-16383
    f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652@35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921769000 18 connected
    f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666@35666 master - 0 1595921769870 16 connected 8192-10239
    f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687@35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921763000 14 connected
    f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667@35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921770870 16 connected
    ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681@35681 master - 0 1595921773876 0 connected 0-2047
    19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677@35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921768000 17 connected
    d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671@35671 master - 0 1595921773000 6 connected 2048-4095
    068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682@35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921771872 12 connected
    e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672@35672 slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921770000 6 connected
    f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657@35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921762000 13 connected
    5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676@35676 master - 0 1595921772873 17 connected 6144-8191
    

    ./bin/redis-cli -h 10.10.28.3 -p 25662 cluster nodes

    f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652@35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921741000 18 connected
    f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666@35666 master - 0 1595921744000 16 connected 8192-10239
    f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657@35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921740000 13 connected
    5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676@35676 master - 0 1595921743127 17 connected 6144-8191
    79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651@35651 master - 0 1595921743000 18 connected 4096-6143
    2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661@35661 master - 0 1595921744129 15 connected 10240-12287
    f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667@35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921740000 16 connected
    f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687@35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921745130 14 connected
    5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662@35662 myself,slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921733000 5 connected 0-1820
    068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682@35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921744000 12 connected
    d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671@35671 master - 0 1595921739000 6 connected 2048-4095
    5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656@35656 master - 0 1595921742000 13 connected 14336-16383
    ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681@35681 master - 0 1595921746131 0 connected 1821-2047
    6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686@35686 master - 0 1595921747133 14 connected 12288-14335
    19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677@35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921742126 17 connected
    e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672@35672 slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921745000 6 connected
    

    ./bin/redis-cli -h 10.10.49.9 -p 25672 cluster nodes

    d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671@35671 master - 0 1595921829000 6 connected 2048-4095
    79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651@35651 master - 0 1595921830000 18 connected 4096-6143
    ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681@35681 master - 0 1595921830719 0 connected 0-1820
    f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687@35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921827000 14 connected
    5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676@35676 master - 0 1595921827000 17 connected 6144-8191
    2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661@35661 master - 0 1595921822000 15 connected 10240-12287
    5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662@35662 slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921828714 15 connected
    068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682@35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921832721 12 connected
    6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686@35686 master - 0 1595921825000 14 connected 12288-14335
    f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652@35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921830000 18 connected
    19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677@35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921829716 17 connected
    e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672@35672 myself,slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921832000 4 connected 1821-2047
    f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667@35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921826711 16 connected
    f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657@35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921829000 13 connected
    f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666@35666 master - 0 1595921831720 16 connected 8192-10239
    5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656@35656 master - 0 1595921827714 13 connected 14336-16383
    

    ./bin/redis-trib.rb check 10.10.30.9:25671

    >>> Performing Cluster Check (using node 10.10.30.9:25671)
    M: d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671
       slots:2048-4095 (2048 slots) master
       1 additional replica(s)
    S: e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672
       slots: (0 slots) slave
       ········
       ········
    S: f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657
       slots: (0 slots) slave
       replicates 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757
    M: 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656
       slots:14336-16383 (2048 slots) master
       1 additional replica(s)
    [ERR] Nodes don't agree about configuration!
    >>> Check for open slots...
    >>> Check slots coverage...
    [OK] All 16384 slots covered.
    

    Be suspicious of everything and stay vigilant; diligence can make up for one's weaknesses.

    1. At first I assumed the cluster was healthy, because in production most service nodes were fine, only a few had errors, and restarting those made the problem go away.
    2. The cluster information was initially checked through healthy nodes, so nothing looked wrong. Even though the logs showed early on that the topology reported by several nodes was inconsistent, the problem was hard to see without understanding how the client builds its slot mapping.
    3. Comparing the cluster nodes output of different nodes showed that some slots were mapped to slave nodes.
    4. Running the check reported "[ERR] Nodes don't agree about configuration!", which confirmed a problem in the cluster itself; this is discussed in related open-source issues.
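
    The comparison in step 3 can be automated. Below is a rough sketch that scans the text of `cluster nodes` output and reports every node flagged as slave that still lists slots. `ClusterNodesCheck` and `slavesClaimingSlots` are names invented for this example, and it assumes the usual field order of that output: id, address, flags, master id, ping-sent, pong-recv, config-epoch, link-state, then slots.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Scans CLUSTER NODES output for slaves that still claim slots. */
final class ClusterNodesCheck {

    /** Returns "address slots" for each line whose flags contain "slave" and which lists slots. */
    static List<String> slavesClaimingSlots(List<String> clusterNodesLines) {
        List<String> suspicious = new ArrayList<>();
        for (String line : clusterNodesLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length <= 8) {
                continue; // eight fixed fields only, so no slots are listed
            }
            boolean slave = Arrays.asList(fields[2].split(",")).contains("slave");
            if (slave) {
                String slots = String.join(" ", Arrays.copyOfRange(fields, 8, fields.length));
                suspicious.add(fields[1] + " " + slots);
            }
        }
        return suspicious;
    }
}
```

    Fed with the output of 10.10.28.3:25662 above, it would report `10.10.28.3:25662@35662 0-1820`. The check has to be run against the output of every node, because healthy nodes report a consistent view and hide the problem.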
  3. I saw the same issue and investigated it.

    I figured out that it is caused by Lettuce.

    When we run a Redis command, Lettuce analyzes it to decide which Redis endpoint the command should be sent to.

    If it is a read command, it is sent to a slave node (when ReadFrom is set to read from replicas, e.g. ReadFrom.ANY_REPLICA in recent Lettuce versions). Note that other ReadFrom options may change this behavior.

    If it is a write command, it is sent to the master node.

    To decide which commands are read commands, Lettuce uses the ReadOnlyCommands class, which lists all read commands.

    In my case, I used the EVAL command to write a key and value to Redis, but Lettuce classified it as a read command and sent it to a slave node, which caused the exception.

    So check the ReadOnlyCommands class and make sure your write command is not listed there. This was a mistake on the Lettuce side, and it has already been fixed in newer versions.

    In your version, ReadOnlyCommands for cluster connections is:

    class ReadOnlyCommands {
    
    private static final Set<CommandType> READ_ONLY_COMMANDS = EnumSet.noneOf(CommandType.class);
    
    static {
        for (CommandName commandNames : CommandName.values()) {
            READ_ONLY_COMMANDS.add(CommandType.valueOf(commandNames.name()));
        }
    }
    
    /**
     * @param protocolKeyword must not be {@literal null}.
     * @return {@literal true} if {@link ProtocolKeyword} is a read-only command.
     */
    public static boolean isReadOnlyCommand(ProtocolKeyword protocolKeyword) {
        return READ_ONLY_COMMANDS.contains(protocolKeyword);
    }
    
    /**
     * @return an unmodifiable {@link Set} of {@link CommandType read-only} commands.
     */
    public static Set<CommandType> getReadOnlyCommands() {
        return Collections.unmodifiableSet(READ_ONLY_COMMANDS);
    }
    
    enum CommandName {
        ASKING, BITCOUNT, BITPOS, CLIENT, COMMAND, DUMP, ECHO, EVAL, EVALSHA, EXISTS, //
        GEODIST, GEOPOS, GEORADIUS, GEORADIUSBYMEMBER, GEOHASH, GET, GETBIT, //
        GETRANGE, HEXISTS, HGET, HGETALL, HKEYS, HLEN, HMGET, HSCAN, HSTRLEN, //
        HVALS, INFO, KEYS, LINDEX, LLEN, LRANGE, MGET, PFCOUNT, PTTL, //
        RANDOMKEY, READWRITE, SCAN, SCARD, SCRIPT, //
        SDIFF, SINTER, SISMEMBER, SMEMBERS, SRANDMEMBER, SSCAN, STRLEN, //
        SUNION, TIME, TTL, TYPE, ZCARD, ZCOUNT, ZLEXCOUNT, ZRANGE, //
        ZRANGEBYLEX, ZRANGEBYSCORE, ZRANK, ZREVRANGE, ZREVRANGEBYLEX, ZREVRANGEBYSCORE, ZREVRANK, ZSCAN, ZSCORE, //
    
        // Pub/Sub commands are no key-space commands so they are safe to execute on slave nodes
        PUBLISH, PUBSUB, PSUBSCRIBE, PUNSUBSCRIBE, SUBSCRIBE, UNSUBSCRIBE
    }
    }
    

    So you can check easily.
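
    For example, a quick stand-alone check could look like the sketch below. It does not call Lettuce: `ReadOnlyListCheck` and `treatedAsReadOnly` are made-up names, and the set is only a subset of the names copied by hand from the enum quoted above, so extend it with the full list before relying on it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Stand-alone sketch that mirrors part of the quoted read-only command list. */
final class ReadOnlyListCheck {

    // Subset of the command names from the enum quoted in this answer.
    static final Set<String> READ_ONLY = new HashSet<>(Arrays.asList(
            "EVAL", "EVALSHA", "EXISTS", "GET", "HGET", "MGET", "PUBLISH", "SCRIPT", "TTL", "TYPE"));

    /** True if the command name appears in the copied read-only list (case-insensitive). */
    static boolean treatedAsReadOnly(String commandName) {
        return READ_ONLY.contains(commandName.toUpperCase(Locale.ROOT));
    }
}
```

    Here `treatedAsReadOnly("EVAL")` is true, which matches what I ran into, while `treatedAsReadOnly("SETEX")` is false for this list.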

    Solution: upgrading Lettuce is the best option. Alternatively, you can try to override this setting.
