We run a Kubernetes-compatible (OKD 3.11) on-prem / private cloud cluster with backend apps communicating with low-latency Redis databases used as caches and K/V stores. The new architecture design is about to divide the worker nodes equally between two geographically distributed data centers ("regions"). We can assume static pairing between node names and regions, and we have now also labeled the nodes with their region names.
What would be the recommended approach to protect low-latency communication with the in-memory databases, making client apps stick to the same region as the database they are allowed to use? Spinning up additional replicas of the databases is feasible, but does not prevent round-robin routing between the two regions…
2 Answers
So an effective region-pinning solution is more complex than just using `nodeAffinity` in the "preferred" version. That alone will cause you a lot of unpredictable surprises due to the opinionated character of Kubernetes, which has zone spreading hard-coded, as seen in this GitHub issue, where they clearly try to put at least some eggs in another basket and see zone selection as an antipattern.

In practice, the usefulness of `nodeAffinity` alone is restricted to scenarios with a very limited number of pod replicas: once the number of pods exceeds the number of nodes in a region (typically with the 3rd replica in a 2-node / 2-region setup), the scheduler will start "correcting" or "fighting" the user's preference weights (even ones as unbalanced as 100:1) very much in favor of spreading, placing at least one "representative" pod on every node in every region (including the non-preferred ones with the minimum possible weight of 1).

But this default zone spreading can be overcome if you create a single-replica container that acts as a "master" or "anchor" (a natural example being a database). For this single-pod "master", `nodeAffinity` will still work correctly, provided you use the HA-friendly "preferred" rather than "required" version. For the remaining multi-pod apps, you use something else: `podAffinity` (this time in the "required" version), which will make the "slave" pods follow their "master" between zones, because setting any pod-based spreading disables the default zone spreading. You can have as many replicas of the "slave" pods as you want and never run into a single misplaced pod (at least at schedule time), because of the "required" affinity used for the "slaves". Note that the known limitation of `nodeAffinity` applies here as well: the number of "master" pod replicas must not exceed the number of nodes in a region, or else "zone spreading" will kick in.

And here's an example of how to label the "master" pod correctly for the benefit of `podAffinity`, using a deployment config YAML file: https://stackoverflow.com/a/70041308/9962007

Posting this out of comments as community wiki for better visibility; feel free to edit and expand.
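A minimal sketch of the master/slave pattern described above (all names, labels, and images are hypothetical; the `region` node label is the custom one mentioned in the question):

```yaml
# Single-replica "master" (the Redis database), pinned to one region
# with *preferred* nodeAffinity so it can still fail over if needed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-master
  template:
    metadata:
      labels:
        app: redis-master
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: region            # custom node label from the question
                operator: In
                values: ["region-a"]
      containers:
      - name: redis
        image: redis:6
---
# Multi-replica "slave" app pods follow the master via *required* podAffinity,
# which also disables the scheduler's default zone spreading for them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: backend-app
  template:
    metadata:
      labels:
        app: backend-app
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: redis-master
            topologyKey: region        # co-locate in the master's region
      containers:
      - name: backend
        image: example/backend:latest
```

With this layout, all `backend-app` replicas are scheduled only onto nodes whose `region` label matches that of the node running `redis-master`, regardless of the replica count.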
The best option to solve this is to use Istio's Locality Load Balancing; see the major points at the link.

Another option that can be considered, combined with traffic-balancing adjustments, is suggested in the comments: use `nodeAffinity` to achieve consistency between scheduling pods and nodes in specific "regions".

Update: based on @mirekphd's comment, this will still not be fully functional in the way that was asked. Please refer to @mirekphd's answer.
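For illustration, Istio's locality load balancing is configured via a `DestinationRule` (a sketch with hypothetical host and region names; note that `outlierDetection` must be set for locality-aware routing to take effect):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: redis-locality
spec:
  host: redis.example-ns.svc.cluster.local   # hypothetical service host
  trafficPolicy:
    # Outlier detection is required; without it, locality LB is disabled.
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Prefer same-region endpoints; fail over only if they are unhealthy.
        failover:
        - from: region-a
          to: region-b
```

Istio derives each endpoint's locality from the standard `topology.kubernetes.io/region` node labels, so the cluster's region labeling would need to use (or be mirrored to) those well-known keys.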