[META] Making all copies of shards spread evenly across all Awareness Attribute

Is your feature request related to a problem? Please describe.

In cloud HA deployments, customers usually deploy across multiple zones, and the zone is usually the value of awareness.attributes. However, nothing enforces that all copies of a shard are spread evenly across all zones. This can cause uneven shard distribution and create shard hotspots. A failure in a single zone might also cause data loss and unavailability for a shard if its copies aren’t evenly spread out.

Describe the solution you’d like

There are two possible solutions to this problem:

  1. [Chosen Approach] A boolean cluster-level setting, routing.allocation.awareness.balance, which is false by default. When true, we would validate that the total number of copies is always a multiple of the awareness attribute value count. If not, we will throw a validation exception. If there are multiple awareness attributes, the balance needs to ensure that every awareness attribute is equally balanced. For example, if there are 2 awareness attributes, zones and rack IDs, each having 2 possible values, the total number of copies needs to be a multiple of 2.
  2. A boolean cluster-level setting, auto_balance_across_awareness_attribute. If this is true, we would increase the total number of copies to be a multiple of the AZ count. For instance, suppose there are 3 AZs and an index creation request comes with 7 replicas. OpenSearch will create 8 replicas to ensure that there are 9 copies in total (1 primary plus 8 replicas).
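The arithmetic behind the two options can be sketched as follows. This is only an illustration and not OpenSearch code: the function names are hypothetical, a single awareness attribute is assumed, and total copies are counted as one primary plus the requested replicas.

```python
def validate_total_copies(replicas: int, value_count: int) -> None:
    """Option 1 (hypothetical helper): reject when 1 primary + replicas
    is not a multiple of the awareness attribute value count."""
    total_copies = replicas + 1
    if total_copies % value_count != 0:
        raise ValueError(
            f"total copies ({total_copies}) must be a multiple of "
            f"the awareness attribute value count ({value_count})"
        )


def auto_expand_replicas(replicas: int, value_count: int) -> int:
    """Option 2 (hypothetical helper): round the replica count up so that
    1 primary + replicas becomes a multiple of value_count."""
    total_copies = replicas + 1
    remainder = total_copies % value_count
    if remainder == 0:
        return replicas
    return replicas + (value_count - remainder)


# The example from the issue: 3 AZs and a request for 7 replicas.
adjusted = auto_expand_replicas(7, 3)
print(adjusted, adjusted + 1)
```

With 3 AZs and 7 requested replicas, the second helper returns 8 replicas, which gives 9 copies in total, while the first helper would reject the original request because 8 copies is not a multiple of 3.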

Both solutions take effect only when cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values are set. If they are not, the setting has no effect.

Trade offs

First approach: Plugins such as ISM and CCR need to validate proactively when a policy is created or updated. If they don't, the actions/replication will fail silently at a later point in time. Whenever new policies or index creation paths are added, we will need to keep adding the validation there for a good experience.

Second approach: Since the replica count is adjusted by OpenSearch, plugins and new index creation/modification paths don’t need any special handling, so this approach is very low maintenance. However, deviating from the API-supplied parameter may not make for a good user experience.

User Experience

  1. The user sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values.
  2. If the user enables routing.allocation.awareness.balance, the total number of copies must be a multiple of the number of possible values of the awareness attribute. If not, we will do one of the following:
  • Reject the create/update index request.
  • Auto-expand the replica count as needed.
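As a rough sketch of the first step, the settings could be sent as the body of a cluster settings update. The setting keys below are copied from this issue, including the balance key as it is written here; the zone names and helper functions are hypothetical.

```python
import json

# Setting keys as written in the issue text; the zone names are hypothetical.
ATTRIBUTES_KEY = "cluster.routing.allocation.awareness.attributes"
FORCE_ZONE_VALUES_KEY = "cluster.routing.allocation.awareness.force.zone.values"
BALANCE_KEY = "routing.allocation.awareness.balance"


def build_settings(zones):
    """Build a request body for a persistent cluster settings update."""
    return {
        "persistent": {
            ATTRIBUTES_KEY: "zone",
            FORCE_ZONE_VALUES_KEY: ",".join(zones),
            BALANCE_KEY: True,
        }
    }


def balance_is_effective(persistent):
    """The balance flag only matters when both awareness settings are set."""
    return (
        bool(persistent.get(ATTRIBUTES_KEY))
        and bool(persistent.get(FORCE_ZONE_VALUES_KEY))
        and persistent.get(BALANCE_KEY) is True
    )


body = build_settings(["zone-a", "zone-b", "zone-c"])
print(json.dumps(body, indent=2))
```

The second helper mirrors the precondition stated earlier: enabling the balance flag without the two awareness settings would have no effect.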

Why it should be built

This ensures that an OpenSearch cluster remains well balanced and resilient to zone or rack failures.

What will it take to execute?

Changes in OpenSearch as well as in plugins to honor the new flag.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
kartg commented, Aug 3, 2022

@gbbafna can this issue be closed? I see https://github.com/opensearch-project/OpenSearch/issues/3461, which tracks the first solution here, with https://github.com/opensearch-project/OpenSearch/pull/3462 as the PR to main and https://github.com/opensearch-project/OpenSearch/pull/4086 as the backport to 2.x.

1 reaction
elfisher commented, Jul 21, 2022

Thanks! I see it now. Can we also open an issue in the docs repo to track any documentation updates that might need to happen for this?
