
[META] Making all copies of shards spread evenly across all Awareness Attribute


Is your feature request related to a problem? Please describe.

In cloud HA deployments, customers usually deploy across multiple zones, and the zone is usually what is configured in awareness.attributes. However, nothing enforces that all copies of a shard are spread evenly across all zones. This can cause uneven distribution of shards and create shard hotspots. A failure in a single zone might also cause data loss and unavailability for a shard if its copies aren’t evenly spread out.

Describe the solution you’d like

There are two possible solutions:

[Chosen Approach] A boolean cluster-level setting, routing.allocation.awareness.balance, which is false by default. When true, we would validate that the total number of copies is always a multiple of the number of values of the awareness attribute. If not, we will throw a validation exception. If there are multiple awareness attributes, the balance check needs to ensure that every awareness attribute is equally balanced. For example, if there are 2 awareness attributes, zones and rack IDs, each having 2 possible values, the total number of copies needs to be a multiple of 2.
A boolean cluster-level setting, auto_balance_across_awareness_attribute. If this is true, we would increase the total number of copies to be a multiple of the AZ count. For instance, suppose there are 3 AZs and an index creation request comes in with 7 replicas. OpenSearch will create 8 replicas, so that there are 9 copies in total (1 primary plus 8 replicas).
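
The arithmetic behind the two options can be sketched as follows. This is a minimal illustration in Python, not the OpenSearch implementation: the function names are hypothetical, and treating multiple awareness attributes by taking the least common multiple of their value counts is an assumption that merely agrees with the "2 attributes with 2 values each" example above.

```python
from math import lcm  # requires Python 3.9+


def required_multiple(attribute_value_counts):
    """Number the total copy count must be a multiple of.

    Assumption: with several awareness attributes, the copy count must be
    a multiple of every attribute's value count, i.e. of their LCM.
    """
    return lcm(*attribute_value_counts)


def validate_balanced(replicas, attribute_value_counts):
    """First option (hypothetical helper): reject unbalanced copy counts."""
    copies = replicas + 1  # one primary plus its replicas
    multiple = required_multiple(attribute_value_counts)
    if copies % multiple != 0:
        raise ValueError(
            f"total copies {copies} is not a multiple of {multiple}"
        )


def auto_expand_replicas(replicas, attribute_value_counts):
    """Second option (hypothetical helper): round total copies up."""
    copies = replicas + 1
    multiple = required_multiple(attribute_value_counts)
    rounded_copies = -(-copies // multiple) * multiple  # ceiling to a multiple
    return rounded_copies - 1  # convert back to a replica count
```

With 3 AZs and a request for 7 replicas, `auto_expand_replicas(7, [3])` rounds the 8 requested copies up to 9 and returns 8 replicas, while `validate_balanced(7, [3])` raises an error instead.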

Both solutions take effect only when cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values are set. If they are not, the setting has no effect.
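
A rough sketch of that precondition, assuming settings are available as a flat dictionary of strings and booleans. The setting names are copied from the text above; the helper names, and the generalization from force.zone.values to one force setting per awareness attribute, are assumptions made for illustration.

```python
# Setting names as written in the issue text.
ATTRIBUTES_KEY = "cluster.routing.allocation.awareness.attributes"
BALANCE_KEY = "routing.allocation.awareness.balance"


def force_values_key(attribute):
    """Hypothetical helper: force-values setting name for one attribute."""
    return f"cluster.routing.allocation.awareness.force.{attribute}.values"


def balance_is_effective(settings):
    """Return True only if the balance flag is on and its prerequisites are set."""
    if not settings.get(BALANCE_KEY, False):  # flag is false by default
        return False
    attributes = [
        name.strip()
        for name in settings.get(ATTRIBUTES_KEY, "").split(",")
        if name.strip()
    ]
    if not attributes:
        return False
    # Assumption: every awareness attribute needs its forced values listed.
    return all(
        settings.get(force_values_key(name), "").strip() for name in attributes
    )
```

For example, a dictionary that enables the flag and sets the attributes to "zone" but omits the force.zone.values entry would be treated as if the flag were off.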

Trade offs

First approach: Plugins such as ISM and CCR need to validate proactively when a policy is created or updated. If they do not, the actions/replication will fail silently at a later point in time. As new policies or index creation paths are added, we will need to keep adding the validation there for a good experience.

Second approach: Since the replica count is adjusted by OpenSearch, plugins and new index creation/modification paths don’t need any special handling, so this is very low maintenance. However, deviating from the parameter supplied through the API may not be a good user experience.

User Experience

The user sets cluster.routing.allocation.awareness.attributes and cluster.routing.allocation.awareness.force.zone.values.
If the user enables routing.allocation.awareness.balance, the total number of copies needs to be a multiple of the number of possible values of the awareness attribute. If not, we will do one of the following:

Reject the create/update index
Auto-expand the replica count as needed.

Why it should be built

This ensures that the OpenSearch cluster remains well balanced as well as resilient to failures of a zone, rack, etc.

What will it take to execute?

Changes in OpenSearch as well as in plugins to honor the new flag.

Issue Analytics

State:
Created a year ago
Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction

kartg commented, Aug 3, 2022

@gbbafna can this issue be closed? I see https://github.com/opensearch-project/OpenSearch/issues/3461 which tracks the first solution here, with https://github.com/opensearch-project/OpenSearch/pull/3462 as the PR to main and https://github.com/opensearch-project/OpenSearch/pull/4086 as the backport to 2.x

1 reaction

elfisher commented, Jul 21, 2022

Thanks! I see it now. Can we also open an issue in the docs repo to track any documentation updates that might need to happen for this?
