[Internal] AllVersionsAndDeletes on change feed pull model for partition split-handling caching design documentation. Compute Gateway.


Purpose statement

This document describes a design to enhance the Cosmos DB experience by achieving even higher performance.

Description: The plan is to improve latency and overall performance for change feed pull-model requests in Azure Cosmos DB while in AllVersionsAndDeletes (preview) change feed mode by introducing a caching strategy, local to Compute Gateway, for a collection’s physical partition archival lineage. The archival lineage is a routing map that instructs Compute Gateway on how to drain documents when a change feed request is received. It is driven by each physical partition’s minimum and maximum log sequence numbers (LSNs), which are obtained through a change feed information request to the Backend API. This will support all SDK languages, specifically the .NET and Java SDKs. Tenant-configuration feature flags and additional diagnostic logging will need to be implemented as well. This issue will be split into multiple PRs (caching, feature flag, and logging). The tenant-configuration feature flag is a fail-safe in case the caching strategy does not work as expected.

Level-setting

Tasks

  • Identifying stakeholders
  • Understanding the current source architecture
  • Detailing the target future architecture and gaps
  • Describing the testing strategy
  • Identifying performance and security concerns
  • Defining out of scope and future states
  • Architecture katas
  • Outlining supportability, manageability, and configuration
  • Providing visual aids (diagrams, flow charts, C4, etc.)
  • Estimating time to complete PR
  • Managing and configuring the feature flag

Stakeholders

  • Azure Cosmos DB .NET SDK for NoSQL
    • Philip Thomas
  • Compute Gateway
    • Philip Thomas
    • Dmitri Melnikov
  • Backend
    • Gopal Rander
    • Michael Koltachev
  • Materialized View
    • Hemeswari Varada
    • Sarwesh Krishnan
    • Abhijit P. Pai
  • Other collaborations
    • Kiran Kumar Koli
    • Justin Cocchi

Resources

Out of scope

  • Change feed push model.
  • Incremental (LatestVersion) change feed mode.
  • Global, or distributed, caching.
  • Time-based, or state-based, invalidation.
  • Removing the MAX LSN model; this was addressed in a previous PR.
  • No public-facing contracts need to be updated, deleted, or added, with one caveat: force cache refresh and the feature flag could possibly be driven by logic within the SDK client, but more discussion is needed first. My first reaction would be no; because this is a major performance boost, there is no need to let consumers opt out.

Scope of work

The Microsoft Azure Cosmos DB .NET SDK Version 3 needs to achieve optimal performance by implementing a local caching strategy in Compute Gateway for all change feed requests while in AllVersionsAndDeletes (preview) change feed mode. Introducing a caching strategy, with additional trace logging and a feature flag, for the collection’s physical partition archival lineage will improve performance by accessing a cache that is local to Compute Gateway.

Criteria for caching

Caching applies when a collection’s physical partition has split. The logic to construct the collection’s physical partition archival lineage is solely determined by whether that physical partition returned HTTP status code 410 (Gone) when the change feed requested items. A new lineage is built and cached only when both of the following hold (see the sketch after this list):

  • The IsPassthrough state is false.
  • The cached item does not have a key that represents the current state of the collection (parent/child partition relationship).
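
A minimal sketch of that check, assuming hypothetical names (ArchivalTreeCacheEntry, ShouldBuildAndCacheLineage); the actual Compute Gateway types differ.

```csharp
using System.Collections.Generic;

public sealed class ArchivalTreeCacheEntry { } // placeholder for the cached lineage

public static class CachingCriteria
{
    // A new archival lineage is built and cached only when the request is
    // not a passthrough and no entry keyed on the collection's current
    // parent/child partition state exists. Names here are illustrative.
    public static bool ShouldBuildAndCacheLineage(
        bool isPassthrough,
        string currentStateKey,
        IReadOnlyDictionary<string, ArchivalTreeCacheEntry> cache)
    {
        if (isPassthrough)
        {
            // Passthrough requests never touch the archival lineage cache.
            return false;
        }

        // A cache miss on the key representing the collection's current
        // state means the lineage must be (re)built and cached.
        return !cache.ContainsKey(currentStateKey);
    }
}
```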

Current baseline architecture

Too many unnecessary Backend requests affect latency and overall performance

Currently, we do not support any caching strategy: every change feed request in AllVersionsAndDeletes (preview) change feed mode requests additional minimum and maximum log sequence numbers for every partition that exists within a collection’s partition archival lineage. If change feed requests are repeatedly sent to the same collection to exhaust change feed items, then a cache of that collection’s physical partition archival lineage, local to Compute Gateway, should be fetched and used to determine the physical partition routing strategy for draining documents on a live physical partition.

Today, the collection’s physical partition archival lineage is constructed and traversed for every change feed request in AllVersionsAndDeletes (preview) change feed mode. Constructing it increases latency because of the additional network hops to the Backend services for change feed information requests that return a physical partition’s minimum and maximum LSNs. For example, if a collection has a physical partition that has split, there are now 2 child physical partitions, and a change feed information request is made 2 times to get minimum and maximum LSNs for each child physical partition. If those child physical partitions split, the number increases, and so on. The more splits that occur, the more network hops to the Backend services, and the higher the latency for change feed requests in AllVersionsAndDeletes (preview) change feed mode. The sketch below illustrates how the request count grows with the split tree.
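
To make the cost concrete, the sketch below counts one change feed information request per partition in a split lineage. The Partition type and the exact accounting (for example, whether the root itself is queried) are illustrative assumptions, not Compute Gateway code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative only: one change feed information request (min/max LSN)
// per partition in the archival lineage. Partition is a stand-in type.
public sealed class Partition
{
    public string Id { get; init; } = "";
    public List<Partition> Children { get; } = new();
}

public static class LineageCost
{
    // One request per node; every split multiplies the total.
    public static int CountChangeFeedInfoRequests(Partition root) =>
        1 + root.Children.Sum(CountChangeFeedInfoRequests);
}
```

A single split yields 3 (root plus 2 children); if both children split again, 7. Whatever the exact per-node accounting in the product, the request count grows with the size of the split tree, which is the latency problem the cache removes.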

Proposed solution

  • Introduce a caching strategy local to the Compute Gateway, gated by a feature flag, EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching. This is a tenant-level feature flag.
    • Although the feature flag is a tenant-level configuration, it is not yet determined whether it should be public for consumers via the SDK client, or whether logic should exist to trigger it. I will open a dialog with the team. Exposing it would require an additional PR for the SDK along with a release. My first reaction would be no; because this is a major performance boost, there is no need to let consumers opt out.
  • Invalidation (eviction) policy based on the instance life of the Compute Gateway, plus an optional Force Cache Refresh (both shown in the sketch after this list).
  • Force Cache Refresh
    • Other than keeping this optional for testability purposes, it is not yet determined whether it should be public for consumers via the SDK client, or whether logic should exist to trigger it. I will open a dialog with the team. Exposing it would require an additional PR for the SDK along with a release. My first reaction would be no; because this is a major performance boost, there is no need to let consumers opt out.
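
A minimal sketch of the proposed flow under these controls, assuming hypothetical names (GetOrBuildAsync, buildArchivalTreeAsync); the real Compute Gateway AsyncCache differs from the ConcurrentDictionary-of-lazy-tasks approximation used here, but the flag gate, the optional force refresh, and the instance-lifetime eviction are the same ideas.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class ArchivalTree { } // placeholder for the cached lineage

public sealed class ArchivalTreeCacheSketch
{
    private readonly ConcurrentDictionary<string, Lazy<Task<ArchivalTree>>> cache = new();

    public Task<ArchivalTree> GetOrBuildAsync(
        string cacheKey,
        bool featureFlagEnabled,   // EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching
        bool forceRefresh,         // the optional Force Cache Refresh
        Func<Task<ArchivalTree>> buildArchivalTreeAsync)
    {
        if (!featureFlagEnabled)
        {
            // Fail-safe: with the flag off, build the lineage on every request.
            return buildArchivalTreeAsync();
        }

        if (forceRefresh)
        {
            this.cache.TryRemove(cacheKey, out _);
        }

        // Entries live for the life of this Compute Gateway instance;
        // there is no time- or state-based invalidation (see Out of scope).
        return this.cache.GetOrAdd(
            cacheKey,
            _ => new Lazy<Task<ArchivalTree>>(buildArchivalTreeAsync)).Value;
    }
}
```

Because the dictionary lives for the life of the Compute Gateway instance, entries are evicted only by instance recycle or an explicit force refresh, matching the proposed invalidation policy.
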
```mermaid
graph TD;
    A(get account endpoint)-->B;
    B(get collectionRid)-->C;
    C(create containerResourceId)-->D;
```


Branch

Cache

Key (ContainerResourceId + PartitionKeyRangeIds + Account)

{
	"Value": "pwd(I8Xc*9+=",
	"PartitionKeyRanges": [
		{
			"minInclusive": "00",
			"maxExclusive": "MM",
			"ridPrefix": null,
			"throughputFraction": 0.0,
			"status": "Invalid",
			"lsn": 0,
			"parents": ["0"],
			"id": "1",
			"_rid": null,
			"_self": null,
			"_ts": 0,
			"_etag": null
		},
		{
			"minInclusive": "MM",
			"maxExclusive": "FF",
			"ridPrefix": null,
			"throughputFraction": 0.0,
			"status": "Invalid",
			"lsn": 0,
			"parents": ["0"],
			"id": "2",
			"_rid": null,
			"_self": null,
			"_ts": 0,
			"_etag": null
		}
	],
	"AccountEndpoint": "http://testaccount.documents.azure.com"
}

How to determine the Caching Key

  • Scenario 1: Container A has no split partitions.
  • Scenario 2: Container B has split partitions.
  • Scenario 3: Container C has a forest with no split partitions.
  • Scenario 4: Container D has a forest with split partitions.

Containers A and C will not build archival trees because there are no split partitions. Containers A and C will not have caching strategies because there are no archival trees.

Because containers B and D have split partitions, B and D will build archival trees. Because containers B and D have archival trees, B and D will have caching strategies.

What is the difference between 2 containers that follow type Container B?

  • The unique identifier of the container.
  • The unique identifier of the account that the container belongs to.
  • Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds that exist for that container have changed.

NOTE:

The incomingPartitionKeyRangeId does not affect the uniqueness for this case because all incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, and since they share the same parentPartitionKeyRangeId, the archival tree would be the same for every incomingPartitionKeyRangeId.

Proposal:

The caching key should be composed of container identifier and account identifier with a refresh when partitionKeyRangeIds change.

What is the difference between 2 containers that follow type Container D?

  • The unique identifier of the container.
  • The unique identifier of the account that the container belongs to.
  • Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds that exist for that container have changed.

NOTE:

If the incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does not affect the uniqueness. If the incomingPartitionKeyRangeIds do not share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does affect the uniqueness.

Proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change.

What is the difference between 2 containers where one container follows type Container B and the other follows the structure of Container D?

  • The unique identifier of the container.
  • The unique identifier of the account that the container belongs to.
  • Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds that exist for that container have changed.

NOTE:

If the incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does not affect the uniqueness. If the incomingPartitionKeyRangeIds do not share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does affect the uniqueness.

Proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change.

Final proposal

The caching key should be composed of the container identifier, account identifier, and incomingPartitionKeyRangeId, with a refresh when partitionKeyRangeIds change. This handles all container types. A sketch of such a key follows.
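
As a concrete illustration of the final proposal, here is a minimal sketch of such a key as a C# record; the record shape, member names, and string format are assumptions, not the actual Compute Gateway key type.

```csharp
// Minimal sketch of the proposed caching key; all names are illustrative.
public sealed record ArchivalTreeCacheKey(
    string ContainerResourceId,          // unique identifier of the container
    string AccountEndpoint,              // unique identifier of the account
    string IncomingPartitionKeyRangeId,  // distinguishes forests with different roots
    string PartitionKeyRangeIds)         // e.g. "1,2,3,4" sorted; any change forces a rebuild
{
    public override string ToString() =>
        $"{AccountEndpoint}|{ContainerResourceId}|{IncomingPartitionKeyRangeId}|{PartitionKeyRangeIds}";
}
```

Record value equality makes two requests with the same container, account, incoming partition key range id, and partition set hit the same entry, while any change to the partitionKeyRangeIds produces a new key, which is the proposed refresh behavior.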


Concerns

Duplication of archival tree cached items: every incomingPartitionKeyRangeId that shares the same root parentPartitionKeyRangeId will have an identical cached item.

Value

{
	"ContainerResourceId": {
		"Value": "vBVWAK+-HQY="
	},
	"DateCreated": "2022-07-25T11:42:24.5483782Z",
	"DrainRoute": {
		"ParentToRoutePartitionItems": {
			"0": {
				"AdditionalContext": "Partition 0 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: True) and yields a MinLsn of 0 and a MaxLsn of 5000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 0
				},
				"MaxExclusive": "FF",
				"MaxLsn": 5000,
				"MinInclusive": "00",
				"MinLsn": 0,
				"RouteToPartitionKeyRangeId": {
					"Value": 2
				},
				"UseArchivalPartition": true
			},
			"1": {
				"AdditionalContext": "Partition 1 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: True) and yields a MinLsn of 5001 and a MaxLsn of 7500.",
				"CurrentPartitionKeyRangeId": {
					"Value": 1
				},
				"MaxExclusive": "MM",
				"MaxLsn": 7500,
				"MinInclusive": "00",
				"MinLsn": 5001,
				"RouteToPartitionKeyRangeId": {
					"Value": 4
				},
				"UseArchivalPartition": true
			},
			"2": {
				"AdditionalContext": "Partition 2 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: False) and yields a MinLsn of 5001 and a MaxLsn of 10000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 2
				},
				"MaxExclusive": "FF",
				"MaxLsn": 10000,
				"MinInclusive": "MM",
				"MinLsn": 5001,
				"RouteToPartitionKeyRangeId": {
					"Value": 2
				},
				"UseArchivalPartition": false
			},
			"3": {
				"AdditionalContext": "Partition 3 uses (RouteToPartitionKeyRangeId: 3, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 15000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 3
				},
				"MaxExclusive": "GG",
				"MaxLsn": 15000,
				"MinInclusive": "00",
				"MinLsn": 7501,
				"RouteToPartitionKeyRangeId": {
					"Value": 3
				},
				"UseArchivalPartition": false
			},
			"4": {
				"AdditionalContext": "Partition 4 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 20000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 4
				},
				"MaxExclusive": "MM",
				"MaxLsn": 20000,
				"MinInclusive": "GG",
				"MinLsn": 7501,
				"RouteToPartitionKeyRangeId": {
					"Value": 4
				},
				"UseArchivalPartition": false
			}
		},
		"SplitGraph": {
			"Root": {
				"PartitionKeyRangeId": 0,
				"MinInclusive": "00",
				"MaxExclusive": "FF",
				"Children": [
					{
						"PartitionKeyRangeId": 1,
						"MinInclusive": "00",
						"MaxExclusive": "MM",
						"Children": [
							{
								"PartitionKeyRangeId": 3,
								"MinInclusive": "00",
								"MaxExclusive": "GG",
								"Children": []
							},
							{
								"PartitionKeyRangeId": 4,
								"MinInclusive": "GG",
								"MaxExclusive": "MM",
								"Children": []
							}
						]
					},
					{
						"PartitionKeyRangeId": 2,
						"MinInclusive": "MM",
						"MaxExclusive": "FF",
						"Children": []
					}
				]
			},
			"SplitLineages": [
				{
					"Item1": [
						0,
						1
					],
					"Item2": {
						"Id": "1",
						"Parents": [
							"0"
						],
						"MinInclusive": "00",
						"MaxExclusive": "MM"
					}
				},
				{
					"Item1": [
						0,
						2
					],
					"Item2": {
						"Id": "2",
						"Parents": [
							"0"
						],
						"MinInclusive": "MM",
						"MaxExclusive": "FF"
					}
				},
				{
					"Item1": [
						0,
						1,
						3
					],
					"Item2": {
						"Id": "3",
						"Parents": [
							"0",
							"1"
						],
						"MinInclusive": "00",
						"MaxExclusive": "GG"
					}
				},
				{
					"Item1": [
						0,
						1,
						4
					],
					"Item2": {
						"Id": "4",
						"Parents": [
							"0",
							"1"
						],
						"MinInclusive": "GG",
						"MaxExclusive": "MM"
					}
				}
			]
		}
	},
	"IncomingPartitionKeyRangeId": {
		"Value": 1
	}
}

The collection’s partition archival lineage is constructed per collection, identified by ContainerResourceId. The collection’s partition archival lineage contains the DateCreated, the DrainRoute, and the IncomingPartitionKeyRangeId. There is a question as to whether the IncomingPartitionKeyRangeId is necessary, so it may go away; if so, I will update this document accordingly, but at the time of writing, it exists. A hypothetical C# shape of this value is sketched after the field breakdown below.

ContainerResourceId {"Value":"pwd(I8Xc*9+=","PartitionKeyRanges"}

DateCreated "DateCreated": "2022-07-25T11:42:24.5483782Z"

DrainRoute

"0": {
				"AdditionalContext": "Partition 0 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: True) and yields a MinLsn of 0 and a MaxLsn of 5000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 0
				},
				"MaxExclusive": "FF",
				"MaxLsn": 5000,
				"MinInclusive": "00",
				"MinLsn": 0,
				"RouteToPartitionKeyRangeId": {
					"Value": 2
				},
				"UseArchivalPartition": true
			},
			"1": {
				"AdditionalContext": "Partition 1 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: True) and yields a MinLsn of 5001 and a MaxLsn of 7500.",
				"CurrentPartitionKeyRangeId": {
					"Value": 1
				},
				"MaxExclusive": "MM",
				"MaxLsn": 7500,
				"MinInclusive": "00",
				"MinLsn": 5001,
				"RouteToPartitionKeyRangeId": {
					"Value": 4
				},
				"UseArchivalPartition": true
			},
			"2": {
				"AdditionalContext": "Partition 2 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: False) and yields a MinLsn of 5001 and a MaxLsn of 10000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 2
				},
				"MaxExclusive": "FF",
				"MaxLsn": 10000,
				"MinInclusive": "MM",
				"MinLsn": 5001,
				"RouteToPartitionKeyRangeId": {
					"Value": 2
				},
				"UseArchivalPartition": false
			},
			"3": {
				"AdditionalContext": "Partition 3 uses (RouteToPartitionKeyRangeId: 3, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 15000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 3
				},
				"MaxExclusive": "GG",
				"MaxLsn": 15000,
				"MinInclusive": "00",
				"MinLsn": 7501,
				"RouteToPartitionKeyRangeId": {
					"Value": 3
				},
				"UseArchivalPartition": false
			},
			"4": {
				"AdditionalContext": "Partition 4 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 20000.",
				"CurrentPartitionKeyRangeId": {
					"Value": 4
				},
				"MaxExclusive": "MM",
				"MaxLsn": 20000,
				"MinInclusive": "GG",
				"MinLsn": 7501,
				"RouteToPartitionKeyRangeId": {
					"Value": 4
				},
				"UseArchivalPartition": false
			}
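
For readability, here is a hypothetical C# shape that mirrors the cached value’s fields above (ContainerResourceId, DateCreated, the DrainRoute’s ParentToRoutePartitionItems, and IncomingPartitionKeyRangeId). It is an illustrative sketch, not the actual Compute Gateway contract; SplitGraph is omitted, which lines up with PR#3’s proposal to stop caching it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape mirroring the JSON sample above; names and types are
// assumptions for illustration. The { "Value": n } wrapper objects in the
// JSON are flattened to plain ints here for brevity.
public sealed record RoutePartitionItem(
    string AdditionalContext,
    int CurrentPartitionKeyRangeId,
    string MinInclusive,
    string MaxExclusive,
    long MinLsn,
    long MaxLsn,
    int RouteToPartitionKeyRangeId,
    bool UseArchivalPartition);

public sealed record ArchivalLineageCacheValue(
    string ContainerResourceId,
    DateTime DateCreated,
    IReadOnlyDictionary<string, RoutePartitionItem> ParentToRoutePartitionItems,
    int IncomingPartitionKeyRangeId); // may be removed per the note above
```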

Performance

  • Latency
    • Latency is decreased due to accessing the Compute Gateway’s local cache to retrieve the collection’s partition archival lineage instead of making a network request to the Backend services to obtain each partition’s minimum and maximum log sequence numbers.

Security

  • No security concerns at the time of document creation.

Areas of impact

  • PR#1
    • https://msdata.visualstudio.com/DefaultCollection/CosmosDB/_git/CosmosDB/pullrequest/1088615
    • Introducing Local Compute Gateway Caching
      • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/ArchivalTree
        • There may be some methods that are no longer relevant and might need to be cleaned up or refactored for efficiency, but nothing directly related to caching.
      • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHandler
        • Life of cache begins at the handler. Introducing AsyncCache.
      • New types
        • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/Cache/AsyncArchivalTreeCache
    • Introducing Feature Flag, EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching.
      • Product/Cosmos/Compute/Configuration/TenantConfiguration
      • Product/Cosmos/Compute/Configuration/TenantConfigurationKeys
      • Product/Cosmos/Sql/Service/SqlApiOperationHandler
      • Product/Cosmos/Sql/Service/SqlConfiguration
      • Product/Cosmos/Sql/Service/SqlxQueryOperationHandler
      • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/RawRequestFeatures
      • If we expose this to the SDK client, then changes would have to be made to DocumentServiceRequest
  • PR#2
    • Introducing Caching Trace Logging
      • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHelper
        • Proposing to add a new trace child for building the archival tree when it is not cached, labeled FullFidelityChangeFeedHandler BuildArchivalTreeAsync.
      • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/ArchivalTree
        • AddDatum("Creating archival tree", archivalTree) already exists and will belong to the new trace child.
      • Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHandler
        • Proposing to add a new trace child for setting or getting the archival tree from cache, labeled FullFidelityChangeFeedHandler ArchivalTree Cached Item Trace.
  • PR#3 Removing SplitGraph and SplitLineage from ArchivalTree so that it is not cached.
    • There is a significant number of changes required on both the code as well as the tests, so I am creating a PR just to deal with that.
  • PR#4 Hierarchical Caching
    • Section explanation coming soon.
    • Refactor IncomingPartitionKeyRangeId out of ArchivalTree (suggested by @kirankumarkolli).
    • Make BuildArchivalTree and CreateDrainRoute async (suggested by @kirankumarkolli); the rationale still needs to be confirmed.
  • Benchmarking Compute Gateway, specifically Friends.

Estimation for deliverables

  • 1-2 days coding per PR
  • 1-2 days testing per PR
  • 5+ days awaiting approval per PR

Supportability

Client telemetry: TBD

Distributed tracing: TBD

Diagnostic logging

  • If EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching is enabled, then include the following (a sketch follows this list):
    • EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching: true
    • The collection’s partition archival lineage in the diagnostic logs.
      • Please refer to the Cache section under Proposed solution for a sample of the collection’s partition archival lineage.
  • Observations: the time spent in Compute Gateway should be significantly shorter, especially when the collection has more splits.
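
A minimal sketch of that logging, reusing the AddDatum pattern mentioned under Areas of impact; the ITrace stand-in and member names are assumptions, not the product’s tracing API.

```csharp
// Minimal sketch of the proposed diagnostic logging; ITrace and AddDatum
// stand in for the product's tracing abstraction.
public interface ITrace
{
    void AddDatum(string key, object value);
}

public static class CachingDiagnostics
{
    public static void Log(ITrace trace, bool featureFlagEnabled, object archivalLineage)
    {
        if (!featureFlagEnabled)
        {
            return;
        }

        trace.AddDatum("EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching", true);
        trace.AddDatum("ArchivalLineage", archivalLineage); // see the Cache sample above
    }
}
```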

Testing

Use case/scenarios

  • Test behavior when the feature flag is disabled.
  • Test behavior when the feature flag is enabled, force refresh is disabled, and the collection’s partition archival lineage is not cached.
  • Test behavior when the feature flag is enabled, force refresh is disabled, and the collection’s partition archival lineage is cached.
  • Test behavior when the feature flag is enabled, force refresh is enabled, and the collection’s partition archival lineage is not cached.
  • Test behavior when the feature flag is enabled, force refresh is enabled, and the collection’s partition archival lineage is cached.
  • Test that the feature flag and the collection’s partition archival lineage are included in the diagnostic logs when the feature flag is enabled.

The sketch after this list expresses the scenario matrix as a parameterized test.
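
This is a sketch only; xUnit and every name here are assumptions about the harness, not the gated pipeline’s actual framework. The inline ServesFromCache helper mirrors the proposed decision so the matrix is executable.

```csharp
using Xunit;

public class ArchivalTreeCachingScenarioTests
{
    // Mirrors the proposed decision (an assumption): serve from cache only
    // when the flag is on, no force refresh was requested, and an entry exists.
    private static bool ServesFromCache(bool flag, bool forceRefresh, bool cached) =>
        flag && !forceRefresh && cached;

    [Theory]
    [InlineData(false, false, false, false)] // flag disabled => always rebuild
    [InlineData(true, false, false, false)]  // enabled, not cached => rebuild, then cache
    [InlineData(true, false, true, true)]    // enabled, cached => cache hit
    [InlineData(true, true, false, false)]   // enabled, force refresh, not cached => rebuild
    [InlineData(true, true, true, false)]    // enabled, force refresh, cached => rebuild anyway
    public void ChangeFeedRequest_TakesExpectedLineagePath(
        bool featureFlagEnabled, bool forceRefresh, bool lineageCached, bool expectCacheHit)
    {
        Assert.Equal(expectCacheHit, ServesFromCache(featureFlagEnabled, forceRefresh, lineageCached));
    }
}
```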

Unit (Gated pipeline)

  • Compute Gateway only, not the SDK. However, if forced cache refresh is exposed to the SDK, that could open up the potential to increase test coverage.

Emulator (Gated pipeline)

  • Internal emulator that runs with Compute Gateway pipeline, not public emulator with SDK.

Performance/Benchmarking (Gated pipeline)

Security/Penetration (Gated pipeline)

  • Not required.

Concerns

Why would this flag be set? Because the collection might have been recreated (deleted and created again with the same name), so the partitions might be different.

If this is the case, a request to refresh partitions might fail with a 404 because the RID that we have does not exist anymore.

This is a tricky situation because there are 2 scenarios:

  • The request has no CollectionRID header: we use the collectionCache and the PKRange cache; if we need to refresh, then refreshing the collectionCache gives us the updated RID, and refreshing the PKRange cache gives us the new partitions.
  • The request has the CollectionRID header: we do not use the collectionCache. Refreshing the PKRange cache might still result in a 404 for this scenario, so what to do? I believe the SDKs have a retry policy that handles this scenario (https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RenameCollectionAwareClientRetryPolicy.java#L59-L61 and https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/RenameCollectionAwareClientRetryPolicy.cs#L85-L86), and it seems that if we return 404/1002, the client will refresh its own RID cache and retry. A sketch of that client-side condition follows.
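
For clarity, a minimal sketch of the client-side condition the linked retry policies key on; only the 404/1002 pairing comes from the comment and links above, and the constant names are assumptions.

```csharp
// Sketch of the client-side retry condition described above: a 404
// (NotFound) with substatus 1002 makes the SDK refresh its own collection
// RID cache and retry. Names are illustrative.
public static class RecreateHandlingSketch
{
    private const int NotFound = 404;
    private const int SubStatusHandledByRenamePolicy = 1002; // per the linked policies

    public static bool ShouldRefreshRidCacheAndRetry(int statusCode, int subStatusCode) =>
        statusCode == NotFound && subStatusCode == SubStatusHandledByRenamePolicy;
}
```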


Top GitHub Comments

ealsur commented, Jun 14, 2023 (2 reactions)

Food for thought

PartitionKeyRange response contains a Parent property but within that, there is no hierarchy.

For example: let’s say that PKRange 0 split into 1 and 2, then 2 split into 3 and 4, and 1 split into 5 and 6.

If I send a request to 0 (because the client was old) and I get a 410 because it’s gone and then I obtain the PartitionKeyRanges (TryOverlappingRanges) I would get PartitionKeyRange 3,4,5,6 where the Parents would be “0,2” for 3 and 4 and “0,1” for 5 and 6.

So after the 410, I would then attempt to route to, for example, 3 from the client.

If the cache has:

  • “0” Routes to “2” (some min/max LSN)
  • “0” Routes to “1” (some other min/max LSN)
  • “1” Routes to “5” (some min/max LSN)
  • “1” Routes to “6” (some other min/max LSN)
  • “2” Routes to “3” (some min/max LSN)
  • “2” Routes to “4” (some other min/max LSN)
  • “3” Routes to “3” (some min/max LSN)
  • “4” Routes to “4” (some min/max LSN)
  • “5” Routes to “5” (some min/max LSN)
  • “6” Routes to “6” (some min/max LSN)

I just wondered:

  • When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that, it needs to support going back several levels because maybe some of the partitions it would go to for the Archival also had a split?

  • If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live, how does it know that 2 and 1 were before and that 0 was before that if the Parents property only contains a list with no hierarchy level).

If the answer is that we would not support this grandfathering, that’s also ok, just knowing that this is a known gap.

philipthomas-MSFT commented, Jun 15, 2023 (1 reaction)

(Quoting @ealsur’s “Food for thought” comment above in full.)

cc @ealsur So I just want to correct something first. This is the correct route for your example. I did a ~strikethrough~ for the invalid ones.

  • ~“0” Routes to “2” (some min/max LSN)~
  • ~“0” Routes to “1” (some other min/max LSN)~
  • “0” Routes to “3” (some other min/max LSN)
  • ~“1” Routes to “5” (some min/max LSN)~
  • “1” Routes to “6” (some other min/max LSN)
  • ~“2” Routes to “3” (some min/max LSN)~
  • “2” Routes to “4” (some other min/max LSN)
  • “3” Routes to “3” (some min/max LSN)
  • “4” Routes to “4” (some min/max LSN)
  • “5” Routes to “5” (some min/max LSN)
  • “6” Routes to “6” (some min/max LSN)

“When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that, it needs to support going back several levels because maybe some of the partitions it would go to for the Archival also had a split?”

Irrespective of IncomingPartitionKeyRangeId, the IncomingLSN always tries to find the correct partition’s min/max LSN. So just because the request’s IncomingPartitionKeyRangeId is 3, the IncomingLSN determines where it actually routes to. If the IncomingLSN belongs to the IncomingPartitionKeyRangeId, that is just a normal passthrough. A sketch of that selection rule follows.
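
A minimal sketch of that rule: the IncomingLSN, not the IncomingPartitionKeyRangeId, picks the route whose [MinLsn, MaxLsn] window contains it. RouteItem is a stand-in type, and real routing also constrains candidates by the partition’s key range lineage, so treat this as illustrative.

```csharp
using System.Collections.Generic;

// Minimal stand-in for the ParentToRoutePartitionItems entries shown above.
public sealed record RouteItem(
    int CurrentPartitionKeyRangeId,
    int RouteToPartitionKeyRangeId,
    long MinLsn,
    long MaxLsn,
    bool UseArchivalPartition);

public static class DrainRouting
{
    public static RouteItem? SelectRoute(long incomingLsn, IEnumerable<RouteItem> candidates)
    {
        foreach (RouteItem item in candidates)
        {
            if (incomingLsn >= item.MinLsn && incomingLsn <= item.MaxLsn)
            {
                // Passthrough when the item routes to itself, i.e.
                // CurrentPartitionKeyRangeId == RouteToPartitionKeyRangeId.
                return item;
            }
        }

        return null; // no archival window matched; treat as a normal passthrough
    }
}
```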

“If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live, how does it know that 2 and 1 were before and that 0 was before that if the Parents property only contains a list with no hierarchy level).”

If you look at the tree example in this document, "UseArchivalPartition": true indicates that the partition was split and is a parent to some child; a live (leaf) partition has "UseArchivalPartition": false.
