Design: Query ServiceInterop dependency fallback

Currently the V3 SDK, when running on Windows x64, attempts to use ServiceInterop.dll for local query plan generation (reference https://docs.microsoft.com/en-us/azure/cosmos-db/sql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#use-local-query-plan-generation).

If the DLL or one of its dependencies is not present, the SDK throws:

Unhandled exception. System.DllNotFoundException: Unable to load DLL 'Microsoft.Azure.Cosmos.ServiceInterop.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
   at Microsoft.Azure.Documents.ServiceInteropWrapper.CreateServiceProvider(String configJsonString, IntPtr& serviceProvider)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.Initialize()
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.TryGetPartitionedQueryExecutionInfoInternal(String querySpecJsonString, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.TryGetPartitionedQueryExecutionInfo(String querySpecJsonString, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount)
   at Microsoft.Azure.Cosmos.CosmosQueryClientCore.TryGetPartitionedQueryExecutionInfoAsync(SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPlanHandler.TryGetQueryPlanAsync(SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, QueryFeatures supportedQueryFeatures, Boolean hasLogicalPartitionKey, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPlanRetriever.GetQueryPlanWithServiceInteropAsync(CosmosQueryClient queryClient, SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, Boolean hasLogicalPartitionKey, ITrace trace, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Query.Core.ExecutionContext.CosmosQueryExecutionContextFactory.TryCreateCoreContextAsync(DocumentContainer documentContainer, CosmosQueryContext cosmosQueryContext, InputParameters inputParameters, ITrace trace, CancellationToken cancellationToken)

This would normally be fixed by making sure the deployment process correctly copies all DLLs included in the SDK NuGet package.

The problem is that some customers have solutions that run on Windows x64 and cannot add the ServiceInterop DLL (for whatever reason). In those cases, they currently have no workaround.

There are two alternatives to address this:

Automatic fallback

If the DLL is not available, fall back to Gateway. To help customers understand the problem, leave a marker in the Diagnostics, but only when the application is running on Windows x64 (not on Linux or x86).

If the app runs on Windows x64 but the DLL cannot be used, a new node in the Diagnostics helps customers understand the added latency, and the SDK automatically falls back to Gateway.
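
The decision described above can be sketched in a language-neutral way. This is only an illustration of the proposed flow, not SDK code: `get_query_plan`, `QueryPlanResult`, `InteropUnavailableError`, and the two loader callables are hypothetical names introduced here, and the only detail taken from the proposal is the `"ServiceInterop unavailable"` datum.

```python
from dataclasses import dataclass, field
from typing import Callable


class InteropUnavailableError(Exception):
    """Hypothetical stand-in for the DllNotFoundException thrown today."""


@dataclass
class QueryPlanResult:
    source: str                                # "ServiceInterop" or "Gateway"
    plan: str                                  # opaque query plan payload
    data: dict = field(default_factory=dict)   # extra Diagnostics data, if any


def get_query_plan(
    is_windows_x64: bool,
    load_from_interop: Callable[[], str],
    load_from_gateway: Callable[[], str],
) -> QueryPlanResult:
    """Choose a query plan source and record why Gateway was used when relevant."""
    if not is_windows_x64:
        # Linux or x86: Gateway is the expected path, so nothing extra is recorded.
        return QueryPlanResult("Gateway", load_from_gateway())
    try:
        return QueryPlanResult("ServiceInterop", load_from_interop())
    except InteropUnavailableError:
        # Windows x64, but the DLL could not be loaded: fall back and say so.
        return QueryPlanResult(
            "Gateway",
            load_from_gateway(),
            {"ServiceInterop unavailable": "True"},
        )
```

The extra datum is emitted only on the Windows x64 failure path, which keeps Diagnostics unchanged on platforms where Gateway is already the normal route.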

Example of Diagnostics node that states that a Query Plan from Gateway was done due to Service Interop not being available:

{
	"name": "Gateway QueryPlan",
	"id": "1ffd15dd-6d0f-418d-93e6-9fbbd0d75065",
	"start time": "09:36:54:025",
	"duration in milliseconds": 27.1973,
	"data": {
		"ServiceInterop unavailable": "True"
	}
},
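
As a rough illustration of how automation might pick up such a node, the sketch below walks a Diagnostics tree and collects every node carrying the proposed datum. It assumes the Diagnostics have been serialized to JSON and that nested nodes live under a `children` key; the wrapping `Query` root node and the helper name `find_interop_fallbacks` are assumptions made for this example.

```python
import json


def find_interop_fallbacks(node):
    """Yield every trace node that reports ServiceInterop as unavailable."""
    data = node.get("data") or {}
    if str(data.get("ServiceInterop unavailable", "")).lower() == "true":
        yield node
    # "children" is an assumed key for nested trace nodes.
    for child in node.get("children", []):
        yield from find_interop_fallbacks(child)


# Hypothetical root node wrapping the example node from the proposal.
diagnostics = json.loads("""
{
  "name": "Query",
  "children": [
    {
      "name": "Gateway QueryPlan",
      "id": "1ffd15dd-6d0f-418d-93e6-9fbbd0d75065",
      "start time": "09:36:54:025",
      "duration in milliseconds": 27.1973,
      "data": {"ServiceInterop unavailable": "True"}
    }
  ]
}
""")

hits = list(find_interop_fallbacks(diagnostics))
for hit in hits:
    # Prints: Gateway QueryPlan 27.1973
    print(hit["name"], hit["duration in milliseconds"])
```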

PROs:

  • Customers are already used to leveraging Diagnostics; they can see the effect of the missing DLL on Windows.
  • Automatic analysis can discover it and produce insight.
  • No more exceptions

CONs:

  • Potentially bloats Diagnostics with an extra node for each Query.
  • Discovery is on the Diagnostics, not on an Exception that might be more visible.

Continue to throw but provide options

Provide a CosmosClientOptions configuration that users can leverage to select how they want to interact with ServiceInterop. Something similar to what we have in V2 ConnectionPolicy.QueryPlanGenerationMode, reference: https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.connectionpolicy.queryplangenerationmode?view=azure-dotnet#microsoft-azure-documents-client-connectionpolicy-queryplangenerationmode.

The default behavior would be to throw; users can opt in to automatically allow Gateway fallback.
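
A language-neutral sketch of this opt-in behavior follows. Every name in it (`InteropFallbackMode`, `ServiceInteropUnavailable`, `resolve_query_plan_source`) is hypothetical and does not reflect an actual `CosmosClientOptions` setting; it only shows throwing by default with an explicit opt-in to Gateway.

```python
from enum import Enum


class InteropFallbackMode(Enum):
    """Hypothetical option values; not an actual SDK setting."""
    THROW = "throw"                                # default: keep throwing
    ALLOW_GATEWAY_FALLBACK = "gateway_fallback"    # explicit opt-in


class ServiceInteropUnavailable(Exception):
    """Hypothetical stand-in for the DllNotFoundException thrown today."""


def resolve_query_plan_source(
    interop_available: bool,
    mode: InteropFallbackMode = InteropFallbackMode.THROW,
) -> str:
    """Return where the query plan would come from under the chosen mode."""
    if interop_available:
        return "ServiceInterop"
    if mode is InteropFallbackMode.ALLOW_GATEWAY_FALLBACK:
        return "Gateway"
    raise ServiceInteropUnavailable(
        "ServiceInterop could not be loaded and Gateway fallback is not enabled"
    )
```

Keeping `THROW` as the default preserves today's behavior for anyone who does not touch the setting, which is the trade-off this alternative makes.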

PROs:

  • Users can opt in to the automatic fallback explicitly.

CONs:

  • The configuration can be set even on environments that don’t apply (like Linux), so the question of “Should I change this setting or not?” will need explaining.
  • Documentation is required to explain which value is the correct one to use.

Related to https://github.com/Azure/azure-cosmos-dotnet-v3/issues/2366

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

ealsur commented, May 27, 2022

There are scenarios where this configuration option might not be viable, such as integrations like Logic Apps or Functions, where the customer simply has no access to the configuration.

Adding an option where the default is the current behavior could potentially break those customers without a workaround.

The implicit model is what the v2 SDK had. It caused a lot of CRIs because people did not know they should copy the service interop and other DLLs. It would fail silently in the background and cause high-latency CRIs. That is the reason the v3 SDK always verifies that it exists. Reverting to the implicit model seems like going back to a model that we know causes CRIs.

This is mainly because we did not expose any information about what was going on + V2 SDK had multiple bugs regarding Query Plan (like executing the Query on GW if Query Plan was obtained from GW).

I propose adding the Diagnostics to indicate why we went to GW for the Query Plan. Diagnostics can be analyzed by automation, and they can also be read and explained by users.

ealsur commented, May 31, 2022

I agree, I wouldn’t expect them to parse the diagnostics either. But the intent is the same as any other scenario where we use the Diagnostics: If the customer is experiencing higher than expected latency, they share the Diagnostics, we analyze and can say (like the cases where the problem is network latency), by looking at the extra Datum, if they are failing to load the DLL and that is the cause. Similar to what we would do analyzing Traces, but Traces is not a reliable source.
