Design: Query ServiceInterop dependency fallback
Currently the V3 SDK, when running on Windows x64, will attempt to use the ServiceInterop.DLL (reference: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#use-local-query-plan-generation).
If the DLL or one of its dependencies is not present, the SDK throws:
Unhandled exception. System.DllNotFoundException: Unable to load DLL 'Microsoft.Azure.Cosmos.ServiceInterop.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
at Microsoft.Azure.Documents.ServiceInteropWrapper.CreateServiceProvider(String configJsonString, IntPtr& serviceProvider)
at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.Initialize()
at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.TryGetPartitionedQueryExecutionInfoInternal(String querySpecJsonString, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount)
at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.TryGetPartitionedQueryExecutionInfo(String querySpecJsonString, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount)
at Microsoft.Azure.Cosmos.CosmosQueryClientCore.TryGetPartitionedQueryExecutionInfoAsync(SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPlanHandler.TryGetQueryPlanAsync(SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, QueryFeatures supportedQueryFeatures, Boolean hasLogicalPartitionKey, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPlanRetriever.GetQueryPlanWithServiceInteropAsync(CosmosQueryClient queryClient, SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, Boolean hasLogicalPartitionKey, ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Query.Core.ExecutionContext.CosmosQueryExecutionContextFactory.TryCreateCoreContextAsync(DocumentContainer documentContainer, CosmosQueryContext cosmosQueryContext, InputParameters inputParameters, ITrace trace, CancellationToken cancellationToken)
This would normally be fixed by making sure the deployment process correctly copies all DLLs included in the SDK NuGet package.
The problem is that some customers have solutions that run on Windows x64 and cannot add the ServiceInterop DLL (for whatever reason); in those cases, they currently have no workaround.
There are two alternatives to address this:
Automatic fallback
If the DLL is not available, fall back to Gateway. To help customers understand the problem, leave something in the Diagnostics, but only when the application is running on Windows x64 (not on Linux or x86).
If the app runs on Windows x64 but the DLL cannot be used, a new node in the Diagnostics will help customers understand the added latency, and we automatically fall back to Gateway.
Example of a Diagnostics node stating that the Query Plan was obtained from Gateway because ServiceInterop was not available:
{
  "name": "Gateway QueryPlan",
  "id": "1ffd15dd-6d0f-418d-93e6-9fbbd0d75065",
  "start time": "09:36:54:025",
  "duration in milliseconds": 27.1973,
  "data": {
    "ServiceInterop unavailable": "True"
  }
},
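The fallback flow described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the SDK's implementation: `ServiceInteropUnavailable`, `get_query_plan`, `local_plan`, `gateway_plan`, and `expects_service_interop` are all hypothetical names, and recording a node for every Gateway query plan (flagged only when the DLL failed to load) is an assumption based on the example node.

```python
import time
import uuid


class ServiceInteropUnavailable(Exception):
    """Hypothetical stand-in for the DllNotFoundException shown in the stack trace."""


def get_query_plan(local_plan, gateway_plan, expects_service_interop, trace):
    """Return a query plan, falling back to Gateway when the local path fails.

    local_plan and gateway_plan are hypothetical callables returning a plan.
    expects_service_interop is True only on Windows x64.
    trace is a list that collects Diagnostics-style nodes.
    """
    interop_missing = False
    if expects_service_interop:
        try:
            # Local query plan generation succeeded: no Gateway call, no node.
            return local_plan()
        except ServiceInteropUnavailable:
            # The DLL (or a dependency) could not be loaded: do not throw.
            interop_missing = True

    start = time.perf_counter()
    plan = gateway_plan()
    node = {
        "name": "Gateway QueryPlan",
        "id": str(uuid.uuid4()),
        "duration in milliseconds": (time.perf_counter() - start) * 1000.0,
        "data": {},
    }
    # Flag the node only where ServiceInterop was expected but unavailable,
    # so Linux and x86 runs are not marked.
    if interop_missing:
        node["data"]["ServiceInterop unavailable"] = "True"
    trace.append(node)
    return plan
```

With this shape, a caller on Linux or x86 passes `expects_service_interop=False` and gets a Gateway plan without the extra datum, while a Windows x64 caller with a missing DLL gets the same plan plus the flagged node.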
PROs:
- Customers are already used to leveraging Diagnostics; they can see the effect of the missing DLL on Windows.
- Automatic analysis can discover it and produce insight.
- No more exceptions
CONs:
- Potentially bloats Diagnostics with an extra node for each Query.
- Discovery happens through the Diagnostics rather than through an Exception, which might be more visible.
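As a rough illustration of how automated analysis could pick up the new node, the sketch below walks a Diagnostics JSON document and collects every "Gateway QueryPlan" node carrying the "ServiceInterop unavailable" datum. The surrounding document shape (a root node with a `children` array) and the function name `find_interop_fallbacks` are assumptions made for the example; only the inner node is taken from the issue.

```python
import json


def find_interop_fallbacks(node):
    """Yield each 'Gateway QueryPlan' node flagged with 'ServiceInterop unavailable'."""
    if isinstance(node, dict):
        data = node.get("data")
        if (node.get("name") == "Gateway QueryPlan"
                and isinstance(data, dict)
                and data.get("ServiceInterop unavailable") == "True"):
            yield node
        # Recurse into every value so the walk does not depend on key names.
        for value in node.values():
            yield from find_interop_fallbacks(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_interop_fallbacks(item)


# Hypothetical wrapper around the node from the issue; the real layout may differ.
diagnostics = json.loads("""
{
  "name": "Query",
  "children": [
    {
      "name": "Gateway QueryPlan",
      "id": "1ffd15dd-6d0f-418d-93e6-9fbbd0d75065",
      "start time": "09:36:54:025",
      "duration in milliseconds": 27.1973,
      "data": {
        "ServiceInterop unavailable": "True"
      }
    }
  ]
}
""")

hits = list(find_interop_fallbacks(diagnostics))
for hit in hits:
    print(hit["id"], hit["duration in milliseconds"])
```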
Continue to throw but provide options
Provide a CosmosClientOptions configuration that users can leverage to select how they want to interact with ServiceInterop. Something similar to what we have in V2 ConnectionPolicy.QueryPlanGenerationMode, reference: https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.connectionpolicy.queryplangenerationmode?view=azure-dotnet#microsoft-azure-documents-client-connectionpolicy-queryplangenerationmode.
The default behavior would be to throw; users can opt in to allow automatic Gateway fallback.
PROs:
- Users can opt in to the automatic fallback explicitly.
CONs:
- The configuration can be set even on environments that don’t apply (like Linux), so the question of “Should I change this setting or not?” will need explaining.
- Documentation is required to explain which value is the correct one to use.
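The opt-in behavior of this second alternative can be sketched as below. This is only an illustration in Python of the default-throw, opt-in-fallback semantics; `QueryPlanFallbackMode`, its members, `ServiceInteropUnavailable`, and `resolve_query_plan` are hypothetical names and do not correspond to an existing CosmosClientOptions setting.

```python
from enum import Enum


class QueryPlanFallbackMode(Enum):
    """Hypothetical option values; the names are placeholders."""
    THROW = "throw"  # default: keep the current behavior
    ALLOW_GATEWAY_FALLBACK = "allow_gateway_fallback"  # explicit opt-in


class ServiceInteropUnavailable(Exception):
    """Hypothetical stand-in for the DllNotFoundException shown in the stack trace."""


def resolve_query_plan(local_plan, gateway_plan, mode=QueryPlanFallbackMode.THROW):
    """Try the local query plan first; fall back only if the user opted in."""
    try:
        return local_plan()
    except ServiceInteropUnavailable:
        if mode is QueryPlanFallbackMode.ALLOW_GATEWAY_FALLBACK:
            return gateway_plan()
        # Default mode: surface the failure exactly as today.
        raise
```

The sketch also shows why the setting is meaningless where ServiceInterop does not apply: if `local_plan` is never attempted (for example on Linux), the mode is never consulted.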
Related to https://github.com/Azure/azure-cosmos-dotnet-v3/issues/2366
Top GitHub Comments
There are scenarios where taking this configuration option might not be viable: integrations like Logic Apps or Functions, where the customer simply has no access to it.
Adding an option where the default is the current behavior could potentially break those customers without a workaround.
This is mainly because we did not expose any information about what was going on + V2 SDK had multiple bugs regarding Query Plan (like executing the Query on GW if Query Plan was obtained from GW).
I propose adding the Diagnostics to indicate why we went to GW for the Query Plan; Diagnostics can be analyzed by automation, and they can also be read and explained by users.
I agree, I wouldn’t expect them to parse the diagnostics either. But the intent is the same as in any other scenario where we use the Diagnostics: if the customer is experiencing higher than expected latency, they share the Diagnostics, and we analyze them and can tell, by looking at the extra datum, whether they are failing to load the DLL and that is the cause (as in the cases where the problem is network latency). This is similar to what we would do when analyzing Traces, but Traces are not a reliable source.