[BUG] DataLakeFileClient.QueryAsync fails with "XML specified is not syntactically valid"
Library name and version
Azure.Storage.Files.DataLake 12.9.0
Describe the bug
I am using the DataLakeFileClient to query CSV files stored in an Azure Data Lake Gen2 account. After I updated some Azure SDK NuGet package references to their latest versions, my calls to QueryAsync started failing with an exception. Before the upgrade, all my calls to QueryAsync succeeded and returned the expected results from the CSV files.
The exact package and version changes were:
- Azure.Identity 1.3.0 -> 1.5.0
- Azure.Storage.Blobs 12.8.0 -> 12.11.0
- Azure.Storage.Queues 12.6.0 -> 12.9.0
- Azure.Storage.Files.DataLake 12.5.0 -> 12.9.0
I did some debugging and stepped through the Data Lake SDK code. It appears that not all of the options provided in the DataLakeQueryOptions parameter are included in the request to the blob API. I am calling QueryAsync with the options below:
var options = new DataLakeQueryOptions
{
    InputTextConfiguration = new DataLakeQueryCsvTextOptions
    {
        HasHeaders = true,
        ColumnSeparator = ",",
        RecordSeparator = "\n",
        QuotationCharacter = '"',
        EscapeCharacter = '"'
    },
    OutputTextConfiguration = new DataLakeQueryJsonTextOptions
    {
        RecordSeparator = "\n"
    }
};
DataLakeFileClient.QueryAsync transforms the input DataLakeQueryOptions object into a BlobQueryOptions object by calling DataLakeExtensions.ToBlobQueryOptions. The code path for handling DataLakeQueryCsvTextOptions eventually ends up in DataLakeExtensions.ToBlobQueryCsvTextConfiguration, which returns an incomplete object. The return type BlobQueryCsvTextOptions has 5 properties that match the 5 properties of DataLakeQueryCsvTextOptions, but the value of RecordSeparator is NOT copied over from the input object.
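To make the missing assignment concrete, here is a minimal sketch of a mapping that copies all five properties. It is not the SDK's actual source: the class name CsvOptionsMappingSketch and the method name ToBlobCsvOptions are hypothetical, and the sketch assumes the project references both Azure.Storage.Blobs and Azure.Storage.Files.DataLake. Only the public option types and their properties are taken from the libraries.

```csharp
using Azure.Storage.Blobs.Models;
using Azure.Storage.Files.DataLake.Models;

// Hypothetical helper for illustration only; the real conversion lives
// inside the Data Lake SDK and may be written differently.
public static class CsvOptionsMappingSketch
{
    public static BlobQueryCsvTextOptions ToBlobCsvOptions(DataLakeQueryCsvTextOptions source)
    {
        return new BlobQueryCsvTextOptions
        {
            ColumnSeparator = source.ColumnSeparator,
            QuotationCharacter = source.QuotationCharacter,
            EscapeCharacter = source.EscapeCharacter,
            HasHeaders = source.HasHeaders,
            // The assignment that the issue reports as missing in 12.9.0.
            RecordSeparator = source.RecordSeparator
        };
    }
}
```

With RecordSeparator carried across, the blob-level options hold the same five values that were supplied on the Data Lake options object.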
DataLakeExtensions.ToBlobQueryCsvTextConfiguration should be updated to copy over the RecordSeparator value. I confirmed this by manually copying my provided RecordSeparator to BlobQueryCsvTextOptions.RecordSeparator in the debugger. With RecordSeparator correctly set, the request to the blob API returned successfully with the expected query results.
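Until a fixed package is available, one possible workaround is to skip the Data Lake conversion and call the blob query API directly with a fully populated BlobQueryOptions. The sketch below is untested against this exact scenario. It assumes the same file is reachable through the account's blob endpoint (blob.core.windows.net) and that the credential has access there. The class name BlobQueryWorkaroundSketch and the method name QueryCsvAsync are hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

// Hypothetical workaround sketch: query the file through the blob endpoint
// so that RecordSeparator is set explicitly on the blob-level options.
public static class BlobQueryWorkaroundSketch
{
    public static async Task<BlobDownloadInfo> QueryCsvAsync(
        string accountName,      // assumption: the same storage account
        string containerName,    // assumption: container name equals the file system name
        string blobPath,
        string query,
        TokenCredential credential)
    {
        var serviceUri = new Uri($"https://{accountName}.blob.core.windows.net/");
        BlockBlobClient blob = new BlobServiceClient(serviceUri, credential)
            .GetBlobContainerClient(containerName)
            .GetBlockBlobClient(blobPath);

        var options = new BlobQueryOptions
        {
            InputTextConfiguration = new BlobQueryCsvTextOptions
            {
                HasHeaders = true,
                ColumnSeparator = ",",
                RecordSeparator = "\n",
                QuotationCharacter = '"',
                EscapeCharacter = '"'
            },
            OutputTextConfiguration = new BlobQueryJsonTextOptions
            {
                RecordSeparator = "\n"
            }
        };

        // Response<BlobDownloadInfo> converts implicitly to BlobDownloadInfo.
        return await blob.QueryAsync(query, options);
    }
}
```

This mirrors the call that DataLakeFileClient.QueryAsync makes internally according to the stack trace in this issue (BlockBlobClient.QueryAsync), but with the options built by hand.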
Expected behavior
QueryAsync should return a non-error response that contains the query results, as it did before I upgraded to the latest SDK versions.
Actual behavior
QueryAsync throws the below exception.
Exception message:
XML specified is not syntactically valid.
RequestId:4a20e5e0-801e-0016-4536-481cdc000000
Time:2022-04-04T15:16:27.6395672Z
Status: 400 (XML specified is not syntactically valid.)
ErrorCode: InvalidXmlDocument
Additional Information:
Reason: XML parsing in query blob content threw an exception.
Content:
<?xml version=\"1.0\" encoding=\"utf-8\"?>
<Error><Code>InvalidXmlDocument</Code><Message>XML specified is not syntactically valid.
RequestId:4a20e5e0-801e-0016-4536-481cdc000000
Time:2022-04-04T15:16:27.6395672Z</Message><Reason>XML parsing in query blob content threw an exception.</Reason></Error>
Headers:
Server: Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-error-code: InvalidXmlDocument
x-ms-request-id: 4a20e5e0-801e-0016-4536-481cdc000000
x-ms-version: 2021-04-10
x-ms-client-request-id: 01e1d90d-8e96-4199-b215-9f676bdf5adb
Date: Mon, 04 Apr 2022 15:16:27 GMT
Content-Length: 299
Content-Type: application/xml
Exception stack trace:
at Azure.Storage.Blobs.BlobRestClient.QueryAsync(String snapshot, Nullable`1 timeout, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, QueryRequest queryRequest, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.BlockBlobClient.QueryInternal(String querySqlExpression, BlobQueryOptions options, Boolean async, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.BlockBlobClient.QueryAsync(String querySqlExpression, BlobQueryOptions options, CancellationToken cancellationToken)
at Azure.Storage.Files.DataLake.DataLakeFileClient.QueryAsync(String querySqlExpression, DataLakeQueryOptions options, CancellationToken cancellationToken)
...
<my call stack removed>
...
at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.TaskOfActionResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Logged|12_1(ControllerActionInvoker invoker)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|20_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
at Microsoft.AspNetCore.Authorization.Policy.AuthorizationMiddlewareResultHandler.HandleAsync(RequestDelegate next, HttpContext context, AuthorizationPolicy policy, PolicyAuthorizationResult authorizeResult)
at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware.Invoke(HttpContext context)
Reproduction Steps
using System;
using Azure.Core;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

var blobAccountName = "<your account name here>";
var fileSystemName = "<your file system name here>";
var uri = new Uri($"https://{blobAccountName}.dfs.core.windows.net/");
TokenCredential tokenCredential = null; // Add your credentials here
var client = new DataLakeServiceClient(uri, tokenCredential).GetFileSystemClient(fileSystemName);
var path = "<path to a CSV file in your storage account here>";
var query = "SELECT * FROM BlobStorage LIMIT 25";
DataLakeFileClient file = client.GetFileClient(path);
var options = new DataLakeQueryOptions
{
    InputTextConfiguration = new DataLakeQueryCsvTextOptions
    {
        HasHeaders = true,
        ColumnSeparator = ",",
        RecordSeparator = "\n",
        QuotationCharacter = '"',
        EscapeCharacter = '"'
    },
    OutputTextConfiguration = new DataLakeQueryJsonTextOptions
    {
        RecordSeparator = "\n"
    }
};
// Throws RequestFailedException (400 InvalidXmlDocument) with 12.9.0
FileDownloadInfo result = await file.QueryAsync(query, options);
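For completeness, when the query succeeds the output can be read from the returned stream. The following is a minimal sketch, assuming JSON output with "\n" as the record separator as configured above. The class name QueryResultReaderSketch and the method name ReadRecordsAsync are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

// Hypothetical helper: runs the query and collects one string per output record.
public static class QueryResultReaderSketch
{
    public static async Task<List<string>> ReadRecordsAsync(
        DataLakeFileClient file,
        string query,
        DataLakeQueryOptions options)
    {
        FileDownloadInfo result = await file.QueryAsync(query, options);

        var records = new List<string>();
        // Content is a Stream containing the serialized query output.
        using var reader = new StreamReader(result.Content);
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            // Skip blank lines, such as one after a trailing separator.
            if (line.Length > 0)
            {
                records.Add(line);
            }
        }
        return records;
    }
}
```

Each entry in the returned list is then one JSON record that can be passed to a JSON parser of your choice.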
Environment
PS C:\Users\NPlotts> dotnet --info
.NET SDK (reflecting any global.json):
  Version: 6.0.101
  Commit: ef49f6213a
Runtime Environment:
  OS Name: Windows
  OS Version: 10.0.19044
  OS Platform: Windows
  RID: win10-x64
  Base Path: C:\Program Files\dotnet\sdk\6.0.101\
Host (useful for support):
  Version: 6.0.1
  Commit: 3a25a7f1cc
IDE and version: Seeing this in both JetBrains Rider 2021.3.4 and Microsoft Visual Studio Enterprise 2022 (64-bit) Version 17.0.5
Issue Analytics
- Created a year ago
- Comments:8 (2 by maintainers)
Top GitHub Comments
I am having a similar issue. My workaround was to load the CSV file and use pandas to filter for the matching values. It would be helpful if you could expedite a fix.
We just made the fix. It will be in our next release of the Azure.Storage.Files.DataLake package.
Apologies for the late fix and response. Next time we will make sure an issue like this is triaged properly so it doesn’t fall through the cracks.