[BUG] DataLakeFileClient.QueryAsync fails with "XML specified is not syntactically valid"

Library name and version

Azure.Storage.Files.DataLake 12.9.0

Describe the bug

I am using the DataLakeFileClient to query CSV files stored in an Azure Data Lake Gen2 account. However, I recently updated some Azure SDK nuget package references to their latest versions, and after the update my calls to QueryAsync are failing with an exception. Note that prior to the upgrade all my calls to QueryAsync were successful and returned the expected results from the CSV files.

The exact package and version changes were:

  • Azure.Identity 1.3.0 -> 1.5.0
  • Azure.Storage.Blobs 12.8.0 -> 12.11.0
  • Azure.Storage.Queues 12.6.0 -> 12.9.0
  • Azure.Storage.Files.DataLake 12.5.0 -> 12.9.0

I was able to do some debugging and step through the data lake SDK code, and it appears that not all of the options provided in the DataLakeQueryOptions parameter are being included in the request to the blob API. I am calling QueryAsync with the options below:

var options = new DataLakeQueryOptions
{
    InputTextConfiguration = new DataLakeQueryCsvTextOptions
    {
        HasHeaders = true,
        ColumnSeparator = ",",
        RecordSeparator = "\n",
        QuotationCharacter = '"',
        EscapeCharacter = '"'
    },
    OutputTextConfiguration = new DataLakeQueryJsonTextOptions
    {
        RecordSeparator = "\n"
    }
};

DataLakeFileClient.QueryAsync transforms the input DataLakeQueryOptions object into a BlobQueryOptions object by calling DataLakeExtensions.ToBlobQueryOptions. The code path for handling DataLakeQueryCsvTextOptions eventually ends up in DataLakeExtensions.ToBlobQueryCsvTextConfiguration, which returns an incomplete object. The return type, BlobQueryCsvTextOptions, has 5 properties that match the 5 properties of DataLakeQueryCsvTextOptions; however, the value for RecordSeparator is NOT copied over from the input object.

DataLakeExtensions.ToBlobQueryCsvTextConfiguration should be updated to copy over the RecordSeparator value. I was able to confirm this by manually copying my provided RecordSeparator to BlobQueryCsvTextOptions.RecordSeparator in the debugger. With the RecordSeparator correctly set, the request to the blob API returned successfully with the expected query results.
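
To illustrate the shape of the fix described above, here is a minimal sketch that uses stand-in types. SketchDataLakeCsvOptions, SketchBlobCsvOptions and SketchExtensions are hypothetical names invented for this example; they model only the five properties named in the report and are not the SDK's actual classes or its actual implementation.

```csharp
// Hypothetical stand-in for DataLakeQueryCsvTextOptions (assumption: only the
// five properties mentioned in the report are modelled).
public class SketchDataLakeCsvOptions
{
    public string ColumnSeparator { get; set; }
    public string RecordSeparator { get; set; }
    public char? QuotationCharacter { get; set; }
    public char? EscapeCharacter { get; set; }
    public bool HasHeaders { get; set; }
}

// Hypothetical stand-in for BlobQueryCsvTextOptions with the same five properties.
public class SketchBlobCsvOptions
{
    public string ColumnSeparator { get; set; }
    public string RecordSeparator { get; set; }
    public char? QuotationCharacter { get; set; }
    public char? EscapeCharacter { get; set; }
    public bool HasHeaders { get; set; }
}

public static class SketchExtensions
{
    // Copies every property from the Data Lake options to the blob options.
    public static SketchBlobCsvOptions ToBlobCsvOptions(SketchDataLakeCsvOptions source)
    {
        if (source == null)
        {
            return null;
        }

        return new SketchBlobCsvOptions
        {
            ColumnSeparator = source.ColumnSeparator,
            // The assignment the report says was missing in the real conversion.
            RecordSeparator = source.RecordSeparator,
            QuotationCharacter = source.QuotationCharacter,
            EscapeCharacter = source.EscapeCharacter,
            HasHeaders = source.HasHeaders
        };
    }
}
```

With all five assignments present, the converted object carries the same RecordSeparator the caller supplied, which matches what I observed when setting the value by hand in the debugger.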

Expected behavior

QueryAsync should return with a non-error response that contains the query results similar to before I upgraded to the latest SDK versions.

Actual behavior

QueryAsync throws the below exception.

Exception message:

XML specified is not syntactically valid.
RequestId:4a20e5e0-801e-0016-4536-481cdc000000
Time:2022-04-04T15:16:27.6395672Z
Status: 400 (XML specified is not syntactically valid.)
ErrorCode: InvalidXmlDocument

Additional Information:
Reason: XML parsing in query blob content threw an exception.

Content:
<?xml version=\"1.0\" encoding=\"utf-8\"?>
<Error><Code>InvalidXmlDocument</Code><Message>XML specified is not syntactically valid.
RequestId:4a20e5e0-801e-0016-4536-481cdc000000
Time:2022-04-04T15:16:27.6395672Z</Message><Reason>XML parsing in query blob content threw an exception.</Reason></Error>

Headers:
Server: Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-error-code: InvalidXmlDocument
x-ms-request-id: 4a20e5e0-801e-0016-4536-481cdc000000
x-ms-version: 2021-04-10
x-ms-client-request-id: 01e1d90d-8e96-4199-b215-9f676bdf5adb
Date: Mon, 04 Apr 2022 15:16:27 GMT
Content-Length: 299
Content-Type: application/xml

Exception stack trace:

at Azure.Storage.Blobs.BlobRestClient.QueryAsync(String snapshot, Nullable`1 timeout, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, QueryRequest queryRequest, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.BlockBlobClient.QueryInternal(String querySqlExpression, BlobQueryOptions options, Boolean async, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.BlockBlobClient.QueryAsync(String querySqlExpression, BlobQueryOptions options, CancellationToken cancellationToken)
at Azure.Storage.Files.DataLake.DataLakeFileClient.QueryAsync(String querySqlExpression, DataLakeQueryOptions options, CancellationToken cancellationToken)
...
<my call stack removed>
...
at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.TaskOfActionResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Logged|12_1(ControllerActionInvoker invoker)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|20_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
at Microsoft.AspNetCore.Authorization.Policy.AuthorizationMiddlewareResultHandler.HandleAsync(RequestDelegate next, HttpContext context, AuthorizationPolicy policy, PolicyAuthorizationResult authorizeResult)
at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware.Invoke(HttpContext context)

Reproduction Steps

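using System;
using Azure.Core;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

var blobAccountName = "<your account name here>";
var fileSystemName = "<your file system name here>";
var uri = new Uri($"https://{blobAccountName}.dfs.core.windows.net/");
TokenCredential tokenCredential = null; // Add your credentials here
var client = new DataLakeServiceClient(uri, tokenCredential).GetFileSystemClient(fileSystemName);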

var path = "<path to a CSV file in your storage account here>";
var query = "SELECT * FROM BlobStorage LIMIT 25";
DataLakeFileClient file = client.GetFileClient(path);
var options = new DataLakeQueryOptions
{
    InputTextConfiguration = new DataLakeQueryCsvTextOptions
    {
        HasHeaders = true,
        ColumnSeparator = ",",
        RecordSeparator = "\n",
        QuotationCharacter = '"',
        EscapeCharacter = '"'
    },
    OutputTextConfiguration = new DataLakeQueryJsonTextOptions
    {
        RecordSeparator = "\n"
    }
};

FileDownloadInfo result = await file.QueryAsync(query, options);

Environment

PS C:\Users\NPlotts> dotnet --info
.NET SDK (reflecting any global.json):
 Version: 6.0.101
 Commit: ef49f6213a

Runtime Environment:
 OS Name: Windows
 OS Version: 10.0.19044
 OS Platform: Windows
 RID: win10-x64
 Base Path: C:\Program Files\dotnet\sdk\6.0.101\

Host (useful for support):
 Version: 6.0.1
 Commit: 3a25a7f1cc

IDE and version: Seeing this in both JetBrains Rider 2021.3.4 and Microsoft Visual Studio Enterprise 2022 (64-bit) Version 17.0.5

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

4 reactions
lbalaji commented, Sep 20, 2022

Hi folks - @sumantmehtams Curious if there are any updates on this issue or suggested workarounds.

I am having a similar issue. My workaround was to load the CSV file and use pandas to filter the matching values. It would be helpful if you could expedite the fix.

2 reactions
amnguye commented, Oct 13, 2022

We just made the fix, it will be in our next release of the Azure.Storage.Files.DataLake package.

Apologies for the late fix and response. We will make sure that next time an issue like this is triaged properly so it doesn’t fall through the cracks.
