
Blob ContentHash appears not to be set/read for large files

See original GitHub issue

Describe the bug

Some files in my container appear to have an empty BlobItem.Properties.ContentHash. I assume the problem occurs on upload, but it is possible the issue is with reading files. Small files do not exhibit the problem; large files do. The threshold appears to be around 50K. It is possible the problem is actually a side effect of specifying an InitialTransferLength and MaximumTransferSize of 64K for upload/download in the StorageTransferOptions structure.

Note that there is a similar issue at https://github.com/Azure/azure-sdk-for-net/issues/14037, but it describes a different mechanism.

Expected behavior

I would expect the MD5 of a blob always to be available (why would you ever NOT want it set?). Even if not set by the client on upload, it should be calculated by the server when storing the blob.

Actual behavior (include Exception or Stack Trace)

When reading back the blob properties using BlobContainerClient.GetBlobsAsync, the returned BlobItem entries contain an empty array for ContentHash when the file is larger than some threshold (64K? MaximumTransferSize?).

To Reproduce

Blobs are stored using this code…

        public async Task StoreBlob(string containerName, string blobName, ImmutableArray<byte> bytes)
        {
            // GetBlobAsync is a local helper that resolves a BlockBlobClient; it must be awaited.
            BlockBlobClient blobReference = await GetBlobAsync(containerName, blobName).ConfigureAwait(false);

            using var stream = new MemoryStream(bytes.ToArray());
            var options = new BlobUploadOptions
            {
                TransferOptions = TransferOptions
            };
            await blobReference.UploadAsync(stream, options).ConfigureAwait(false);
        }

Blobs are listed using this code …

        public async Task<BlobItem[]> ListBlobs(string containerName)
        {
            BlobContainerClient client = FindContainer(containerName);

            var blobs = new List<BlobItem>();
            await foreach (BlobItem blob in client.GetBlobsAsync().ConfigureAwait(false))
                blobs.Add(blob);

            // note that the issue exhibits itself as empty Properties.ContentHash fields
            // for BlobItems describing large blobs
            return blobs.ToArray();
        }

where

        private static readonly StorageTransferOptions TransferOptions = new StorageTransferOptions
        {
            MaximumConcurrency = 20,
            InitialTransferLength = 0x10000, // 64K
            MaximumTransferSize = 0x10000    // 64K
        };

Environment:

  • Azure.Storage.Blobs 12.7.0
  • Windows 10
  • .NET Core 3.0

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
NeilMacMullen commented, Dec 22, 2020

@seanmcc-msft Thanks for the explanation - if I have understood correctly then the workaround is simply to leave InitialTransferLength at its default value?
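
If that reading is correct, the fix is just to omit InitialTransferLength from the options shown in the repro. A minimal sketch (SafeTransferOptions is a hypothetical name, not from the original code):

        // Same tuning as the repro, but InitialTransferLength is left at its default so
        // blobs below the single-shot limit are sent with one Put Blob call, for which
        // (per the explanation quoted below) the service computes and stores the Content-MD5.
        private static readonly StorageTransferOptions SafeTransferOptions = new StorageTransferOptions
        {
            MaximumConcurrency = 20,
            MaximumTransferSize = 0x10000
        };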

I appreciate the SDK is a work in progress, but the design here appears “less than optimal”. As far as I’m concerned as a user, I’ve just called “Upload”, and in some circumstances I get a hash and in some I don’t, even when using exactly the same client code! I can’t see how anyone would think this lack of consistency is a defensible API design. (This also shows in the linked issue, where you either do or don’t get a checksum depending on undocumented characteristics of the source stream.)

The SDK uses Put Blob to upload blobs of less than 256 MB. If the blob data is > 256 MB or InitialTransferLength is specified, Put Block and Put Block List are used instead.
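
Restated as pseudocode, the quoted dispatch rule amounts to something like this (an illustration of the rule as described, not the SDK's actual source; the variables are hypothetical):

        // Illustrative only: mirrors the rule quoted above, not real SDK code.
        bool useSinglePutBlob = !initialTransferLengthWasSet && blobSize <= 256L * 1024 * 1024;
        // Single Put Blob: the service computes and stores the blob's Content-MD5.
        // Put Block + Put Block List: no aggregate Content-MD5 is computed or stored.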

On the face of it this seems… not good. Based on criteria that are opaque and undiscoverable to the user, the SDK is not only deciding whether or not a checksum will be generated but also changing the semantics of the blob I’m storing. One minute I think I’m storing an immutable blob; the next, simply by setting a parameter in a structure I thought was tuning upload rate, I’ve made it modifiable in the future!

My strong suggestion would be to make the behaviour less flexible and more predictable. If the checksum can’t be reliably supplied by the server, make it explicit in the API that the client has to supply it.

FWIW, the reason I care about the checksum is that we treat a container as a virtual file-store, and we have code for synchronising subsets of that file-store to a local machine or another file-store. Being able to rely on the presence of a checksum is obviously a huge win when doing this, since we can avoid transferring blobs that are already present in the target.
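
As an illustration of that sync pattern, a minimal comparison helper might look like the following (hypothetical names; it presupposes that Properties.ContentHash is populated, which is exactly what this issue breaks):

        // Hypothetical helper: returns true when the local file's MD5 differs from the
        // blob's stored ContentHash, or when no hash is available and we must re-transfer.
        private static bool NeedsTransfer(string localPath, BlobItem remote)
        {
            byte[] remoteHash = remote.Properties.ContentHash;
            if (remoteHash == null || remoteHash.Length == 0)
                return true; // no server-side hash: cannot safely skip the transfer

            using var md5 = System.Security.Cryptography.MD5.Create();
            using var file = System.IO.File.OpenRead(localPath);
            byte[] localHash = md5.ComputeHash(file);
            return !System.Linq.Enumerable.SequenceEqual(localHash, remoteHash);
        }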

0 reactions
bhavin07 commented, Jan 7, 2021

To add to my previous comment.

The workaround is to calculate the content MD5 of your file, set it in BlobUploadOptions.HttpHeaders.ContentHash, and then call BlockBlobClient.Upload(stream, options).

I tried the above for a 2.9 GB file. I intentionally supplied a reversed MD5 checksum in BlobUploadOptions.HttpHeaders.ContentHash:

            // ...
            Array.Reverse(mD5Checksum);

            var blobContentInfo = await blobClient.UploadAsync(
                new FileStream(localFilePath, FileMode.Open, FileAccess.Read),
                new BlobUploadOptions
                {
                    HttpHeaders = new BlobHttpHeaders { ContentHash = mD5Checksum }
                });

            return (blobContentInfo.Value.VersionId, blobContentInfo.Value.ContentHash);

I was surprised to see two things:

  1. The upload was successful despite the wrong MD5 checksum being supplied.
  2. ContentHash was not returned in the response.

I am interested in ContentHash for upload integrity. If ContentHash is not reliable, what is the best way to check upload integrity?

@seankane-msft
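
For reference, a minimal sketch of the same workaround with the correct hash attached, plus a read-back of the stored value (hypothetical names such as localFilePath and blobClient; note that, per the experiment above, the service does not appear to validate this header on large uploads, so this records a hash rather than proving transfer integrity):

            // Compute the file's real MD5 up front.
            byte[] md5Checksum;
            using (var md5 = System.Security.Cryptography.MD5.Create())
            using (var stream = System.IO.File.OpenRead(localFilePath))
                md5Checksum = md5.ComputeHash(stream);

            // Attach it so the service stores it as the blob's Content-MD5.
            using (var upload = System.IO.File.OpenRead(localFilePath))
            {
                await blobClient.UploadAsync(upload, new BlobUploadOptions
                {
                    HttpHeaders = new BlobHttpHeaders { ContentHash = md5Checksum }
                }).ConfigureAwait(false);
            }

            // Read the properties back: ContentHash should now echo the supplied value.
            BlobProperties props = await blobClient.GetPropertiesAsync().ConfigureAwait(false);
            bool hashStored = props.ContentHash != null
                && System.Linq.Enumerable.SequenceEqual(props.ContentHash, md5Checksum);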
