Blob ContentHash appears not to be set/read for large files
Describe the bug
Some files in my container appear to have an empty BlobItem.Properties.ContentHash. I assume the problem occurs on upload, but it is possible the issue is with reading the blobs back. Small files do not exhibit the problem; large files do. The threshold appears to be around 50K. It is possible the problem is actually a side-effect of specifying an InitialTransferLength and MaximumTransferSize of 64K for upload/download in the StorageTransferOptions structure.
Note there is a similar issue at https://github.com/Azure/azure-sdk-for-net/issues/14037 but this describes a different mechanism.
Expected behavior
I would expect the MD5 of a blob always to be available (why would you ever NOT want it set?). Even if it is not set by the client on upload, it should be calculated by the server when storing the blob.
Actual behavior
When reading back the blob properties using BlobContainerClient.GetBlobsAsync, the returned BlobItem entries contain an empty array for ContentHash when the file is larger than some threshold (64K? MaximumTransferSize?).
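As a diagnostic sketch (not part of the original report), one way to check whether the hash is genuinely missing on the service, rather than merely absent from the listing, is to fetch the full properties of an affected blob. The blobContainerClient and blobName names below are placeholders, and the snippet assumes the Azure.Storage.Blobs and Azure.Storage.Blobs.Models namespaces are imported:

    // Sketch: compare the hash returned by a listing with the blob's full properties.
    // Assumes an existing BlobContainerClient (blobContainerClient) and a blob name (blobName).
    BlobClient blobClient = blobContainerClient.GetBlobClient(blobName);
    BlobProperties properties = await blobClient.GetPropertiesAsync().ConfigureAwait(false);

    bool hashMissing = properties.ContentHash == null || properties.ContentHash.Length == 0;
    Console.WriteLine(hashMissing
        ? $"{blobName}: no Content-MD5 stored on the service"
        : $"{blobName}: Content-MD5 = {Convert.ToBase64String(properties.ContentHash)}");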
To Reproduce
Blobs are stored using this code…
    public async Task StoreBlob(string containerName, string blobName, ImmutableArray<byte> bytes)
    {
        // GetBlobAsync is a local helper that resolves a BlockBlobClient for the container/blob.
        BlockBlobClient blobReference = GetBlobAsync(containerName, blobName);
        using var stream = new MemoryStream(bytes.ToArray());
        var options = new BlobUploadOptions
        {
            TransferOptions = TransferOptions
        };
        await blobReference.UploadAsync(stream, options).ConfigureAwait(false);
    }
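One possible mitigation (a sketch, not code from the thread) is to compute the MD5 on the client and attach it via BlobUploadOptions.HttpHeaders, so the blob carries a Content-MD5 regardless of how the transfer is chunked. The method name StoreBlobWithHash is hypothetical; it reuses the GetBlobAsync helper and TransferOptions field from above, and BlobHttpHeaders comes from Azure.Storage.Blobs.Models:

    // Sketch of a possible workaround: compute the MD5 client-side and attach it as the
    // blob's Content-MD5 so it is stored even when the upload is split into blocks.
    public async Task StoreBlobWithHash(string containerName, string blobName, ImmutableArray<byte> bytes)
    {
        byte[] payload = bytes.ToArray();

        byte[] md5;
        using (var hasher = System.Security.Cryptography.MD5.Create())
            md5 = hasher.ComputeHash(payload);

        BlockBlobClient blobReference = GetBlobAsync(containerName, blobName);
        using var stream = new MemoryStream(payload);
        var options = new BlobUploadOptions
        {
            TransferOptions = TransferOptions,
            HttpHeaders = new BlobHttpHeaders { ContentHash = md5 }
        };
        await blobReference.UploadAsync(stream, options).ConfigureAwait(false);
    }

Note that, as I understand it, a hash supplied this way on a multi-block upload is stored by the service rather than validated against the content, so it is only as trustworthy as the client that computed it.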
Blobs are listed using this code …
    public async Task<BlobItem[]> ListBlobs(string containerName)
    {
        // FindContainer is a local helper that resolves a BlobContainerClient.
        BlobContainerClient client = FindContainer(containerName);
        var blobList = client.GetBlobsAsync();
        var blobs = new List<BlobItem>();
        var tasks = blobList.GetAsyncEnumerator();
        while (await tasks.MoveNextAsync().ConfigureAwait(false))
            blobs.Add(tasks.Current);
        // Note: the issue exhibits itself as empty Properties.ContentHash fields for the
        // BlobItems describing large blobs.
        return blobs.ToArray();
    }
where
    private static readonly StorageTransferOptions TransferOptions = new StorageTransferOptions
    {
        MaximumConcurrency = 20,
        InitialTransferLength = 0x10000,
        MaximumTransferSize = 0x10000
    };
Environment:
- Azure.Storage.Blobs 12.7.0
- Windows 10
- .NET Core 3.0
Top GitHub Comments
@seanmcc-msft Thanks for the explanation - if I have understood correctly then the workaround is simply to leave InitialTransferLength at its default value?
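If that reading is right, the adjusted options would look something like the following (a sketch of the suggested workaround, not code from the thread):

    // Sketch: keep the concurrency/chunk-size tuning but leave InitialTransferLength at its
    // default, per the suggested workaround, so single-shot uploads can still carry an MD5.
    private static readonly StorageTransferOptions TransferOptions = new StorageTransferOptions
    {
        MaximumConcurrency = 20,
        MaximumTransferSize = 0x10000
        // InitialTransferLength deliberately not set
    };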
I appreciate the SDK is a work in progress, but the design here appears “less than optimal”. As far as I’m concerned as a user, I’ve just called “Upload”, and in some circumstances I get a hash and in some cases I don’t, even when using exactly the same client code! I can’t see how anyone would think this lack of consistency is a defensible API design. (This also shows in the linked issue, where you either get or don’t get a checksum depending on undocumented characteristics of the source stream.)
On the face of it this seems… not good. Based on criteria that are opaque and undiscoverable to the user, the SDK is not only deciding whether or not a checksum will be generated but also changing the semantics of the blob I’m storing. One minute I think I’m storing an immutable blob, and then, simply by setting a parameter in a structure that I thought was tuning upload rate, I’ve made it modifiable in the future!
My strong suggestion would be to make the behaviour less flexible and more predictable. If the checksum can’t be reliably supplied by the server then make it explicit in the API that the client has to supply it.
FWIW the reason that I care about the checksum is that we treat a container as a virtual file-store. We have code for synchronising subsets of that file-store to a local machine or another file-store. Being able to rely on the presence of a checksum is obviously a huge win when doing this since we can avoid transferring blobs which are already present in the target.
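For that synchronisation scenario, the comparison described above might look roughly like this (an illustrative sketch; localPath is a hypothetical path, blobItem comes from a listing such as the ListBlobs method earlier, and it assumes ContentHash is actually populated; System.IO, System.Linq and System.Security.Cryptography are needed):

    // Sketch: skip transferring a blob when the local file already matches its Content-MD5.
    static bool LocalCopyIsCurrent(string localPath, BlobItem blobItem)
    {
        byte[] remoteHash = blobItem.Properties.ContentHash;
        if (remoteHash == null || remoteHash.Length == 0)
            return false; // no reliable hash: fall back to transferring the blob

        if (!File.Exists(localPath))
            return false;

        using var md5 = System.Security.Cryptography.MD5.Create();
        using var file = File.OpenRead(localPath);
        byte[] localHash = md5.ComputeHash(file);

        return localHash.SequenceEqual(remoteHash);
    }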
To add to my previous comment.
I tried the above for a 2.9 GB file. I intentionally supplied a reversed md5Checksum in BlobUploadOptions.HttpHeaders.ContentHash.
I was surprised to see 2 things:
I am interested in ContentHash for upload integrity. If ContentHash is not reliable, what is the best way to verify upload integrity?
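One approach that does not depend on the service-maintained Content-MD5 (a sketch under the assumption that you control both the upload and download paths) is to carry a client-computed hash in blob metadata and re-check it after download. The metadata key "md5" and the method names are arbitrary choices, and the upload stream is assumed to be seekable:

    // Sketch: carry a client-computed MD5 in blob metadata and verify it after download,
    // independent of whether the service populated Content-MD5. The "md5" key is arbitrary.
    public static async Task UploadWithOwnHashAsync(BlobClient blob, Stream content)
    {
        using var md5 = System.Security.Cryptography.MD5.Create();
        string hash = Convert.ToBase64String(md5.ComputeHash(content));
        content.Position = 0; // assumes a seekable stream

        var options = new BlobUploadOptions
        {
            Metadata = new Dictionary<string, string> { ["md5"] = hash }
        };
        await blob.UploadAsync(content, options).ConfigureAwait(false);
    }

    public static async Task<bool> DownloadAndVerifyAsync(BlobClient blob, string targetPath)
    {
        await blob.DownloadToAsync(targetPath).ConfigureAwait(false);

        BlobProperties properties = await blob.GetPropertiesAsync().ConfigureAwait(false);
        if (!properties.Metadata.TryGetValue("md5", out string expected))
            return false; // nothing to verify against

        using var md5 = System.Security.Cryptography.MD5.Create();
        using var file = File.OpenRead(targetPath);
        return Convert.ToBase64String(md5.ComputeHash(file)) == expected;
    }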
@seankane-msft