[BUG] Blob Trigger Scan continuously firing, and processing the same blobs
See original GitHub issueLibrary name and version
Azure.Storage.Blobs 12.15.1
Describe the bug
I have a blob trigger set up to montitor one of my containers in Azure Storage. The associated scan over the blobs is firing non-stop, looping over the same blobs again and again, causing excessive operations on the storage account.
Expected behavior
Would expect blobs already processed to be skipped.
Actual behavior
A scan is firing immediately after the last one has finished. I’m not sure if that’s expected behaviour, but what is not expected, is that the same blobs are being detected by the scan over and over again.
I have noticed that at the start of each scan, this log message is printed to the function logs:
[Verbose] Poll for blobs newer than '0001-01-01T00:00:00.000' in container '<container>' with ClientRequestId 'xxx' found 1839 blobs in 302 ms. ContinuationToken: False
Comparing with other blob triggers, I have discovered that the relevant scaninfo file seems to not have been created/updated. If I manually create the scaninfo file with a date/time, then the next scan will poll for blobs newer than that time, but the scaninfo file is then not updated again automatically.
Blob receipts are being created, and each blob is processed with a message similar to the following:
[Verbose] Blob '<blob name>' will be skipped for function 'FillBlobMetaData' because this blob with ETag '"0x8D8798F7B192688"' has already been processed. PollId: '566d148e-11d0-4a80-86a0-c7d4cfab3b58'. Source: 'ContainerScan'.
Reproduction Steps
Enable my function within the function app on the Azure Portal.
Environment
.NET 6.0 app Azure Functions, runtime version 4.16.5.20396
Microsoft.Azure.WebJobs.Extensions.Storage.Blobs v5.1.1
Issue Analytics
- State:
- Created 6 months ago
- Reactions:2
- Comments:7 (2 by maintainers)
Top GitHub Comments
+1 to this problem, also applies to Node.js based functions
I encountered the same issue with my dotnet function. I eventually tracked it down to the
ScanBlobScanLogHybridPollingStrategy.PollNewBlobsAsync
method.This was causing millions of transactions per day
The blob paged api has a page continuation token value to determine if the current page is the last page. That token is an empty string for the last page, but the code was checking against
null
. This resulted in the date time never being assigned and was always DateTime.Min. Thus every blob get checked again and again against that dateI noticed someone else fixed the issue here: https://github.com/Azure/azure-sdk-for-net/commit/a69839f4a33ab39a7d1441e36415f08e00b76ff0
I think Microsoft.Azure.WebJobs.Extensions.Storage.Blobs 5.1.2 introduced the fix
edit: After switching to the latest version (5.1.3) I still found this polling mechanism continuously checks the most recent batch of files due to these lines:
https://github.com/Azure/azure-sdk-for-net/blob/c4fb4c52117622c2107163ba6d3efbb20743836d/sdk/storage/Microsoft.Azure.WebJobs.Extensions.Storage.Blobs/src/Listeners/ScanBlobScanLogHybridPollingStrategy.cs#L265-L268
Above
CurrentSweepCycleLatestModified
is set to the most recent last modified date. This becomes theLastSweepCycleLatestModified
afterwards.If there are no new files coming in, then
LastSweepCycleLatestModified
remains assigned to the last modified date of the most recent file - still unchanged.As a result, those most recent files will be checked over and over again, continuously:
https://github.com/Azure/azure-sdk-for-net/blob/c4fb4c52117622c2107163ba6d3efbb20743836d/sdk/storage/Microsoft.Azure.WebJobs.Extensions.Storage.Blobs/src/Listeners/ScanBlobScanLogHybridPollingStrategy.cs#L270-L275
According to the comment earlier about rounding, it seems like you cant really mess with the logic.
In my case I had copied over a batch of 60 files at the same time (out of ~700 in the container), and they were continuously being checked every 10 seconds (
PollingInterval
)