CosmosDB trigger loses document if function app crashes
If the function app crashes (is stopped/killed) while processing a document/batch (using the CosmosDB trigger change feed), the trigger continues with the next document/batch once the app is up again, effectively losing that batch.
For mission-critical operations this is not acceptable, since there is currently no way to handle or even detect the lost batch.
Repro steps
- Start a function with the Cosmos DB trigger (a minimal example function is sketched after these steps)
- Update a document in cosmos
- End the function process (kill it) while it’s processing the event
- Start it again
- Notice that it does not retry the same document again.
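For reference, a minimal trigger function along the lines of what the repro assumes. This is a sketch using the v3 extension API; the database/collection names, the `CosmosDBConnection` setting, and the artificial delay are placeholders, not part of the original report.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ChangeFeedFunction
{
    // Minimal CosmosDB trigger (extension v3.x). "MyDatabase", "Items", "leases"
    // and "CosmosDBConnection" are placeholder names; substitute your own.
    [FunctionName("ChangeFeedFunction")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "MyDatabase",
            collectionName: "Items",
            ConnectionStringSetting = "CosmosDBConnection",
            LeaseCollectionName = "leases",
            CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> documents,
        ILogger log)
    {
        // Simulate slow processing so the process can be killed mid-batch (repro step 3).
        System.Threading.Thread.Sleep(30_000);
        log.LogInformation("Processed {Count} document(s)", documents.Count);
    }
}
```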
Expected behavior
The trigger should deliver the same document again until the function invocation completes successfully.
A good middle ground would be to behave the way the Event Hub trigger does (the Event Hub trigger re-delivers the same event only if the app does not respond at all).
Without knowing the internals, it seems to me that the trigger should only update the change feed checkpoint after the function has completed, not before (which appears to be what happens today).
Actual behavior
The function does not receive the same document event again; instead, the next document change is delivered, so the original change is lost.
Known workarounds
It seems that if you add a retry policy such as [FixedDelayRetry] to your function, the checkpoint is kept properly. This works at least when you Ctrl-C the app; see the sketch below.
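A sketch of that workaround, assuming a Functions host/SDK version that supports retry policies; the retry count, delay, and binding names are illustrative only.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ChangeFeedWithRetry
{
    // FixedDelayRetry(maxRetryCount, delayInterval) retries the invocation before
    // the trigger advances; per the workaround above, it also appears to preserve
    // the checkpoint when the host is stopped mid-batch (at least via Ctrl-C).
    [FunctionName("ChangeFeedWithRetry")]
    [FixedDelayRetry(5, "00:00:10")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "MyDatabase",
            collectionName: "Items",
            ConnectionStringSetting = "CosmosDBConnection",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> documents,
        ILogger log)
    {
        log.LogInformation("Processing {Count} document(s)", documents.Count);
        // ... your processing logic ...
    }
}
```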
Related information
Microsoft.Azure.WebJobs.Extensions.CosmosDB 3.0.10
Top GitHub Comments
@ealsur you are right about Cosmos and deletes, I forgot since we always use soft delete with TTL when we need to record deletes in the ChangeFeed.
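For context, a hypothetical sketch of that soft-delete-with-TTL pattern (the property names and TTL value are assumptions, not from the original comment): instead of deleting an item outright (deletes never appear in the change feed), mark it as deleted and set a per-item TTL so Cosmos DB purges it later, while the update itself flows through the change feed.

```csharp
using Newtonsoft.Json;

public class Item
{
    [JsonProperty("id")]
    public string Id { get; set; }

    // Assumed application-level flag that change feed consumers treat as a delete.
    [JsonProperty("deleted")]
    public bool Deleted { get; set; }

    // Cosmos DB per-item TTL in seconds; the container must have TTL enabled.
    [JsonProperty("ttl", NullValueHandling = NullValueHandling.Ignore)]
    public int? Ttl { get; set; }
}

// "Soft delete": the replace shows up in the change feed, then the item expires.
// item.Deleted = true;
// item.Ttl = 60;
// await container.ReplaceItemAsync(item, item.Id, new PartitionKey(item.Id));
```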
Having to call the management API just to disable a function, I think, defeats the purpose of Functions and Triggers being easy and fast to develop and use. I’m not saying it can’t be done, but it’s a lot of hoops to jump through for something that should be built in.
The circuit breaker pattern is old and, if you ask me, it should have been implemented in the Functions runtime, at least as an optional configuration. Now it seems the team that writes the triggers has to balance protecting unwitting developers from runaway cost on consumption plans against delivering resilient and easy-to-use features.
I don’t think it is possible to code custom logic that makes a function disable itself via the management API, because such logic will have to sit in the outermost try-catch to be robust. That try-catch is in the Functions runtime, so it would need to be implemented in the runtime itself.