Large import failing mysteriously

See original GitHub issue

Setup:

  • .NET Core 3.1 + Microsoft.Azure.Cosmos 3.15.0
  • Azure Function with a BlobTrigger; the function listens for a ~0.5 GB blob and imports around 2M documents to Cosmos DB
  • Azure Function running under the Premium plan, so there is no timeout (always running)
  • Application Insights is integrated with the Azure Function

This is not a static Azure Function. There are known issues with DI in Azure Functions V3, so I am using my own DI implementation and only one instance of CosmosClient (a singleton).
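For context, a minimal hand-rolled singleton for the client could look roughly like this (a sketch only; CosmosClientHolder and the "CosmosConnectionString" setting name are illustrative and not taken from my actual code):

using System;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Fluent;

// Minimal hand-rolled singleton: one CosmosClient shared by every function invocation.
public static class CosmosClientHolder
{
    private static readonly Lazy<CosmosClient> _client = new Lazy<CosmosClient>(() =>
        new CosmosClientBuilder(Environment.GetEnvironmentVariable("CosmosConnectionString"))
            .WithBulkExecution(true)
            .WithConnectionModeDirect()
            .Build());

    public static CosmosClient Instance => _client.Value;
}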

Some SDK performance tuning. Current settings: bulk execution is on, retry options are left at the default, and EnableContentResponseOnWrite = false.

Cosmos DB throughput is autoscale with a 20K RU/s maximum. The Function App runs on the Premium plan with sufficient nodes/CPU.

Issues/Observations

  • The Azure Function in the Azure cloud quietly dies / becomes unresponsive / times out (no Application Insights logs) even though I log explicitly. See the code below.
  • When the Azure Function is run locally, it throws 408s (high CPU observed locally), so that case is understood.
  • CPU/memory on Azure was high during the import, but after scaling up and out it stays within limits.
  • Cosmos DB metrics show 429s, and the max RU consumption was around 20K. So it looks like the bulk import is hitting too hard? It is not auto-throttling. How can we find out what batch size it is sending? How can we throttle it? How can we see what is going on under the hood?
  • This is very unpredictable: some imports succeed while others fail.

I cannot provide the complete code, but here are the snippets under discussion.

Code:

CosmosClient client = clientBuilder
    .WithBulkExecution(true)
    .WithConnectionModeDirect()
    .Build();

Here’s my bulk import code:

List<Task> concurrentTasks = new List<Task>();
foreach (var contact in contacts)
{
    concurrentTasks.Add(
        cosmosClient_registry.UpdateItemAsync(contact.contactnumber, contact)
            .ContinueWith(async item => await AddImportMetadata(item, import)));
}
await Task.WhenAll(concurrentTasks);

Here’s my logging code:

private async Task AddImportMetadata(Task<ItemResponse<Contact>> task, ContactImport import)
{
    // Note: lockObj is created per call, so this lock does not actually
    // synchronize the concurrent tasks that update `import`.
    object lockObj = new object();
    lock (lockObj)
    {
        try
        {
            if (task.IsCompletedSuccessfully)
            {
                var response = task.Result;
                import.TotalRequestUnitsUsed += response.RequestCharge;
                import.TotalRecordsProcessed += 1;
            }
            else
            {
                log.LogError("Error saving Record # " + import.TotalRecordsProcessed);
                AggregateException aggregateException = task.Exception;
                foreach (var exception in aggregateException.InnerExceptions)
                {
                    log.LogError($"Record Error {exception.Message} with id {task.Id}");
                }
            }
        }
        catch (Exception ex)
        {
            log.LogError("Error saving Record # " + import.TotalRecordsProcessed);
            log.LogError("Record Error: " + ex.Message);
        }
    }
}

Update code:

public async Task<ItemResponse<T>> UpdateItemAsync(string id, T item)
{
    ItemRequestOptions itemRequestOptions = new ItemRequestOptions() { EnableContentResponseOnWrite = false };
    ItemResponse<T> itemResponse = await this._container.UpsertItemAsync<T>(item, new PartitionKey(id), itemRequestOptions);
    itemResponse.Diagnostics.ToString(); // the diagnostics string is built here but never logged or stored
    return itemResponse;
}

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
ealsur commented on Dec 17, 2020

Application Insights won’t log TCP requests as far as I know; it only automatically logs HTTP requests.

Why you are getting 429s is not something I can tell from the SDK’s perspective: the library does not generate these 429s, it only receives them from the service. If they show up in the metrics, it means the client is receiving them and retrying based on your configuration. The client cannot change the backend’s behavior; if the backend returns 429 (for whatever reason), the client needs to retry.

I don’t know the Autoscale semantics regarding RU distribution; maybe @ThomasWeiss has more insight.

If you are trying to insert 2M items, and assuming a single item consumes the lowest possible charge (about 5 RU for a ~1 KB item), you are looking at roughly 10M RU if you were to send them all at the same time. With 30K RU/s it could potentially process about 6K items per second without throttling (this is an estimation and assumes a partition key with high cardinality), but if the volume of data per second is higher, you’d get throttled.
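A rough back-of-the-envelope version of that estimate (just a sketch; the 5 RU per item and 30K RU/s figures are the assumptions stated above):

// Throttle-free throughput estimate using the numbers discussed above.
const int totalItems = 2_000_000;          // ~2M documents in the blob
const double ruPerWrite = 5;               // assumed lowest charge for a ~1 KB upsert
const double provisionedRuPerSec = 30_000; // autoscale maximum in this example

double totalRu = totalItems * ruPerWrite;                 // ~10,000,000 RU in total
double itemsPerSecond = provisionedRuPerSec / ruPerWrite; // ~6,000 items/s without 429s
double bestCaseSeconds = totalRu / provisionedRuPerSec;   // ~333 s if never throttled
Console.WriteLine($"{itemsPerSecond:N0} items/s max, ~{bestCaseSeconds:N0} s best case");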

I would increase the retry count (to 100, for example) if the provisioned RU/s is well below what the volume of data needs.
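For example (a sketch only; connectionString is a placeholder, and 100 retries over a 10-minute window are illustrative values, not numbers tuned to this workload):

using System;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Fluent;

// Same builder as in the issue, but allowing many retries on 429s instead of one.
CosmosClient client = new CosmosClientBuilder(connectionString)
    .WithBulkExecution(true)
    .WithConnectionModeDirect()
    .WithThrottlingRetryOptions(TimeSpan.FromMinutes(10), 100) // max wait window, max retry attempts
    .Build();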

1 reaction
ealsur commented on Dec 17, 2020

Have you contacted Azure Functions support? The fact that the environment shuts down should be investigated. What I have seen, depending on the Functions tier you are on, is that if you violate any of the limits, the execution is terminated.

A few things are wrong with the code:

  1. Why are you configuring only a single retry on 429s?

.WithThrottlingRetryOptions(TimeSpan.FromMinutes(10),1)

That configuration says that your maximum retry period is 10 minutes, but within those 10 minutes you want to retry only once. So if you get two 429s within 30 seconds, the exception will be thrown. In my experience, if your provisioned RU is well below the volume of data you are trying to save, you should instead use a high number of retries, because you will get throttled.

  2. Your task continuation is modifying a shared ContactImport instance, incrementing and mutating its values. Be aware that these tasks execute in parallel, so there will be concurrency, and your code is not guarding against that (see the sketch after this list).

  3. Is your Function code (outside of the code you shared) capturing unhandled exceptions with a global try/catch, in case any of your logic is failing?

  4. Also, your ContinueWith is not observing the Task state. Instead of merely testing for an exception, verify task.IsCompletedSuccessfully, and keep in mind that exceptions can be AggregateExceptions, in which case you’d need to Flatten them and find the inner one.
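Putting points 2 and 4 together, one way to sidestep both issues is to await each operation directly instead of using ContinueWith, and to fold the totals into ContactImport only after Task.WhenAll completes. This is only a sketch reusing the identifiers from the snippets above (Contact, ContactImport, cosmosClient_registry, log, import, contacts) and assuming a using System.Linq directive; it is not the author's actual code:

// Await the upsert directly, observe failures, and return the request charge
// so nothing mutates the shared ContactImport while the tasks are in flight.
private async Task<(bool Success, double RequestCharge)> ImportOneAsync(Contact contact)
{
    try
    {
        ItemResponse<Contact> response =
            await cosmosClient_registry.UpdateItemAsync(contact.contactnumber, contact);
        return (true, response.RequestCharge);
    }
    catch (CosmosException ex)
    {
        log.LogError($"Record error {ex.StatusCode} for contact {contact.contactnumber}: {ex.Message}");
        return (false, 0);
    }
}

// Caller: run all tasks, then aggregate the results on a single thread.
var results = await Task.WhenAll(contacts.Select(ImportOneAsync));
import.TotalRequestUnitsUsed += results.Sum(r => r.RequestCharge);
import.TotalRecordsProcessed += results.Count(r => r.Success);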

Regarding your question on Autoscale: Bulk will send the operations as fast as it can; it is meant to exhaust the available throughput. If your provisioned throughput is not enough to handle the volume of data, you should either increase the available throughput or reduce the volume of data you process at a given time. The SDK does not interact in any particular way with Autoscale, and the Autoscale mechanism that increases the available RU might kick in independently. From the SDK we will receive 429s and retry on them, and that’s it.
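If reducing the volume processed at a given time is the route taken, one simple option (a sketch only; the batch size is an arbitrary illustration, ImportOneAsync is the hypothetical helper sketched above, and contacts is assumed to be a list) is to slice the contacts into fixed-size batches and await each batch before starting the next:

// Only `batchSize` upserts are in flight at once; tune batchSize to the provisioned RU/s.
const int batchSize = 10_000; // illustrative value, not a recommendation

for (int offset = 0; offset < contacts.Count; offset += batchSize)
{
    var batch = contacts.Skip(offset).Take(batchSize);
    await Task.WhenAll(batch.Select(ImportOneAsync));
}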
