MAJOR Perf Regression Using The Same Binaries From Ubuntu 20.04 -> 22.04
Describe the bug
I'm running the latest NuGet version, I just upgraded to dotnet7, and all systems are 100% updated. I have a service that handles active server connections, so when it restarts to update, all of the currently connected servers reconnect. Usually, that's somewhere between 4k and 6k connections trying to reconnect within 10-30 seconds. When a client reconnects, I need to look up its information in my CosmosDB database, so I do a simple query on an indexed unique field representing the client.
Results
Running the exact same binaries under dotnet 7:
- On Ubuntu 20.04: The ~4k connections take about 20 seconds to fully reconnect, and my 4 core VM hovers around 20% CPU usage.
- On Ubuntu 22.04: The ~4k connections take about 5 minutes to fully reconnect, and my 4 core VM is maxed out at around 100% CPU usage.
This is a really bad regression, and it must be related to the OS change. Again, I'm using the exact same compiled binaries of the server and CosmosDB SDK. The two VMs are set up identically; the only difference is the OS version. I noticed this issue after upgrading to Ubuntu 22.04, when my server got really slow. To validate it, I then created a new VM using the old OS, so I can switch traffic between the two.
I know that the perf regression is in CosmosDB, because the lookup call is what's taking the majority of the time and seems to be the cause of the 100% CPU load. That seems really suspicious: I get that a large influx of lookup queries all at once might max out the DB, but I wouldn't expect it to max out the client-side CPU. All of the pending client-side tasks should just be awaiting the DB response.
To Reproduce
Breaking down my code, it's basically equivalent to calling this small pseudocode block in a concurrent loop about 3-4k times in 20 seconds.
    var container = Db.GetOctoClientMetaContainer();
    try
    {
        // Query by the indexed unique id field for this client.
        var it = container.GetItemQueryIterator<OctoClientMeta>($"SELECT * FROM OctoClients u WHERE u.id='{id.ToUpper()}'");
        while (it.HasMoreResults)
        {
            foreach (var client in await it.ReadNextAsync())
            {
                // Return the first matching client.
                return new UpdateResult(true, client);
            }
        }
    }
    catch (Exception e)
    {
        ...
    }
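The "concurrent loop" described above could be driven by a small harness like the following. This is only a sketch, not the reporter's actual code: `ReconnectBurst`, `RunAsync`, and the `CLIENT-{i}` ids are hypothetical names, and `Task.Delay` stands in for the Cosmos DB lookup so the example runs without a database. It also starts every lookup at once, whereas the real reconnects arrive spread over 10-30 seconds.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class ReconnectBurst
{
    // Hypothetical harness: starts one lookup per simulated client, all
    // concurrently, and returns the wall-clock time until every lookup completes.
    public static async Task<TimeSpan> RunAsync(int clientCount, Func<string, Task> lookup)
    {
        var stopwatch = Stopwatch.StartNew();

        // ToArray() forces the lazy Select to run, so all tasks are started
        // before we begin awaiting any of them.
        Task[] tasks = Enumerable.Range(0, clientCount)
            .Select(i => lookup($"CLIENT-{i}"))
            .ToArray();

        await Task.WhenAll(tasks);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static async Task Main()
    {
        // Assumption: a 50 ms delay replaces the real per-client query.
        // Swap in the actual lookup to compare the two OS versions.
        TimeSpan elapsed = await RunAsync(4000, id => Task.Delay(50));
        Console.WriteLine($"4000 lookups finished in {elapsed.TotalSeconds:F1} s");
    }
}
```

Running the same harness on both VMs, with the real lookup passed in as `lookup`, would give a directly comparable elapsed time for each OS.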
Expected behavior
As on Ubuntu 20.04, this should be expensive, but it shouldn't max out the CPU.
Actual behavior
On Ubuntu 22.04, it maxes out the CPU and takes a lot longer for all 4k lookups to complete.
Environment summary
SDK Version: CosmosDB 3.31.2
OS Version: Ubuntu 20.04 and Ubuntu 22.04
Additional context
I would love to help gather any more info that I can, perf traces or whatnot; just let me know.
Issue Analytics
- Created 10 months ago
- Comments:15 (7 by maintainers)
Top GitHub Comments
They are DigitalOcean VMs, CPU-optimized with 4 cores. I'm pretty sure it's not a problem with the VM, because I have 7+ VMs all running the same setup worldwide. None of them had a problem on 20.04, but they all have it now after the update to 22.04.
DigitalOcean also doesn't move VMs between hosts when they are turned off or restarted, as Azure can, so they should be running on the same hardware as before the 20.04 -> 22.04 move.
Yeah, will do.
Thanks for all of the info and help. I will take my findings and traces and open a bug elsewhere. 😃