
High CPU on SQL requests after Azure SQL Server switch (e.g. maintenance, upgrade)


After an Azure SQL database switch (caused by an internal upgrade of the SQL engine/VM or by a configuration change such as scaling from S2 to S3), we observe a CPU usage spike 3 to 10 minutes after the switch. If the connections are dropped and new ones are created, everything goes back to normal. If only the old connections remain, the high CPU consumption per request persists for hours (it is not just a bump after the switch).

The size of the CPU spike is proportional to SQL request traffic: for example, if database CPU was at 5%, it goes up to 50%, not 100%. All requests still work properly and return the expected data.

I’ve attached a script (https://gist.github.com/gluwer/8556ac515de53e497f611d0984627b86) that reproduces the situation, but it requires creating a simple Azure SQL DB. It was tested on S0 and S1, but we observe the same CPU spikes on P2 databases too. The script is a bit crude, but it reproduces the issue easily. It contains only CRUD operations and no stored procedures. Below is a log from one of its runs:

100
5 5
Database cpu 60 (3.923)
...
Database cpu 60 (18.817)
1100
1200
6 5
Database cpu 60 (18.479)
1300
1400
6 5
Database cpu 60 (18.43)
Pool error (bint_test) ESOCKET
Error: read ECONNRESET
...
Ended connection
40613 : Database 'test' on server 'test' is not currently available.  Please retry the connection later.  If the problem persists, contact customer support, and provide them the session tracing ID of '333333'.
...
1600
...
Database cpu 60 (36.05125)
2400
2500
6 5
Database cpu 60 (35.611)
...
Database cpu 60 (35.411)
4800
6 5
Database cpu 60 (48.291)
6 5
Database cpu 60 (61.44)
Cpu above threshold, use new connection pool
Database bint_test new pool started
Ended connection
Ended connection
Ended connection
Ended connection
Ended connection
Ended connection
Database bint_test oldPool close
...
5600
5 5
Database cpu 60 (52.196)
5701 : Changed database context to 'bint_test'.
5703 : Changed language setting to us_english.
5703 : Changed language setting to us_english.
5700
5800
6 5
Database cpu 60 (39.065)
5900
6000
6 5
Database cpu 60 (35.522)
6100
6 5
Database cpu 60 (34.978)
6300
6400
6 5
Database cpu 60 (33.996)
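The log line "Cpu above threshold, use new connection pool" shows the test script swapping in a fresh pool and then closing the old one. Each "Database cpu 60 (x)" line appears to pair a fixed threshold of 60 with an average CPU reading, since the swap happens right after "Database cpu 60 (61.44)"; that reading of the log is an assumption. Below is a minimal, library-agnostic sketch of such a rotation. `ConnectionPool`, `PoolFactory`, `PoolRotator` and `onCpuSample` are hypothetical names, not the mssql or tedious API, and the gist may implement it differently.

```typescript
// Hypothetical minimal pool interface; this is NOT the mssql or tedious API.
interface ConnectionPool {
  readonly id: number;
  close(): Promise<void>;
}

type PoolFactory = () => Promise<ConnectionPool>;

// Replaces the active pool with a fresh one when the average database CPU
// goes above a threshold, then closes the old pool so its connections drop.
class PoolRotator {
  private current: ConnectionPool;
  private rotating = false;
  private readonly createPool: PoolFactory;
  private readonly cpuThreshold: number;

  constructor(initial: ConnectionPool, createPool: PoolFactory, cpuThreshold: number) {
    this.current = initial;
    this.createPool = createPool;
    this.cpuThreshold = cpuThreshold; // assumed: 60, as the log suggests
  }

  // The pool that new requests should use.
  get pool(): ConnectionPool {
    return this.current;
  }

  // Feed the latest average CPU reading; resolves to true if a rotation happened.
  async onCpuSample(avgCpuPercent: number): Promise<boolean> {
    if (this.rotating || avgCpuPercent <= this.cpuThreshold) return false;
    this.rotating = true;
    try {
      const fresh = await this.createPool(); // open the new pool first
      const old = this.current;
      this.current = fresh; // route new requests to fresh connections
      await old.close(); // then drop the old (affected) connections
      return true;
    } finally {
      this.rotating = false;
    }
  }
}
```

Opening the new pool before closing the old one keeps requests flowing during the swap, and the `rotating` flag prevents overlapping rotations if another CPU reading arrives while a swap is still in progress.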

The problem occurs almost every time. We used the newest tedious (9.0.1) and Node.js 12.16.3.

Below is how it all looks from the database's perspective:

end_time	avg_cpu_percent	avg_data_io_percent	avg_log_write_percent	avg_memory_usage_percent	xtp_storage_percent	max_worker_percent	max_session_percent
2020-08-06 09:15:21.760	99.78	0.00	16.01	1.50	0.00	6.66	1.33
2020-08-06 09:15:06.750	99.51	0.00	18.10	1.49	0.00	3.33	1.33
2020-08-06 09:14:51.733	99.65	0.00	15.90	1.49	0.00	6.66	1.33
2020-08-06 09:14:36.727	80.05	0.00	28.06	1.49	0.00	8.33	1.33
2020-08-06 09:14:21.680	33.85	0.00	56.81	1.49	0.00	8.33	1.33
2020-08-06 09:14:06.683	33.88	0.00	54.06	1.48	0.00	8.33	1.33

When there is only one connection and CPU rises only to, for example, 50%, all other stats stay the same; only CPU jumps.
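One way to check whether only CPU moves is to diff two resource-stats snapshots column by column. The column names in the table above appear to match Azure SQL Database's `sys.dm_db_resource_stats` view. This is an illustrative sketch, not part of the original script; `changedMetrics` and the 5-point tolerance are assumptions. The sample rows are the oldest (09:14:06) and newest (09:15:21) rows of the table.

```typescript
// Column names follow the table above; values are percentages.
type ResourceSample = Record<string, number>;

// Returns the metrics whose value changed by more than `tolerance`
// percentage points between two samples, with the signed change
// rounded to two decimals.
function changedMetrics(
  before: ResourceSample,
  after: ResourceSample,
  tolerance: number,
): Array<[string, number]> {
  const changed: Array<[string, number]> = [];
  for (const [key, oldValue] of Object.entries(before)) {
    if (!(key in after)) continue; // compare only columns present in both
    const delta = (after[key] as number) - oldValue;
    if (Math.abs(delta) > tolerance) {
      changed.push([key, Math.round(delta * 100) / 100]);
    }
  }
  return changed;
}

// Oldest row of the table (09:14:06.683).
const oldest: ResourceSample = {
  avg_cpu_percent: 33.88,
  avg_data_io_percent: 0.0,
  avg_log_write_percent: 54.06,
  avg_memory_usage_percent: 1.48,
  xtp_storage_percent: 0.0,
  max_worker_percent: 8.33,
  max_session_percent: 1.33,
};

// Newest row of the table (09:15:21.760).
const newest: ResourceSample = {
  avg_cpu_percent: 99.78,
  avg_data_io_percent: 0.0,
  avg_log_write_percent: 16.01,
  avg_memory_usage_percent: 1.5,
  xtp_storage_percent: 0.0,
  max_worker_percent: 6.66,
  max_session_percent: 1.33,
};

// Flags avg_cpu_percent (+65.9) and avg_log_write_percent (-38.05).
console.log(changedMetrics(oldest, newest, 5));
```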

What we have checked while looking for a possible solution:

tedious debug output and request content (before and during high CPU, all packets look exactly the same, including reported state switches, etc.)
tedious TDS protocol version and some other options (checked 7.3 and 7.2; the problem still exists there)
database compatibility level (from 100 to 140, no effect)
connections don’t have to exist when there is a switch, e.g. you can make a switch and start new connections seconds later and the high CPU still occurs (but if you connect after 20 minutes, everything is fine)
a single connection with some load is enough to make CPU go from 5% to 50% on an S0 database
doing a second switch while the high CPU from the first one is occurring does not make it happen again, even after 20 minutes
the database still reports the correct number of connections; they are not doubled, etc.
query CPU reported in Log Analytics is the same as before, but query duration jumps (normally the two are very similar)
during high CPU, there are many more SOS_SCHEDULER waits than normal
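The observation that query CPU stays the same while duration jumps can be expressed as the share of elapsed time not spent on CPU: when CPU time and duration are similar that share is near 0, and when duration grows with unchanged CPU time it approaches 1. A minimal sketch, assuming per-query CPU time and duration in milliseconds are available; the `QueryTiming` field names are hypothetical and do not correspond to a specific Log Analytics schema.

```typescript
// One row of per-query statistics (hypothetical field names, milliseconds).
interface QueryTiming {
  cpuTimeMs: number;
  durationMs: number;
}

// Fraction of elapsed time not spent on CPU. For a serial query this
// approximates time spent waiting, including time spent waiting for a
// scheduler (such as the SOS_SCHEDULER waits mentioned above).
// Parallel plans can report more CPU than elapsed time, so clamp at 0.
function nonCpuFraction(t: QueryTiming): number {
  if (t.durationMs <= 0) return 0;
  const waitMs = Math.max(0, t.durationMs - t.cpuTimeMs);
  return waitMs / t.durationMs;
}

// Average non-CPU fraction over a batch of queries.
function averageNonCpuFraction(rows: QueryTiming[]): number {
  if (rows.length === 0) return 0;
  const total = rows.reduce((sum, r) => sum + nonCpuFraction(r), 0);
  return total / rows.length;
}
```

Comparing this average for a window before the switch with a window during the high-CPU period would show whether the extra duration is waiting rather than query work.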

We also checked what happens if we use .NET Entity Framework with a similar load and do the switch: there is no high CPU within the next 20 minutes. We initially reported the issue on mssql (https://github.com/tediousjs/node-mssql/issues/1067), as we do not use tedious directly.

Conclusions

At the moment we do not know whether the problem is related to Node, tedious, MS SQL Server or the Azure SQL Gateway. It looks like the gateway or SQL Server does something (maybe at a low level) minutes after the switch that causes this strange CPU spike.

Issue Analytics

Created 3 years ago
Comments: 8 (2 by maintainers)

Top GitHub Comments

gluwer commented on Aug 14, 2020 (1 reaction):

We ran one last test today and no high CPU occurred, so it looks like it was an SQL engine bug, fixed in the newest update. Because of this, I’m closing the issue. Thank you for all the help.

David-Engel commented on Aug 13, 2020 (0 reactions):

@gluwer No, unfortunately there is not. It’s possible it’s the same bug. The two do share a lot of the same code and the timing is interesting. I don’t have visibility into that, though.