cloud.Service update takes a long time (~8 minutes)

In the PPC program cloud.Service, we see the following on updates:

2017-12-15 17:26:13 -0800 service has reached a steady state.
2017-12-15 17:25:51 -0800 service has stopped 1 running tasks: task 7891c7d6-8bb3-43d4-a33b-1066e577a65e.
2017-12-15 17:22:44 -0800 service has begun draining connections on 1 tasks.
2017-12-15 17:22:44 -0800 service deregistered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:18:05 -0800 service registered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:17:34 -0800 service has started 1 tasks: task c1f0173a-5b79-4116-965f-dee9a7ee00d9.
2017-12-15 17:00:07 -0800 service has reached a steady state.

This takes 519 seconds from update start to update steady state.

There is a 279-second delay between registering the new task and deregistering the old task, then a 187-second delay between deregistering the old task and stopping it. The latter is due to deregistrationDelay being set to 180 seconds. The former, though, is a very long wait, during which both the old and new tasks are potentially reachable.

Contrast that with the Pulumi API Service, which is built on ECS but does not use the cloud.Service abstraction. There we see this:

2017-12-15 16:43:42 -0800 service has reached a steady state.
2017-12-15 16:43:17 -0800 service has stopped 2 running tasks: task ce1d3329-fc1e-4e73-8532-be4488d9fc6c task a5b4a734-0903-4881-97a8-e62538f10856.
2017-12-15 16:42:41 -0800 service has begun draining connections on 2 tasks.
2017-12-15 16:42:41 -0800 service deregistered 2 targets in target-group apiTG66f890d7
2017-12-15 16:42:17 -0800 service registered 2 targets in target-group apiTG66f890d7
2017-12-15 16:42:06 -0800 service has started 2 tasks: task 948a0462-184f-40c2-8b76-ce0cd085578d task 40ff51f0-e153-4b70-9c10-fd406a338964.
2017-12-15 12:28:44 -0800 service has reached a steady state.

This takes only 96 seconds from update start to update steady state.

Notably, there are always exactly 24 seconds between registering the new targets and deregistering the old targets, which is great. Then there are just over 30 seconds (36 in the log above) between deregistering the old tasks and stopping them. The latter is due to deregistrationDelay being set to 30 seconds in the Pulumi API Service scripts, which explains the difference there. But for the former, there does not appear to be any configuration that controls it, and it’s unclear why there is such a big difference between these two cases.
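The figures quoted for both updates can be re-derived from the event timestamps. The following is a minimal sketch that hard-codes the five relevant timestamps from each log above; the function name `phase_durations` and the phase labels are made up for illustration and are not part of any Pulumi or AWS API.

```python
from datetime import datetime

# Timestamp format used in the ECS service event logs quoted above.
FMT = "%Y-%m-%d %H:%M:%S %z"


def phase_durations(started, registered, deregistered, stopped, steady):
    """Return the seconds spent in each phase of one rolling update.

    Each argument is the timestamp string of the matching ECS event.
    The phase names are illustrative labels only.
    """
    t = [datetime.strptime(s, FMT)
         for s in (started, registered, deregistered, stopped, steady)]

    def secs(a, b):
        return int((b - a).total_seconds())

    return {
        "start_to_register": secs(t[0], t[1]),
        "register_to_deregister": secs(t[1], t[2]),
        "deregister_to_stop": secs(t[2], t[3]),
        "stop_to_steady": secs(t[3], t[4]),
        "total": secs(t[0], t[4]),
    }


# cloud.Service update (first log).
cloud_service = phase_durations(
    "2017-12-15 17:17:34 -0800",
    "2017-12-15 17:18:05 -0800",
    "2017-12-15 17:22:44 -0800",
    "2017-12-15 17:25:51 -0800",
    "2017-12-15 17:26:13 -0800",
)

# Pulumi API Service update (second log).
api_service = phase_durations(
    "2017-12-15 16:42:06 -0800",
    "2017-12-15 16:42:17 -0800",
    "2017-12-15 16:42:41 -0800",
    "2017-12-15 16:43:17 -0800",
    "2017-12-15 16:43:42 -0800",
)

# Side-by-side comparison of the two updates.
for phase in cloud_service:
    print(f"{phase:<24} {cloud_service[phase]:>5}s {api_service[phase]:>5}s")
```

Laid out this way, the register-to-deregister phase accounts for most of the gap between the two updates (279 seconds versus 24), with the deregistration delay making up most of the rest.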

My guess is that this is somehow related to waiting for health checks to pass on the new service before deregistering the old service. And perhaps the health checks are taking longer on the PPCD process?

Also, the PPCD process is using an NLB, while the API service is using an ALB, which may be a source of difference here.
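If the health-check guess is right, the register-to-deregister wait should scale with the target group's health check settings, since a new target is only considered healthy after a number of consecutive successful checks. The sketch below is rough arithmetic under that assumption; the interval and threshold values are hypothetical examples, not the settings actually used by either service, and the function name is made up for illustration.

```python
def approx_seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Rough estimate of how long a new target takes to be marked healthy.

    Assumes the target needs `healthy_threshold` consecutive passing
    checks spaced `interval_s` seconds apart. Real load balancers may
    differ in when the first check fires, so treat this as approximate.
    """
    return interval_s * healthy_threshold


# Hypothetical (interval, threshold) pairs, NOT taken from the issue.
for interval_s, threshold in [(10, 2), (30, 3), (30, 10)]:
    estimate = approx_seconds_to_healthy(interval_s, threshold)
    print(f"interval={interval_s}s threshold={threshold} -> ~{estimate}s")
```

Comparing such estimates against the observed 279-second and 24-second waits, once the real health check settings of the two target groups are known, would be one way to confirm or rule out this explanation.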

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

2 reactions
lukehoban commented, Jun 4, 2018

I would love to spend some time looking at what options we have to improve the performance here. This is one of the most significant pain points of working with containers in Pulumi today, due to the very long delay on any updates to the container by default. I believe this is not actually the expected/intended behaviour of the underlying services we are depending on, and that the combination and parameterization we are using just leads to a particularly long delay. If we could cut this in half, it would have a big impact on interactive development of container-based Pulumi programs.

That said, we have no concrete known action here yet, and it may be risky to make any change in the near term. Ideally we would make some progress on understanding options and see if there is anything low-hanging and low-risk we could do.

0 reactions
joeduffy commented, Jun 4, 2018

Is there anything actionable we plan to do here before ship?
