cloud.Service update takes a long time (~8 minutes)
In the PPC program cloud.Service, we see the following on updates:
2017-12-15 17:26:13 -0800 service has reached a steady state.
2017-12-15 17:25:51 -0800 service has stopped 1 running tasks: task 7891c7d6-8bb3-43d4-a33b-1066e577a65e.
2017-12-15 17:22:44 -0800 service has begun draining connections on 1 tasks.
2017-12-15 17:22:44 -0800 service deregistered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:18:05 -0800 service registered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:17:34 -0800 service has started 1 tasks: task c1f0173a-5b79-4116-965f-dee9a7ee00d9.
2017-12-15 17:00:07 -0800 service has reached a steady state.
This takes 519 seconds from update start to update steady state.
There is a 279-second delay between registering the new task and deregistering the old task, then a 187-second delay between deregistering the old task and stopping it. The latter is due to deregistrationDelay being set to 180 seconds. The former, though, is a very long wait, during which both the old and new tasks are potentially reachable.
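The intervals quoted above can be re-derived directly from the event timestamps. The following is a minimal sketch, not part of the original report: it assumes the event lines are pasted in verbatim (minus the earlier 17:00:07 steady-state line, which belongs to the previous deployment), and the milestone names and helper functions are hypothetical labels chosen for illustration.

```python
from datetime import datetime

# Events copied from the cloud.Service log above (newest first, as listed).
# The earlier 17:00:07 steady-state line is deliberately left out.
EVENTS = """\
2017-12-15 17:26:13 -0800 service has reached a steady state.
2017-12-15 17:25:51 -0800 service has stopped 1 running tasks: task 7891c7d6-8bb3-43d4-a33b-1066e577a65e.
2017-12-15 17:22:44 -0800 service has begun draining connections on 1 tasks.
2017-12-15 17:22:44 -0800 service deregistered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:18:05 -0800 service registered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:17:34 -0800 service has started 1 tasks: task c1f0173a-5b79-4116-965f-dee9a7ee00d9.
"""

# Substrings identifying each milestone (hypothetical names for this sketch).
MILESTONES = {
    "started": "has started",
    "registered": "service registered",
    "deregistered": "service deregistered",
    "stopped": "has stopped",
    "steady": "reached a steady state",
}


def parse_milestones(log: str) -> dict:
    """Map each milestone name to the timestamp of its matching event line."""
    times = {}
    for line in log.splitlines():
        if not line.strip():
            continue
        # The first 25 characters hold "YYYY-MM-DD HH:MM:SS -0800".
        stamp = datetime.strptime(line[:25], "%Y-%m-%d %H:%M:%S %z")
        for name, marker in MILESTONES.items():
            if marker in line:
                times[name] = stamp
    return times


def seconds_between(times: dict, earlier: str, later: str) -> int:
    """Whole seconds elapsed between two named milestones."""
    return int((times[later] - times[earlier]).total_seconds())


times = parse_milestones(EVENTS)
print(seconds_between(times, "started", "steady"))           # 519
print(seconds_between(times, "registered", "deregistered"))  # 279
print(seconds_between(times, "deregistered", "stopped"))     # 187
```

Pasting a different deployment's event lines into `EVENTS` gives the same three intervals for that deployment.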
Contrast that with the Pulumi API Service, which is built on ECS but does not use the cloud.Service abstraction. There we see this:
2017-12-15 16:43:42 -0800 service has reached a steady state.
2017-12-15 16:43:17 -0800 service has stopped 2 running tasks: task ce1d3329-fc1e-4e73-8532-be4488d9fc6c task a5b4a734-0903-4881-97a8-e62538f10856.
2017-12-15 16:42:41 -0800 service has begun draining connections on 2 tasks.
2017-12-15 16:42:41 -0800 service deregistered 2 targets in target-group apiTG66f890d7
2017-12-15 16:42:17 -0800 service registered 2 targets in target-group apiTG66f890d7
2017-12-15 16:42:06 -0800 service has started 2 tasks: task 948a0462-184f-40c2-8b76-ce0cd085578d task 40ff51f0-e153-4b70-9c10-fd406a338964.
2017-12-15 12:28:44 -0800 service has reached a steady state.
This takes only 96 seconds from update start to update steady state.
Notably, there are always exactly 24 seconds between registering the new targets and deregistering the old targets, which is great. Then there are just over 30 seconds (36 in the run above) between deregistering the old tasks and stopping them. The latter is due to deregistrationDelay being set to 30 in the Pulumi API Service scripts, which explains the difference there. But for the former, there does not appear to be any configuration which controls this, and it’s unclear why there is such a big difference between these two cases.
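To see where the extra time goes, it may help to split each deployment into its four phases and compare them. This sketch hard-codes the phase durations computed from the two event logs above; the phase labels and variable names are hypothetical, chosen only for this illustration.

```python
# Phase durations in seconds, derived from the timestamps in the two logs above.
PHASES = [
    "start -> register",
    "register -> deregister",
    "deregister -> stop",
    "stop -> steady state",
]
CLOUD_SERVICE = [31, 279, 187, 22]  # cloud.Service (NLB), 519 s total
API_SERVICE = [11, 24, 36, 25]      # Pulumi API Service (ALB), 96 s total

total_gap = sum(CLOUD_SERVICE) - sum(API_SERVICE)  # 519 - 96 = 423

# Per-phase difference between the two deployments.
gaps = [slow - fast for slow, fast in zip(CLOUD_SERVICE, API_SERVICE)]

for phase, slow, fast, gap in zip(PHASES, CLOUD_SERVICE, API_SERVICE, gaps):
    share = 100 * gap / total_gap
    print(f"{phase:<24} {slow:>4}s vs {fast:>3}s  gap {gap:>4}s ({share:5.1f}% of total gap)")
```

On these numbers, the register-to-deregister wait accounts for 255 of the 423 extra seconds (about 60%), and the deregistrationDelay difference for another 151 seconds (about 36%). The other two phases are roughly equal in both deployments.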
My guess is that this is somehow related to waiting for health checks to pass on the new service before deregistering the old service. And perhaps the health checks are taking longer on the PPCD process?
Also, the PPCD process is using an NLB, while the API service is using an ALB, which may be a source of difference here.

I would love to spend some time looking at what options we have to improve the performance here. This is one of the most significant pain points of working with containers in Pulumi today, due to the very long delay on any updates to the container by default. I believe this is not actually the expected/intended behaviour of the underlying services we are depending on, and that the combination and parameterization we are using just leads to a particularly long delay. If we could cut this in half, it would have a big impact on interactive development of container-based Pulumi programs.
That said, we have no concrete known action here yet, and it may be risky to make any change in the near term. Ideally we would make some progress on understanding options and see if there is anything low-hanging and low-risk we could do.
Is there anything actionable we plan to do here before ship?