Behaviour of agent and standard-controller on qdrouter restarts with many addresses
I’ve done a test with roughly 10k addresses - half brokered, half standard-anycast - using 1 qdrouter and 1 broker instance.
After all addresses were ready, I killed the qdrouter instance.
There are now 2 possible outcomes of such a test, depending on timing:
1. Standard Controller doesn’t notice the router addresses being gone
In this case, the agent re-creates the qdrouter addresses while the StandardController is still in its 30s waiting interval between checks (possibly with one check interrupted like this: "Error requesting router status from qdrouterd-65c79fc64c-qq2x4. Ignoring").
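To make the timing easier to follow, here is a hedged sketch of how I understand the controller’s periodic check behaves (the actual StandardController is Java; the TypeScript and the function names here are purely illustrative): checks run on a fixed ~30s interval and a failed router-status request is simply ignored, which leaves a window in which the agent can re-create the router addresses unnoticed.

```typescript
// Illustrative only - not the StandardController's actual code.
async function runChecks(checkRouterStatus: () => Promise<void>): Promise<never> {
  while (true) {
    try {
      await checkRouterStatus();
    } catch (err) {
      // corresponds to the observed "Error requesting router status ... Ignoring"
      console.warn(`Error requesting router status. Ignoring: ${err}`);
    }
    // 30s waiting interval between checks - the window the agent can use
    await new Promise<void>((resolve) => setTimeout(resolve, 30_000));
  }
}
```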
The situation then resolves itself quite seamlessly - after 25s all is back to normal. Here’s the log output from the agent:
2018-04-05T06:44:45.012Z agent info probe request: /probe
2018-04-05T06:44:50.204Z agent info probe request: /probe
[qdrouter pod killed at 06:45:00]
2018-04-05T06:45:00.862Z agent info [Router.qdrouterd-65c79fc64c-qq2x4] aborting pending requests: disconnected
2018-04-05T06:45:00.862Z agent info router Router.qdrouterd-65c79fc64c-qq2x4 disconnected
2018-04-05T06:45:01.877Z agent info Router connected from Router.qdrouterd-65c79fc64c-npjlz
2018-04-05T06:45:01.884Z agent info [Router.qdrouterd-65c79fc64c-npjlz] router ready
2018-04-05T06:45:01.925Z agent info Router.qdrouterd-65c79fc64c-npjlz not ready for connectivity check: false false
2018-04-05T06:45:01.925Z agent info Router.qdrouterd-65c79fc64c-npjlz not ready for connectivity check: false true
2018-04-05T06:45:01.951Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining address event/BCX18
2018-04-05T06:45:01.952Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining in autolink for event/BCX18
2018-04-05T06:45:01.952Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining out autolink for event/BCX18
[all addresses get re-defined in the qdrouter]
2018-04-05T06:45:03.316Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining address telemetry/test3_9
2018-04-05T06:45:03.316Z agent info [Router.qdrouterd-65c79fc64c-npjlz] updating addresses...
2018-04-05T06:45:15.012Z agent info probe request: /probe
2018-04-05T06:45:20.204Z agent info probe request: /probe
2018-04-05T06:45:45.012Z agent info probe request: /probe
Re-adding the addresses in the new qdrouter “qdrouterd-65c79fc64c-npjlz” appears to be finished at 06:45:23 - last entry:
2018-04-05 06:45:23.760802 +0000 AGENT (warning) The 'dir' attribute of autoLink has been deprecated. Use 'direction' instead
2018-04-05 06:45:23.760924 +0000 ROUTER_CORE (info) Auto Link Activated 'autoLinkOutevent/test3_9' on container broker-0
Log excerpt from the StandardController: standardController_noConfigMapUpdates.txt
2. Standard Controller updates all 10k ConfigMaps with unready state and then again with ready state
Here the StandardController is busy for 10 minutes, and it also drives the load of the K8s APIServer up to 3, with roughly 60k requests (2 GETs and 1 PUT per address for setting isReady=false, then the same again for setting isReady=true; a quick back-of-the-envelope follows below the excerpt). StandardController log excerpt:
2018-04-04 14:54:35.033 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 5591 ms, resolvedPlan: 2 ms, calculatedUsage: 4 ms, checkedQuota: 0 ms, listClusters: 544 ms, provisionResources: 0 ms, checkStatuses: 5032 ms, deprovisionUnused: 5 ms, replaceAddresses: 1 ms, gcTerminating: 0 ms
[at 14:55 qdrouterd pod was deleted]
2018-04-04 14:55:54.815 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 49780 ms, resolvedPlan: 2 ms, calculatedUsage: 7 ms, checkedQuota: 0 ms, listClusters: 681 ms, provisionResources: 0 ms, checkStatuses: 601 ms, deprovisionUnused: 4 ms, replaceAddresses: 48480 ms, gcTerminating: 1 ms
..
2018-04-04 15:03:20.387 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 111772 ms, resolvedPlan: 6 ms, calculatedUsage: 14 ms, checkedQuota: 0 ms, listClusters: 582 ms, provisionResources: 0 ms, checkStatuses: 14059 ms, deprovisionUnused: 4 ms, replaceAddresses: 97103 ms, gcTerminating: 1 ms
..
2018-04-04 15:04:41.562 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 81174 ms, resolvedPlan: 4 ms, calculatedUsage: 5 ms, checkedQuota: 0 ms, listClusters: 679 ms, provisionResources: 0 ms, checkStatuses: 5359 ms, deprovisionUnused: 4 ms, replaceAddresses: 75118 ms, gcTerminating: 1 ms
2018-04-04 15:04:47.044 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 5481 ms, resolvedPlan: 2 ms, calculatedUsage: 3 ms, checkedQuota: 0 ms, listClusters: 555 ms, provisionResources: 0 ms, checkStatuses: 4914 ms, deprovisionUnused: 4 ms, replaceAddresses: 0 ms, gcTerminating: 0 ms
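For reference, the ~60k requests follow directly from the per-address update pattern described above; a quick back-of-the-envelope using only the numbers from this test:

```typescript
// Back-of-the-envelope for the ~60k API requests mentioned above.
const addresses = 10_000;               // roughly 10k addresses in the test
const requestsPerStatusUpdate = 2 + 1;  // 2 GETs + 1 PUT per address
const statusFlips = 2;                  // isReady=false, then isReady=true again
console.log(addresses * requestsPerStatusUpdate * statusFlips); // ≈ 60,000 requests
```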
The agent pod also has to work hard to process these updates. It gets restarted a few times because the liveness probes (periodSeconds=30, timeoutSeconds=5) are not answered while the address updates are being processed (a sketch of why follows after the logs):
2018-04-04T14:54:32.058Z agent info probe request: /probe
[at 14:55 qdrouterd pod was deleted]
2018-04-04T14:55:01.132Z agent info [Router.qdrouterd-65c79fc64c-hwvp7] aborting pending requests: disconnected
2018-04-04T14:55:01.132Z agent info router Router.qdrouterd-65c79fc64c-hwvp7 disconnected
2018-04-04T14:55:02.001Z agent info probe request: /probe
2018-04-04T14:55:02.034Z agent info probe request: /probe
2018-04-04T14:55:02.070Z agent error [broker-0] failed to retrieve addresses: ["AMQ119067: Cannot find resource with name queue.event/test1_1097"]
2018-04-04T14:55:05.775Z agent error [broker-0] failed to retrieve addresses: ["AMQ119067: Cannot find resource with name address.event/test1_22"]
2018-04-04T14:55:06.474Z agent info addresses_defined: modified ["event/SMOKE_TEST_TENANT"]
2018-04-04T14:55:06.846Z agent info addresses_defined: modified event/test1_500, event/test1_501, event/test1_502, event/test1_503, event/test1_504 and 29 more
2018-04-04T14:55:11.021Z agent info addresses_defined: modified event/test1_522, event/test1_523, event/test1_524, event/test1_525, event/test1_528 and 336 more
2018-04-04T14:55:49.694Z agent info [broker-0] connection error: {"type":"error#1d","condition":"amqp:resource-limit-exceeded","description":"local-idle-timeout expired"}
2018-04-04T14:55:49.711Z agent info [broker-0] connection closed
2018-04-04T14:55:49.711Z agent info broker disconnected: broker-0
2018-04-04T14:55:49.737Z agent error [broker-0] failed to retrieve addresses: undefined
2018-04-04T14:55:49.848Z agent info addresses_defined: modified event/PERF_TEST_TENANT, event/test1_700, event/test1_701, event/test1_702, event/test1_703 and 337 more
2018-04-04T14:56:26.822Z agent info probe request: /probe
2018-04-04T14:56:26.825Z agent info probe request: /probe
2018-04-04T14:56:26.933Z agent info addresses_defined: modified event/test1_2000, event/test1_2001, event/test1_2002, event/test1_2003, event/test1_2004 and 336 more
2018-04-04T14:57:04.140Z agent info probe request: /probe
2018-04-04T14:57:04.143Z agent info probe request: /probe
2018-04-04T14:57:04.248Z agent info addresses_defined: modified event/test1_2050, event/test1_2051, event/test1_2052, event/test1_2053, event/test1_2054 and 337 more
[pod gets killed here because of failing livenessProbes]
Log of the restarted agent:
2018-04-04T14:57:07.799Z agent info GET /oapi/v1/namespaces/hono/routes/messaging => 404
2018-04-04T14:57:07.804Z agent info could not retrieve messaging route hostname
2018-04-04T14:57:07.823Z agent info Router agent listening on 55671
2018-04-04T14:57:08.701Z agent info GET /api/v1/namespaces/hono/configmaps?labelSelector=type%3Daddress-config => 200
2018-04-04T14:57:08.988Z agent info addresses_defined: undefined
2018-04-04T14:57:09.121Z agent info addresses_ready: undefined
2018-04-04T14:57:09.121Z agent info triggering address configuration check
2018-04-04T14:57:09.134Z agent info address configuration check triggered
2018-04-04T14:57:09.147Z agent info GET /api/v1/watch/namespaces/hono/configmaps?labelSelector=type%3Daddress-config => 200
2018-04-04T14:57:12.311Z agent info addresses_defined: modified telemetry/test2_1642, telemetry/test2_1643, telemetry/test2_1644, telemetry/test2_1645, telemetry/test2_1647 and 32 more
2018-04-04T14:57:25.321Z agent info addresses_defined: modified telemetry/IOT_ACADEMY, telemetry/test1_900, telemetry/test1_901, telemetry/test1_902, telemetry/test1_903 and 83 more
2018-04-04T14:58:00.356Z agent info addresses_defined: modified telemetry/test1_765, telemetry/test1_766, telemetry/test1_767, telemetry/test1_768, telemetry/test1_769 and 198 more
[pod again killed]
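My read is that the probe timeouts come from the Node.js event loop being occupied by the address updates, so the pending /probe request simply cannot be answered within timeoutSeconds=5. A minimal illustration of that effect (not the agent’s actual code; names are hypothetical):

```typescript
import * as http from 'http';

// While a long burst of synchronous per-address work occupies the event loop,
// the /probe handler below cannot run, and kubelet counts the probe as failed
// even though the process is alive.
const server = http.createServer((req, res) => {
  if (req.url === '/probe') {
    res.end('ok'); // only runs once the event loop is free again
  } else {
    res.statusCode = 404;
    res.end();
  }
});
server.listen(8080);

function processAddressUpdates(addresses: string[]): void {
  for (const address of addresses) {
    // ... synchronous per-address work; if one batch takes longer than 5s,
    // the probe above times out ...
    void address;
  }
}
```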
All this leads me to question whether the ready state of addresses really should be persisted in the ConfigMap, and whether there should be some way of synchronizing the actions of the agent and the StandardController - or of consolidating address provisioning and broker/qdrouter scaling into one component.
Top GitHub Comments
I think the evidence is still inconclusive on that point. The memory use is high at present because the agent is doing a lot of unnecessary stuff (39k request/responses to get stats for 3k queues!). Once this is fixed I hope the picture will change.
That said, I think the reason for the current choices was always really just developer preference rather than any proper comparative analysis.
I think that would be the worst of both worlds!
We could also port the agent/console-server code to java, but I’m not a big fan of java TBH 😃
Actually you could even just avoid creating the listeners until the initial configuration is complete. That would automatically mean it is not available until then. (i.e. create the listeners dynamically rather than statically as we do now)
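A minimal sketch of what that could look like in the agent, assuming a hypothetical loadInitialConfig() that applies the initial address configuration (the port is the one seen in the agent log above):

```typescript
import * as http from 'http';

// The listener is only created after the initial address configuration has
// been applied, so nothing can connect (or pass a readiness check) before the
// agent is actually ready.
async function start(loadInitialConfig: () => Promise<void>): Promise<void> {
  await loadInitialConfig(); // apply all known addresses first
  const server = http.createServer((_req, res) => res.end('ok'));
  server.listen(55671, () => console.log('Router agent listening on 55671'));
}
```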