Behaviour of agent and standard-controller on qdrouter restarts with many addresses
I’ve done a test with roughly 10k addresses - half brokered, half standard-anycast - using 1 qdrouter and 1 broker instance.
After all addresses were ready, I killed the qdrouter instance.
There are now 2 possible outcomes of such a test, depending on timing:
1. Standard Controller doesn’t notice the router addresses being gone
In this case, the agent re-creates the qdrouter addresses while the StandardController is still in its 30s waiting interval between checks (possibly with one check interrupted like this: "Error requesting router status from qdrouterd-65c79fc64c-qq2x4. Ignoring").
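To make the timing easier to follow, here is a hedged sketch of how I understand the controller’s periodic check behaves (the actual StandardController is Java; the TypeScript and the function names here are purely illustrative): checks run on a fixed ~30s interval and a failed router-status request is simply ignored, which leaves a window in which the agent can re-create the router addresses unnoticed.

```typescript
// Illustrative only - not the StandardController's actual code.
async function runChecks(checkRouterStatus: () => Promise<void>): Promise<never> {
  while (true) {
    try {
      await checkRouterStatus();
    } catch (err) {
      // corresponds to the observed "Error requesting router status ... Ignoring"
      console.warn(`Error requesting router status. Ignoring: ${err}`);
    }
    // 30s waiting interval between checks - the window the agent can use
    await new Promise<void>((resolve) => setTimeout(resolve, 30_000));
  }
}
```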
The situation then resolves itself quite seamlessly - after 25s all is back to normal. Here’s the log output from the agent:
2018-04-05T06:44:45.012Z agent info probe request: /probe
2018-04-05T06:44:50.204Z agent info probe request: /probe
[qdrouter pod killed at 06:45:00]
2018-04-05T06:45:00.862Z agent info [Router.qdrouterd-65c79fc64c-qq2x4] aborting pending requests: disconnected
2018-04-05T06:45:00.862Z agent info router Router.qdrouterd-65c79fc64c-qq2x4 disconnected
2018-04-05T06:45:01.877Z agent info Router connected from Router.qdrouterd-65c79fc64c-npjlz
2018-04-05T06:45:01.884Z agent info [Router.qdrouterd-65c79fc64c-npjlz] router ready
2018-04-05T06:45:01.925Z agent info Router.qdrouterd-65c79fc64c-npjlz not ready for connectivity check: false false
2018-04-05T06:45:01.925Z agent info Router.qdrouterd-65c79fc64c-npjlz not ready for connectivity check: false true
2018-04-05T06:45:01.951Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining address event/BCX18
2018-04-05T06:45:01.952Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining in autolink for event/BCX18
2018-04-05T06:45:01.952Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining out autolink for event/BCX18
[all addresses get re-defined in the qdrouter]
2018-04-05T06:45:03.316Z agent info [Router.qdrouterd-65c79fc64c-npjlz] defining address telemetry/test3_9
2018-04-05T06:45:03.316Z agent info [Router.qdrouterd-65c79fc64c-npjlz] updating addresses...
2018-04-05T06:45:15.012Z agent info probe request: /probe
2018-04-05T06:45:20.204Z agent info probe request: /probe
2018-04-05T06:45:45.012Z agent info probe request: /probe
Re-adding the addresses in the new qdrouter “qdrouterd-65c79fc64c-npjlz” appears to be finished at 06:45:23 - last entry:
2018-04-05 06:45:23.760802 +0000 AGENT (warning) The 'dir' attribute of autoLink has been deprecated. Use 'direction' instead
2018-04-05 06:45:23.760924 +0000 ROUTER_CORE (info) Auto Link Activated 'autoLinkOutevent/test3_9' on container broker-0
Log excerpt from the StandardController: standardController_noConfigMapUpdates.txt
2. Standard Controller updates all 10k ConfigMaps with unready state and then again with ready state
Here the StandardController is busy for 10 minutes, and it also drives the load of the K8s APIServer up to 3, with roughly 60k requests (2 GETs and 1 PUT per address for setting isReady=false, then the same again for setting isReady=true; a quick back-of-the-envelope follows below the excerpt). StandardController log excerpt:
2018-04-04 14:54:35.033 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 5591 ms, resolvedPlan: 2 ms, calculatedUsage: 4 ms, checkedQuota: 0 ms, listClusters: 544 ms, provisionResources: 0 ms, checkStatuses: 5032 ms, deprovisionUnused: 5 ms, replaceAddresses: 1 ms, gcTerminating: 0 ms
[at 14:55 qdrouterd pod was deleted]
2018-04-04 14:55:54.815 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 49780 ms, resolvedPlan: 2 ms, calculatedUsage: 7 ms, checkedQuota: 0 ms, listClusters: 681 ms, provisionResources: 0 ms, checkStatuses: 601 ms, deprovisionUnused: 4 ms, replaceAddresses: 48480 ms, gcTerminating: 1 ms
..
2018-04-04 15:03:20.387 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 111772 ms, resolvedPlan: 6 ms, calculatedUsage: 14 ms, checkedQuota: 0 ms, listClusters: 582 ms, provisionResources: 0 ms, checkStatuses: 14059 ms, deprovisionUnused: 4 ms, replaceAddresses: 97103 ms, gcTerminating: 1 ms
..
2018-04-04 15:04:41.562 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 81174 ms, resolvedPlan: 4 ms, calculatedUsage: 5 ms, checkedQuota: 0 ms, listClusters: 679 ms, provisionResources: 0 ms, checkStatuses: 5359 ms, deprovisionUnused: 4 ms, replaceAddresses: 75118 ms, gcTerminating: 1 ms
2018-04-04 15:04:47.044 [Thread-4] INFO AddressController:155 - onUpdate: done; total: 5481 ms, resolvedPlan: 2 ms, calculatedUsage: 3 ms, checkedQuota: 0 ms, listClusters: 555 ms, provisionResources: 0 ms, checkStatuses: 4914 ms, deprovisionUnused: 4 ms, replaceAddresses: 0 ms, gcTerminating: 0 ms
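For reference, the ~60k requests follow directly from the per-address update pattern described above; a quick back-of-the-envelope using only the numbers from this test:

```typescript
// Back-of-the-envelope for the ~60k API requests mentioned above.
const addresses = 10_000;               // roughly 10k addresses in the test
const requestsPerStatusUpdate = 2 + 1;  // 2 GETs + 1 PUT per address
const statusFlips = 2;                  // isReady=false, then isReady=true again
console.log(addresses * requestsPerStatusUpdate * statusFlips); // ≈ 60,000 requests
```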
The agent pod also has to work hard to process these updates. It gets restarted a few times because the liveness probes (periodSeconds=30, timeoutSeconds=5) are not answered while the address updates are being processed (a sketch of why follows after the logs):
2018-04-04T14:54:32.058Z agent info probe request: /probe
[at 14:55 qdrouterd pod was deleted]
2018-04-04T14:55:01.132Z agent info [Router.qdrouterd-65c79fc64c-hwvp7] aborting pending requests: disconnected
2018-04-04T14:55:01.132Z agent info router Router.qdrouterd-65c79fc64c-hwvp7 disconnected
2018-04-04T14:55:02.001Z agent info probe request: /probe
2018-04-04T14:55:02.034Z agent info probe request: /probe
2018-04-04T14:55:02.070Z agent error [broker-0] failed to retrieve addresses: ["AMQ119067: Cannot find resource with name queue.event/test1_1097"]
2018-04-04T14:55:05.775Z agent error [broker-0] failed to retrieve addresses: ["AMQ119067: Cannot find resource with name address.event/test1_22"]
2018-04-04T14:55:06.474Z agent info addresses_defined: modified ["event/SMOKE_TEST_TENANT"]
2018-04-04T14:55:06.846Z agent info addresses_defined: modified event/test1_500, event/test1_501, event/test1_502, event/test1_503, event/test1_504 and 29 more
2018-04-04T14:55:11.021Z agent info addresses_defined: modified event/test1_522, event/test1_523, event/test1_524, event/test1_525, event/test1_528 and 336 more
2018-04-04T14:55:49.694Z agent info [broker-0] connection error: {"type":"error#1d","condition":"amqp:resource-limit-exceeded","description":"local-idle-timeout expired"}
2018-04-04T14:55:49.711Z agent info [broker-0] connection closed
2018-04-04T14:55:49.711Z agent info broker disconnected: broker-0
2018-04-04T14:55:49.737Z agent error [broker-0] failed to retrieve addresses: undefined
2018-04-04T14:55:49.848Z agent info addresses_defined: modified event/PERF_TEST_TENANT, event/test1_700, event/test1_701, event/test1_702, event/test1_703 and 337 more
2018-04-04T14:56:26.822Z agent info probe request: /probe
2018-04-04T14:56:26.825Z agent info probe request: /probe
2018-04-04T14:56:26.933Z agent info addresses_defined: modified event/test1_2000, event/test1_2001, event/test1_2002, event/test1_2003, event/test1_2004 and 336 more
2018-04-04T14:57:04.140Z agent info probe request: /probe
2018-04-04T14:57:04.143Z agent info probe request: /probe
2018-04-04T14:57:04.248Z agent info addresses_defined: modified event/test1_2050, event/test1_2051, event/test1_2052, event/test1_2053, event/test1_2054 and 337 more
[pod gets killed here because of failing livenessProbes]
Log of the restarted agent:
2018-04-04T14:57:07.799Z agent info GET /oapi/v1/namespaces/hono/routes/messaging => 404
2018-04-04T14:57:07.804Z agent info could not retrieve messaging route hostname
2018-04-04T14:57:07.823Z agent info Router agent listening on 55671
2018-04-04T14:57:08.701Z agent info GET /api/v1/namespaces/hono/configmaps?labelSelector=type%3Daddress-config => 200
2018-04-04T14:57:08.988Z agent info addresses_defined: undefined
2018-04-04T14:57:09.121Z agent info addresses_ready: undefined
2018-04-04T14:57:09.121Z agent info triggering address configuration check
2018-04-04T14:57:09.134Z agent info address configuration check triggered
2018-04-04T14:57:09.147Z agent info GET /api/v1/watch/namespaces/hono/configmaps?labelSelector=type%3Daddress-config => 200
2018-04-04T14:57:12.311Z agent info addresses_defined: modified telemetry/test2_1642, telemetry/test2_1643, telemetry/test2_1644, telemetry/test2_1645, telemetry/test2_1647 and 32 more
2018-04-04T14:57:25.321Z agent info addresses_defined: modified telemetry/IOT_ACADEMY, telemetry/test1_900, telemetry/test1_901, telemetry/test1_902, telemetry/test1_903 and 83 more
2018-04-04T14:58:00.356Z agent info addresses_defined: modified telemetry/test1_765, telemetry/test1_766, telemetry/test1_767, telemetry/test1_768, telemetry/test1_769 and 198 more
[pod again killed]
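My read is that the probe timeouts come from the Node.js event loop being occupied by the address updates, so the pending /probe request simply cannot be answered within timeoutSeconds=5. A minimal illustration of that effect (not the agent’s actual code; names are hypothetical):

```typescript
import * as http from 'http';

// While a long burst of synchronous per-address work occupies the event loop,
// the /probe handler below cannot run, and kubelet counts the probe as failed
// even though the process is alive.
const server = http.createServer((req, res) => {
  if (req.url === '/probe') {
    res.end('ok'); // only runs once the event loop is free again
  } else {
    res.statusCode = 404;
    res.end();
  }
});
server.listen(8080);

function processAddressUpdates(addresses: string[]): void {
  for (const address of addresses) {
    // ... synchronous per-address work; if one batch takes longer than 5s,
    // the probe above times out ...
    void address;
  }
}
```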
All this leads me to question whether the ready state of addresses really should be persisted in the ConfigMap, and whether there should be some way of synchronizing the actions of the agent and the StandardController - or of consolidating address provisioning and broker/qdrouter scaling into one component.
Top GitHub Comments
I think the evidence is still inconclusive on that point. The memory use is high at present because the agent is doing a lot of unnecessary stuff (39k request/responses to get stats for 3k queues!). Once this is fixed I hope the picture will change.
That said, I think the reason for the current choices was always really just developer preference rather than any proper comparative analysis.
I think that would be the worst of both worlds!
We could also port the agent/console-server code to java, but I’m not a big fan of java TBH 😃
Actually you could even just avoid creating the listeners until the initial configuration is complete. That would automatically mean it is not available until then. (i.e. create the listeners dynamically rather than statically as we do now)
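A minimal sketch of what that could look like in the agent, assuming a hypothetical loadInitialConfig() that applies the initial address configuration (the port is the one seen in the agent log above):

```typescript
import * as http from 'http';

// The listener is only created after the initial address configuration has
// been applied, so nothing can connect (or pass a readiness check) before the
// agent is actually ready.
async function start(loadInitialConfig: () => Promise<void>): Promise<void> {
  await loadInitialConfig(); // apply all known addresses first
  const server = http.createServer((_req, res) => res.end('ok'));
  server.listen(55671, () => console.log('Router agent listening on 55671'));
}
```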