[Bug] CoAP observation drops
Bug
The CoAP observation relationship is "lost" ~1 min after being established, while trying to Observe coap://host/api/v1/$ACCESS_TOKEN/rpc.
Server
Confirmed on:
- demo.thingsboard.io
- ThingsBoard PE 3.3.0 running on Ubuntu 20.04.3 (Docker monolith)
Your Device
- Connectivity: CoAP
- Reproducible with coap-client from libcoap 4.3.0 running on Linux (Ubuntu 20.04.3)
To Reproduce
Steps to reproduce the behavior:
- Use the sample "send rpc" widget on a dashboard to send an RPC command to coap-client, which subscribes by launching the process with the following command (an equivalent Californium-based client is sketched after these steps):
./coap-client -m get coap://demo.thingsboard.io/api/v1/$ACCESS_TOKEN/rpc -s 720 -B 720
- Click "send rpc", which results in the following output being printed to stdout by the coap-client process:
{"id":1,"method":"rpcCommand","params":{}}
- Wait ~1 min or longer and click "send rpc" again. This command never reaches the coap-client.
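For reference, below is a minimal programmatic equivalent of the observing coap-client, sketched with the Eclipse Californium Java library (the same CoAP stack that appears in the server logs). The exact URI and the use of an ACCESS_TOKEN environment variable are illustrative assumptions, not part of the original report:

```java
import org.eclipse.californium.core.CoapClient;
import org.eclipse.californium.core.CoapHandler;
import org.eclipse.californium.core.CoapObserveRelation;
import org.eclipse.californium.core.CoapResponse;

public class RpcObserver {
    public static void main(String[] args) throws InterruptedException {
        // ACCESS_TOKEN is a placeholder for the device access token
        String uri = "coap://demo.thingsboard.io/api/v1/"
                + System.getenv("ACCESS_TOKEN") + "/rpc";
        CoapClient client = new CoapClient(uri);

        CoapObserveRelation relation = client.observe(new CoapHandler() {
            @Override
            public void onLoad(CoapResponse response) {
                // Each server-to-client RPC arrives as an observe notification
                System.out.println("RPC notification: " + response.getResponseText());
            }

            @Override
            public void onError() {
                // Fires when the observation is rejected or a request times out
                System.err.println("Observation failed or was cancelled");
            }
        });

        // Keep the observation open for 12 minutes, mirroring "-s 720 -B 720"
        Thread.sleep(720_000);
        relation.proactiveCancel();
        client.shutdown();
    }
}
```

If the bug reproduces, onLoad should fire for the RPC sent right after subscribing, while RPCs sent after the ~1 min window never arrive.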
Expected behavior
The CoAP observation should not be dropped without notifying the observer.
Relevant logs
2021-09-21 13:37:38,274 [DefaultTransportService-22-6] INFO o.e.californium.core.CoapResource - successfully established observe relation between 172.19.0.1:36342#BEEFFEED and resource /api/v1 (Exchange[R1132], size 33)
After the server fails to send the RPC to the device:
2021-09-21 13:45:06,488 [CoapServer(main)#2] INFO o.e.c.c.network.stack.ObserveLayer - notification for token [Token=BEEFFEED] timed out. Canceling all relations with source [/172.19.0.1:36342]
2021-09-21 13:45:06,489 [CoapServer(main)#2] INFO o.e.californium.core.CoapResource - remove observe relation between 172.19.0.1:36342#BEEFFEED and resource /api/v1 (Exchange[R1132, complete], size 32)
2021-09-21 13:45:06,489 [CoapServer(main)#2] ERROR o.e.c.c.n.stack.ReliabilityLayer - Exception for Exchange[R1132, complete] in MessageObserver: null
java.lang.NullPointerException: null
	at org.thingsboard.server.transport.coap.client.DefaultCoapClientContext.cancelRpcSubscription(DefaultCoapClientContext.java:741)
	at org.thingsboard.server.transport.coap.client.DefaultCoapClientContext.deregisterObserveRelation(DefaultCoapClientContext.java:176)
	at org.thingsboard.server.transport.coap.CoapTransportResource$CoapResourceObserver.removedObserveRelation(CoapTransportResource.java:504)
	at org.eclipse.californium.core.CoapResource.removeObserveRelation(CoapResource.java:778)
	at org.eclipse.californium.core.observe.ObserveRelation.cancel(ObserveRelation.java:151)
	at org.eclipse.californium.core.observe.ObservingEndpoint.cancelAll(ObservingEndpoint.java:74)
	at org.eclipse.californium.core.observe.ObserveRelation.cancelAll(ObserveRelation.java:162)
	at org.eclipse.californium.core.network.stack.ObserveLayer$NotificationController.onTimeout(ObserveLayer.java:233)
	at org.eclipse.californium.core.coap.Message.setTimedOut(Message.java:954)
	at org.eclipse.californium.core.network.Exchange.setTimedOut(Exchange.java:707)
	at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask.retry(ReliabilityLayer.java:524)
	at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask.access$200(ReliabilityLayer.java:430)
	at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask$1.run(ReliabilityLayer.java:467)
	at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:289)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
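For illustration only, here is a hedged sketch of the kind of null-safe guard the NullPointerException in cancelRpcSubscription suggests is missing when a timed-out notification cancels a subscription that has already been cleared. The class and member names are hypothetical and are not the actual ThingsBoard source:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not ThingsBoard code: it only illustrates cancelling a
// subscription that may already have been cleared without throwing an NPE.
class RpcSubscriptionState {

    // Token of the active RPC observe subscription, or null if there is none.
    private final AtomicReference<String> rpcSubscriptionToken = new AtomicReference<>();

    void registerRpcSubscription(String token) {
        rpcSubscriptionToken.set(token);
    }

    // Safe to call from both the normal deregistration path and the
    // notification-timeout path; the second caller simply finds null.
    void cancelRpcSubscription() {
        String token = rpcSubscriptionToken.getAndSet(null);
        if (token == null) {
            return; // already cancelled, nothing left to clean up
        }
        // ... release per-subscription resources associated with 'token' ...
    }
}
```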
Additional context
Somewhat noteworthy: while monitoring inbound and outbound packets, we could not observe any outgoing packets for the failed RPC to the device. Additionally, this interaction is over IPv6, yet the log shows the observation mapped to an IPv4 address.
Comments (selected, from maintainers)
Hi @jairohg, I will provide more comments about this issue tomorrow morning. This is indeed related to NAT, but not only to NAT; it is also about the routing tables on many load balancers. It is a long story, but we have a solution. Stay tuned for updates.
Hi @WillNilges, my 2 cents about our "coap.thingsboard.cloud" setup: at the moment the LB is installed on AWS Ubuntu VMs with elastic IPs. It forwards the traffic to LwM2M pods using a NodePort. The LB "remembers" a routing table consisting of: A) the source IP and port of the device, and B) the destination IP and port of the node. The LB is configured to remember sessions for 1 hour. So, when the node has an update, we make sure we push it from the correct LB IP and port, and not from the AWS NAT Gateway.
Before the LB, we were still publishing the update from the node, but it was sent from the wrong IP (not from the LB IP, which received the packet, but from the AWS NAT Gateway). The client was ignoring the update since the IP of the originator of the CoAP/UDP packet was different.
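To make the routing-table description above concrete, here is a minimal Java sketch of the kind of session table such an LB keeps. The names, types, and in-memory map are illustrative assumptions, not the actual load-balancer implementation:

```java
import java.net.InetSocketAddress;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the LB session table: it records which backend node
// serves each device address and when the entry was created, so downlink
// notifications can be routed back through the same LB address and port the
// device originally contacted.
class CoapSessionTable {

    private record Entry(InetSocketAddress nodeAddress, Instant created) {}

    private static final Duration SESSION_TTL = Duration.ofHours(1);

    private final Map<InetSocketAddress, Entry> sessions = new ConcurrentHashMap<>();

    // A) source IP and port of the device -> B) IP and port of the node
    void record(InetSocketAddress deviceAddress, InetSocketAddress nodeAddress) {
        sessions.put(deviceAddress, new Entry(nodeAddress, Instant.now()));
    }

    // Returns the node for this device, or null if the session has expired.
    InetSocketAddress lookup(InetSocketAddress deviceAddress) {
        Entry entry = sessions.get(deviceAddress);
        if (entry == null || entry.created().plus(SESSION_TTL).isBefore(Instant.now())) {
            sessions.remove(deviceAddress);
            return null;
        }
        return entry.nodeAddress();
    }
}
```

A notification pushed after the entry expires, or pushed from a different source address such as the NAT Gateway, is ignored by the device, which is consistent with the maintainers' explanation of the observation drop above.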