
Last packets discarded in edge-triggered epoll (again?)


Expected behavior

Hi. After a short break I am again having trouble with the last packets being discarded. It seriously looks like a netty bug (again?), possibly a regression picked up when I updated netty across a year of 4.1 releases, but I am unable to tell for sure.

I am using edge-triggered epoll and running a custom proxy application for a game (still the same one that I was targeting with my netty bugfixes a while ago).

So the topology is client <-> proxy <-> server.

The problem is that when the client is kicked this way:

channel.writeAndFlush(kickPacket).addListener(Listeners.CLOSE);

There’s a 10% chance the client will not receive the packet and will only see a connection reset. The only workaround I have found is to call writeAndFlush and then call channel.close() 50 ms later…
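The delayed-close workaround described above can be sketched with plain JDK types. This is only an illustration under stated assumptions: `Connection`, `kick`, and `demo` are hypothetical names invented here, not Netty APIs, and the sketch does not reproduce the packet loss itself. It only shows the ordering "write first, close 50 ms later".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DelayedClose {
    /** Hypothetical stand-in for a connection; this is not a Netty type. */
    interface Connection {
        void writeAndFlush(String packet);
        void close();
    }

    /** Writes the kick packet, then closes the connection after delayMillis. */
    static void kick(Connection conn, String kickPacket,
                     ScheduledExecutorService timer, long delayMillis) {
        conn.writeAndFlush(kickPacket);
        timer.schedule(conn::close, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs the workaround against a recording connection and returns the event order. */
    static List<String> demo() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        Connection conn = new Connection() {
            @Override public void writeAndFlush(String packet) { events.add("write:" + packet); }
            @Override public void close() { events.add("close"); }
        };
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        kick(conn, "kick", timer, 50);
        timer.shutdown(); // delayed tasks that are already scheduled still run by default
        try {
            timer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [write:kick, close]
    }
}
```

In a real Netty handler the timer would presumably be the channel's own event loop, which implements ScheduledExecutorService, rather than a separate executor.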

I seriously have no idea what could be the problem here, but I remember fixing bugs related to this problem a year or two ago.

The other side of the problem is that I recently started seeing it on the “server” side as well: when the server sends a kick packet with a disconnection reason to the proxy, the packet is not received before “channelInactive” is called.

The client<->proxy connection uses autoread: true, while the proxy<->server connection uses autoread: false to avoid possible backpressure issues, but both of them show a similar problem. For the proxy<->server connection, I call channel.read() once the last batch of proxied packets has been written, which I detect via a ChannelPromise (I attach a dummy promise to an empty packet at the end of the flushed batch, so I get a callback after everything has been written).
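The read-after-flush scheme described above can be modeled without Netty. The sketch below is an assumption-laden illustration, not the actual proxy code: `Source`, `Sink`, `forwardBatch`, and `demo` are hypothetical names, and `CompletableFuture` stands in for the ChannelPromise. It shows that the next read is requested only after the trailing empty marker has been written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class ReadAfterFlush {
    /** Hypothetical source that only delivers data when read() is requested (like autoread: false). */
    interface Source { void read(); }

    /** Hypothetical sink; the returned future completes once the packet has been written. */
    interface Sink { CompletableFuture<Void> write(String packet); }

    static final String EMPTY_MARKER = "";

    /** Forwards one batch, then requests the next read once the trailing marker is written. */
    static void forwardBatch(List<String> batch, Sink sink, Source source) {
        for (String packet : batch) {
            sink.write(packet);
        }
        sink.write(EMPTY_MARKER).thenRun(source::read);
    }

    /** Simulates one batch and returns the observed order of events. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        Sink sink = packet -> {
            log.add("write:" + packet);
            CompletableFuture<Void> written = new CompletableFuture<>();
            pending.add(written);
            return written;
        };
        Source source = () -> log.add("read");
        forwardBatch(Arrays.asList("a", "b"), sink, source);
        log.add("flushed"); // no read has been requested yet at this point
        for (CompletableFuture<Void> written : pending) {
            written.complete(null); // simulate the writes completing in order
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [write:a, write:b, write:, flushed, read]
    }
}
```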

But even with autoread: false, everything should be read from the socket before channelInactive is handled, unless I am the one calling channel.disconnect(), right?

Also, the proxy<->client connections are using a watermark like this:

                .childOption(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, 1024 * 1024 * 5)
                .childOption(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, 512 * 1024 * 2)
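For reference, the two expressions above work out to 5 MiB (high) and 1 MiB (low). A small plain-Java check of that arithmetic follows; the class and constant names are made up for illustration.

```java
final class WriteWatermarks {
    // Same expressions as in the childOption calls above.
    static final int HIGH_WATER_MARK = 1024 * 1024 * 5; // 5,242,880 bytes
    static final int LOW_WATER_MARK = 512 * 1024 * 2;   // 1,048,576 bytes

    /** Formats both values in bytes and MiB. */
    static String describe() {
        int mib = 1024 * 1024;
        return "high=" + HIGH_WATER_MARK + " bytes (" + (HIGH_WATER_MARK / mib) + " MiB), "
             + "low=" + LOW_WATER_MARK + " bytes (" + (LOW_WATER_MARK / mib) + " MiB)";
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // prints high=5242880 bytes (5 MiB), low=1048576 bytes (1 MiB)
    }
}
```

With these settings, a channel stops reporting itself as writable once more than 5 MiB is queued and becomes writable again only after the queue drains below 1 MiB.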

Any advice on where I should start looking for problems (again)?

Actual behavior

Steps to reproduce

I am not sure I am able to produce a good reproducer yet before investigating more.

Minimal yet complete reproducer code (or URL to code)

Netty version

4.1.9-FINAL-Snapshot as of 4 weeks ago (9ee4cc0ada3d8e46b139671dabfbfd35c8be3308) + write squasher (source below)

JVM version (e.g. java -version)

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

OS version (e.g. uname -a)

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

normanmaurer commented, Mar 21, 2017

Please open another bug with only the remaining problem explained.

On 21.03.2017 at 06:34, ninja notifications@github.com wrote:

Yes, it probably fixes #6303, which is nice. The other half of the problem still remains, though 😕


johnou commented, Mar 17, 2017

