
File descriptor leak caused by clients prematurely closing connections

See original GitHub issue

Hi! 👋 We’ve been using the JMX exporter to instrument Cassandra (using the javaagent on version 0.3.1).

We recently had an incident caused by Cassandra running out of file descriptors. We found these had been gradually leaking over time (the metric shown is node_filefd_allocated from node_exporter on those instances; the FD limit we set for Cassandra is 100k):

[screenshot, 2018-10-17: node_filefd_allocated rising gradually on the affected nodes]

We’d been seeing some issues with Prometheus timing out whilst scraping these nodes, and found that the majority of open FDs were orphaned TCP sockets in CLOSE_WAIT. Thread dumps showed that all 5 JMX exporter threads on these nodes seemed to be stuck writing to the socket:

"pool-1-thread-1" - Thread t@84
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        - locked <2e084219> (a java.lang.Object)
        at sun.net.httpserver.Request$WriteStream.write(Request.java:391)
        - locked <368dd754> (a sun.net.httpserver.Request$WriteStream)
        at sun.net.httpserver.ChunkedOutputStream.writeChunk(ChunkedOutputStream.java:125)
        at sun.net.httpserver.ChunkedOutputStream.write(ChunkedOutputStream.java:87)
        at sun.net.httpserver.PlaceholderOutputStream.write(ExchangeImpl.java:444)
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
        at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:167)
        - locked <fd8ef5> (a java.io.ByteArrayOutputStream)
        at io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer$HTTPMetricHandler.handle(HTTPServer.java:74)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:
        - locked <467e300a> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Putting these two bits of information together gives us this theory:

  1. Prometheus scrapes the node - sends an HTTP request to the JMX exporter
  2. JMX exporter collects metrics, but takes a long time to do so (this is occasionally expected in our case; our nodes export thousands of JMX metrics)
  3. Prometheus reaches the scrape timeout, and cancels the request with a TCP FIN
  4. JMX exporter finishes collecting metrics, and attempts to write the output to the socket (https://github.com/prometheus/client_java/blob/parent-0.3.0/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L78). The other side of the TCP connection has been closed.
  5. This call blocks forever, and we never reach https://github.com/prometheus/client_java/blob/parent-0.3.0/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L80, which closes the socket.

It looks like simpleclient_httpserver doesn’t have good semantics for handling connections the client has already closed: the blocking write never times out, so the exchange is never closed and the socket’s file descriptor is never released.
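If that theory is right, it should be possible to provoke the same state with a throwaway client that requests /metrics and then closes its side of the connection before reading the response, much like Prometheus cancelling at its scrape timeout. The sketch below is only an illustration under assumptions (the exporter listening on localhost:8080, and metric collection slow enough that the server’s write happens after the close); it isn’t a verified reproduction:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Throwaway client: send a scrape request, then close the connection without
// reading the response, imitating Prometheus cancelling at its scrape timeout.
// Host, port and iteration count are placeholders.
public class PrematureCloseScraper {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 1000; i++) {
            try (Socket socket = new Socket("localhost", 8080)) {
                OutputStream out = socket.getOutputStream();
                out.write(("GET /metrics HTTP/1.1\r\n"
                         + "Host: localhost:8080\r\n"
                         + "Accept-Encoding: gzip\r\n"   // matches the gzip path seen in the thread dump
                         + "\r\n").getBytes(StandardCharsets.UTF_8));
                out.flush();
            } // try-with-resources closes the socket here, sending a FIN
              // before the (slow) collection has produced a response
            Thread.sleep(50);
        }
    }
}

Each connection whose collection outlives the close should then show up as a CLOSE_WAIT socket held by a handler thread blocked in write, matching the thread dumps above.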

We don’t have a minimal reproduction of this, but tcpdumps back this up. We’re considering forking the jmx_exporter to use simpleclient_jetty instead, but we wondered if anyone else had come across this?
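For what it’s worth, a fork along those lines would presumably come down to serving the registry from an embedded Jetty via simpleclient_servlet’s MetricsServlet instead of the com.sun.net.httpserver-based HTTPServer. A minimal sketch of that shape (the port and wiring are illustrative, not anything from the jmx_exporter code, and it assumes the JmxCollector is already registered with the default registry, as the javaagent does):

import io.prometheus.client.exporter.MetricsServlet;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;

// Illustrative sketch: expose the default CollectorRegistry over embedded Jetty.
// Jetty applies an idle timeout to connections, which should keep a dead scrape
// connection from pinning a handler thread indefinitely.
public class JettyExporter {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);                 // port is a placeholder
        ServletContextHandler context = new ServletContextHandler();
        context.setContextPath("/");
        context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
        server.setHandler(context);
        server.start();
        server.join();
    }
}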

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 25 (9 by maintainers)

Top GitHub Comments

2 reactions
jaseemabid commented, Jan 14, 2019

@brian-brazil We moved the main JMX exporter out of the Cassandra process to an external HTTP server because we couldn’t afford the FD leaks and having to restart Cassandra every so often. We now run two copies of the JMX exporter: the in-process version scrapes only the minimal JVM metrics, and the external exporter scrapes the much more detailed Cassandra metrics. Since the external one is a separate process that systemd restarts automatically if it crashes, we haven’t really looked into this further. It’s not an optimal solution, but it saved us from the problematic DB restarts.
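Concretely, the external exporter in a setup like that is little more than a small main that registers a JmxCollector against Cassandra’s remote JMX port and serves it with HTTPServer - more or less what the standalone jmx_prometheus_httpserver module does. In this sketch the config path, the listen port, and the hostPort mentioned in the comment are placeholders for whatever your deployment uses:

import java.io.File;
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.jmx.JmxCollector;

// Sketch of an out-of-process exporter: connect to Cassandra over remote JMX
// (hostPort in the YAML config, typically localhost:7199 for Cassandra) and
// expose the metrics on a separate HTTP port. If this process leaks FDs or
// wedges, systemd can restart it without touching Cassandra itself.
public class StandaloneJmxExporter {
    public static void main(String[] args) throws Exception {
        new JmxCollector(new File("/etc/jmx_exporter/cassandra.yaml")).register();
        new HTTPServer(9103);   // listen port is a placeholder
    }
}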

1 reaction
milesbxf commented, Oct 17, 2018

Yes of course - I’ve clarified in the description, thanks.

I don’t have netstat output for the affected nodes, though I’ll grab it when the issue reoccurs. We did analyse the leak with lsof, and found thousands of entries like:

java    17678 cassandra  526u     IPv4          148742999        0t0        TCP hostname-a:http-alt->hostname-b:36374 (CLOSE_WAIT)

hostname-a is the hostname of the Cassandra node we ran this on, and hostname-b is one of the Prometheus hosts. All of the connections in CLOSE_WAIT were to a Prometheus host, and we expose the JMX exporter on the http-alt port (8080), so these are definitely connections handled by the JMX exporter.


Top Results From Across the Web

Fixing File Descriptor Leaks - DSpace@MIT
When a client connects, the main thread creates a new thread and goes to the beginning of the loop to wait for connections...

Leaking filedescriptors in Plug.Static through a Phoenix app
Hi, I am leaking file descriptors in my phoenix 1.4.5 app. ... These early disconnects are causing a file handle leak and eventually...

File descriptor leak after disconnects - Oracle Communities
The issue we're having is that if clients do not gracefully close the connection the server leaks file descriptors.

JDK-6215050: (so) SocketChannel created in CLOSE_WAIT and never cleaned up - File Descriptor leak
Type: Bug; Component: core-libs; Sub-Component: java.nio ...

416971 – Jetty Leaks Connections in Time Wait - Eclipse Bugs
Greg, Is it possible that jetty is shutting down the inbound stream prematurely, getting a sun ssl exception, and then losing track of...
