Too many open files under RustBoard (EMFILE)

I am getting a lot of warnings about too many open files – is there a way to reduce or cap the number of open file descriptors?

[2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }

I don’t have that many runs (~2000), so it shouldn’t really be an issue. Using lsof to count the number of open FDs shows over 12k being used…

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
   6210 tokio-run
   6210 Reloader-
   1035 StdinWatc
   1035 server
   1035 Reloader
    184 gmain
    168 gdbus
    134 grpc_glob
     85 bash
     80 snapd

Compared to <500 in “slow” mode.

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
    427 tensorboa
    184 gmain
    168 gdbus
     85 bash
     80 snapd
     72 systemd
     71 screen
     52 dconf\x20
     51 dbus-daem
     48 llvmpipe-

In my case, the “slow” mode actually loads files faster since it doesn’t run into this issue.
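As a cross-check on the lsof totals above, here is a minimal sketch that counts the descriptors of a single process by listing /proc/<pid>/fd. It assumes Linux, and the helper name count_open_fds is made up for illustration. Since /proc/<pid>/fd has one entry per descriptor for the whole process, it should not be inflated if lsof happens to repeat the same descriptors once per thread, which may be part of what the output above shows.

```python
import os


def count_open_fds(pid="self"):
    """Count the file descriptors currently open in one process (Linux only).

    /proc/<pid>/fd holds one entry per descriptor for the whole process,
    no matter how many threads that process has.
    `pid` may be a numeric PID or the string "self" for the current process.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    print("open file descriptors:", count_open_fds())
```

Passing the PID of the running TensorBoard process instead of "self" would give a per-process figure to compare against the configured limit.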

_Originally posted by @Raphtor in https://github.com/tensorflow/tensorboard/issues/4784#issuecomment-838599948_

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
janacht commented, Apr 21, 2022

I’ve also encountered the problem and found that raising the “open files” limit by executing e.g.

ulimit -n 50000

solves the problem for me (without requiring superuser permissions).
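For reference, the same limit can be inspected and raised from inside a Python process with the standard resource module. This is only a sketch: the helper names plan_soft_limit and raise_nofile_soft_limit are invented here, and the change applies to the calling process and its children, not to an already running TensorBoard. Like ulimit -n without superuser rights, it can only raise the soft limit up to the hard limit.

```python
import resource


def plan_soft_limit(target, soft, hard, infinity=resource.RLIM_INFINITY):
    """Pick the soft limit to request: never lower it, never exceed the hard limit."""
    if soft == infinity:
        return soft  # already unlimited, nothing to raise
    if hard != infinity:
        target = min(target, hard)  # an unprivileged process cannot go past hard
    return max(target, soft)


def raise_nofile_soft_limit(target):
    """Raise the soft "open files" limit of this process toward `target`.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = plan_soft_limit(target, soft, hard)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit now:", raise_nofile_soft_limit(50000))
```

If the hard limit is below 50000, the sketch settles for the hard limit instead of failing.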

1 reaction
wchargin commented, May 11, 2021

Hi @Raphtor! Thanks for the report and the helpful info. Some questions:

  • Could you run diagnose_tensorboard.py in the same environment from which you usually run TensorBoard and post the full output in a comment (sanitizing as desired if you want to redact anything)?

  • Are you able to share the log directory with us? If not, could you describe the structure of the event files? You say that you only have ~2000 runs, but I wonder if each run tends to have many event files (can happen if your training workers restart a lot). If so, it’s possible that that explains the difference, since the particulars around how we handle multiple event files in the same directory differ somewhat.

    Broadly, there are three potential behaviors. In all cases, we read all event files in lexicographical order. When we hit EOF on an event file, we keep polling it iff…

    • in all-files mode: always keep polling it
    • in last-file mode: keep polling iff it’s the last file
    • in multifile mode: keep polling iff its last event was not too long ago (defaults to 86400 seconds = 1 day)

    TensorBoard with --load_fast=false uses last-file mode by default (and can also be told to use multifile mode), but with --load_fast=true uses all-files mode.

  • Can you also reproduce the issue when running TensorBoard with the following flags?

    --load_fast=false --reload_multifile=true --reload_multifile_inactive_secs=-1

    Same train of thought as above; this enables multifile mode with an unbounded age threshold, making it equivalent to all-files mode. If this reproduces the issue, we can probably fix this by making --load_fast=true also implement last-file and/or multifile modes, which would be nice, anyway.

  • What lsof do you have? My lsof (4.93.2, Linux) uses the first column for the command name, but (e.g.) tensorboard and bash are process names whereas Reloader and StdinWatcher are thread names. So my lsof output has lines like:

    COMMAND    PID     USER   FD      TYPE  DEVICE SIZE/OFF     NODE NAME
    server  692802 wchargin   11r      REG   254,1 11096888 15361542 /HOMEDIR/tensorboard_data/mnist/lr_1E-03,conv=2,fc=2/events.out.tfevents.1563406405.HOSTNAME
    

    …and I don’t see how your lsof | awk '{ print $1 }' is giving the output that you’re seeing. Probably just a reporting thing, but I’d like to be able to reproduce your interaction if possible.
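The three polling behaviors described above can be sketched roughly as follows. This illustrates the rules as stated in the comment and is not TensorBoard's actual code: the function names, the mode strings, and the exact boundary of "not too long ago" are assumptions. A negative threshold is treated as unbounded, matching the --reload_multifile_inactive_secs=-1 case.

```python
def keep_polling(mode, is_last_file, secs_since_last_event, inactive_secs=86400):
    """Decide whether to keep polling an event file after hitting EOF."""
    if mode == "all-files":
        return True
    if mode == "last-file":
        return is_last_file
    if mode == "multifile":
        if inactive_secs < 0:
            return True  # unbounded age threshold: behaves like all-files mode
        return secs_since_last_event <= inactive_secs
    raise ValueError(f"unknown mode: {mode!r}")


def files_kept_open(event_ages, mode, inactive_secs=86400):
    """Count the files of one run directory that stay polled (and so stay open).

    `event_ages` lists, in lexicographical file order, the seconds elapsed
    since the last event of each event file in the directory.
    """
    last_index = len(event_ages) - 1
    return sum(
        keep_polling(mode, index == last_index, age, inactive_secs)
        for index, age in enumerate(event_ages)
    )


if __name__ == "__main__":
    ages = [200000, 100000, 10]  # hypothetical run with three event files
    for mode in ("all-files", "last-file", "multifile"):
        print(mode, files_kept_open(ages, mode))
```

With many event files per run, all-files mode keeps every one of them open, which is consistent with the descriptor counts being much higher under --load_fast=true.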
