Fails to show new data when event files are replaced (e.g. w/ rsync)
Problem summary
We have a problem: TensorBoard (1.2.1) does not show new run data since we switched to syncing the data from Google Cloud Storage to a local directory.
It looks like when TensorBoard reads an event file from a local directory, it does not notice that the event file was deleted and recreated (which is quite a valid case when you are using gsutil rsync to sync the data from Google Cloud Storage). I have a short-term workaround (to be tested), but I think this needs to be fixed permanently in TensorBoard - it’s quite a valid case to handle, especially given another bug, #158.
Some context and problem manifestation
Initially we stored data from our ML runs in Google Cloud Storage and pointed our TensorBoard to read the data directly from there (logdir=gs://BUCKET/FOLDER). We run TensorBoard in a Kubernetes cluster and made it available to all our team members via an OAuth proxy. This is great as we can always see fresh data and any member of the team can access it there. We wanted the data to refresh quite frequently - every few minutes. This worked very well, but we found that there is a cost downside to it.
Due to #158, TensorBoard generated a lot of GCS accesses just scanning the bucket (several million accesses per day). This was not only slow but also incurred a relatively high GCS cost (20 USD over 4 days!).
Then we switched to another approach: we sync the data from GCS to a local folder using `gsutil -m rsync -d gs://BUCKET/FOLDER LOCAL_FOLDER`, and TensorBoard points to that locally downloaded folder instead of pointing to `gs://`.
Additional information: we keep the whole tree structure of runs within that logdir directory, so that we can see and compare all the different runs we’ve already produced.
Unfortunately, after switching to rsync we have a big problem. New runs (in new subfolders) show up among the others once the runs are executed, but they never show any data until TensorBoard is restarted. It looks like TensorBoard picks up some initially synced partial results from the run - not enough to show any data - and then somehow stops reading any new data for that folder.
When I start an exact copy of TensorBoard on the same machine, it shows the data for those new runs properly - you can see the screenshot below showing the two TensorBoards side by side. The TensorBoard on the left is freshly started; the one on the right has been running since before the run was started.
Result of investigation
I read in detail how gsutil rsync synchronizes the data and I guessed what might be wrong. I confirmed it by looking at the lsof output for the long-running TensorBoard instance - and it becomes quite obvious (look at the “deleted” entries):
tensorboa 191 root 30r REG 8,1 1085346 8881152 /tensorboard-data/dexnet2/convnet/20171108_kcjniy6q/logs/events.out.tfevents.1502483404.sylvester (deleted)
tensorboa 191 root 31r REG 8,1 1148462 8880988 /tensorboard-data/dexnet2/convnet/20171108_b2ere44n/logs/events.out.tfevents.1502483298.arnold (deleted)
tensorboa 191 root 32r REG 8,1 715534 8881201 /tensorboard-data/dexnet2/convnet/20171108_silfxox7/logs/events.out.tfevents.1502402857.sylvester
tensorboa 191 root 33r REG 8,1 930378 8880956 /tensorboard-data/dexnet2/convnet/20171108_61rgid3z/logs/events.out.tfevents.1502403170.sylvester
tensorboa 191 root 34r REG 8,1 934806 8881119 /tensorboard-data/dexnet2/convnet/20171108_k9en6v98/logs/events.out.tfevents.1502403287.sylvester
tensorboa 191 root 35r REG 8,1 1744410 8881036 /tensorboard-data/dexnet2/convnet/20171108_g2unt5hi/logs/events.out.tfevents.1502444237.sylvester (deleted)
tensorboa 191 root 36r REG 8,1 5844954 8881430 /tensorboard-data/dexnet2/resnet/logs_9bli9id2/logs/events.out.tfevents.1501736695.arnold
gsutil rsync works in such a way that it always deletes and recreates files when they change - it never appends to them. This is stated in its manual, and it follows from the fact that GCS files are in fact immutable objects, not files, so there is no possibility of transferring them partially - it’s always all-or-nothing.
On the other hand, what TensorBoard does is quite reasonable: when logdir is set to a local folder, it opens the events* file and expects that the TensorFlow-based training will keep appending data to it, rather than delete and recreate the file when new data appears. That’s a completely reasonable assumption - and I guess it behaves quite differently when it accesses GCS directly (for the same reasons gsutil rsync only synchronizes whole files).
You can see in the lsof output that we have a problem with 20171108_b2ere44n and 20171108_kcjniy6q - in both cases the file has already been deleted, but TensorBoard still has it open (which is quite normal behaviour on linux/ext4 - the file/inode is only physically removed after the last process holding it open closes it).
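To make the mechanism concrete, here is a minimal Python demo of that Linux behaviour (the file name is made up for the demo; this is not TensorBoard code). A reader that opened the file before the deletion keeps seeing the old inode’s contents, exactly like the long-running TensorBoard instance above:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.out.tfevents.demo")

with open(path, "w") as f:
    f.write("old run data\n")

reader = open(path, "r")    # simulate TensorBoard's long-lived file handle

os.remove(path)             # simulate `gsutil rsync -d` deleting the file
with open(path, "w") as f:  # ...and recreating it with new content
    f.write("new run data\n")

print(reader.read())        # still prints "old run data" - the reader is bound
reader.close()              # to the deleted inode until the handle is closed
```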
Potential solutions
I see several potential solutions to the problem, and it would be great if someone from the TensorBoard team could confirm whether the long-term solution I propose is sound.
- Short-term workaround: I will set up double syncing: GCS -> `gsutil rsync` -> Linux `rsync --inplace` -> logdir. I tested that I can make the standard Linux rsync replace the data in place instead of deleting/recreating the files (the `--inplace` switch). I also considered and tested the behaviour of `--append` and `--append-verify`, but I think `--inplace` is the safer bet (`--append-verify` silently refuses to sync a file whose new content does not start with the old file’s content). The process will first sync the data to an intermediate folder with `gsutil rsync` and then sync it to the final destination with Linux’s `rsync --inplace` (a rough sketch is shown after this list). The obvious downsides of this workaround are that it makes our process more complex and error-prone, and that we have to keep two copies of our whole log directory in our Kubernetes pod. That’s quite bad long-term once we have a lot more data to look at.
- Proper fix: I think TensorBoard should be smarter when reading local data: check whether the file has been deleted since it was last read, and close/reopen/reread it when it has. You could use fstat for that (see the second answer here: https://stackoverflow.com/questions/12690281/check-if-an-open-file-has-been-deleted-after-open-in-python); a sketch of such a check also follows this list. Can this be done in the next version, please? Or is there perhaps already a possible solution for the current TensorBoard (1.2.1)?
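A rough sketch of the double-sync workaround in Python (untested; it assumes `gsutil` and `rsync` are installed, and the bucket and directory paths below are placeholders for our real ones):

```python
import subprocess

BUCKET = "gs://BUCKET/FOLDER"            # placeholder source bucket
STAGING_DIR = "/tensorboard-staging"     # intermediate copy written by gsutil
LOGDIR = "/tensorboard-data"             # folder TensorBoard actually reads

# Step 1: mirror GCS into the staging folder. gsutil deletes and recreates
# changed files here, but TensorBoard never reads this folder directly.
# (-r recurses into the run subfolders, -d deletes local files removed in GCS.)
subprocess.run(
    ["gsutil", "-m", "rsync", "-r", "-d", BUCKET, STAGING_DIR],
    check=True,
)

# Step 2: copy staging into the logdir. --inplace rewrites changed files
# without replacing the inode, so TensorBoard's open file handles stay valid;
# --delete mirrors removals the same way gsutil's -d does.
subprocess.run(
    ["rsync", "-a", "--inplace", "--delete", STAGING_DIR + "/", LOGDIR + "/"],
    check=True,
)
```

And the kind of check I have in mind for the proper fix - just a sketch, not TensorBoard’s actual reader code; `file_was_replaced`, `event_file` and `event_path` are names made up for illustration:

```python
import os

def file_was_replaced(fileobj, path):
    """Return True if `path` no longer refers to the file `fileobj` has open,
    i.e. the file was deleted or deleted-and-recreated since it was opened."""
    fd_stat = os.fstat(fileobj.fileno())
    if fd_stat.st_nlink == 0:
        # No directory entry points at this inode any more: the file was deleted.
        return True
    try:
        path_stat = os.stat(path)
    except FileNotFoundError:
        return True
    # Same device and inode means it is still the same underlying file.
    return (fd_stat.st_dev, fd_stat.st_ino) != (path_stat.st_dev, path_stat.st_ino)

# A reader loop could then do something like:
#   if file_was_replaced(event_file, event_path):
#       event_file.close()
#       event_file = open(event_path, "rb")   # reopen and reread from the start
```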
Top GitHub Comments
I’m re-opening this issue since we’ve gotten a few other reports and they don’t always involve GCS - this applies to usage of regular rsync and other syncing programs as well if they update files via replacement rather than by appending. We may or may not fix this but it’s fair to track as an open issue.
I’m using 1.11.0 and I have this exact issue when using `rsync` to download data from a GPU cluster. I currently restart `tensorboard` every time I use `rsync` to get the updates.
Seems one could use suggestions from here to watch for updates in folders from `--logdir` (a sketch of that idea is below). I’d be happy to contribute if there’s a fix that the devs agree upon.
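Something along these lines, using the third-party `watchdog` package - just a sketch of the idea, not an agreed-upon fix; the logdir path and the handler are made up for illustration:

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

LOGDIR = "/tensorboard-data"  # whatever was passed to --logdir

class EventFileReplacedHandler(FileSystemEventHandler):
    def on_deleted(self, event):
        # An rsync-style replacement starts with a delete; a real fix would
        # tell the corresponding reader to close and reopen the file here.
        if "tfevents" in event.src_path:
            print("event file deleted/replaced:", event.src_path)

observer = Observer()
observer.schedule(EventFileReplacedHandler(), LOGDIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(5)
except KeyboardInterrupt:
    pass
finally:
    observer.stop()
    observer.join()
```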