No Profile when using capture_tpu_profile
Consider first reaching out to Stack Overflow for support; they have a larger community with better searchability:
https://stackoverflow.com/questions/tagged/tensorboard
Environment information (required)
Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:
--- check: autoidentify
INFO: diagnose_tensorboard.py version 393931f9685bd7e0f3898d7dcdf28819fef54c43
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=5, micro=3, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='d-len', release='4.9.0-8-amd64', version='#1 SMP Debian 4.9.130-2 (2018-10-27)', machine='x86_64')
INFO: sys.getwindowsversion(): N/A
--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '/home/len/venv'
--- check: installed_packages
INFO: installed: tensorboard==1.13.1
INFO: installed: tensorflow==1.13.1
INFO: installed: tensorflow-estimator==1.13.0
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '1.13.1'
--- check: tensorflow_python_version
INFO: tensorflow.__version__: '1.13.1'
INFO: tensorflow.__git_version__: "b'v1.13.1-0-g6612da8951'"
--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/len/venv/bin/tensorboard\n'
--- check: readable_fqdn
INFO: socket.getfqdn(): 'd-len.us-central1-b.c.lofty-outcome-860.internal'
--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16877, st_ino=5636621, st_dev=2049, st_nlink=2, st_uid=1017, st_gid=1018, st_size=4096, st_atime=1568125440, st_mtime=1568135743, st_ctime=1568135743)
INFO: mode: 0o40755
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/len/venv/lib/python3.5/site-packages']; bad_roots (0): []
Steps to reproduce (required)
First, I start training the model.
Then, I run capture_tpu_profile --tpu=[TPU_NAME] --logdir=[MODEL_DIR] --duration_ms=50000
and I get
TensorFlow version 1.13.1 detected
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
Welcome to the Cloud TPU Profiler v1.14.0
Starting to profile TPU traces for 50000 ms. Remaining attempt(s): 3
usage: /home/len/venv/lib/python3.5/site-packages/cloud_tpu_profiler/data/capture_tpu_profile
Flags:
--service_addr="" string Address of TPU profiler service e.g. localhost:8466
--workers_list="" string The list of worker TPUs that we are about to profile in the current session.
--logdir="" string Path of TensorBoard log directory e.g. /tmp/tb_log, gs://tb_bucket
--duration_ms=0 int32 Duration of tracing or monitoring in ms. Default is 2000ms for tracing and 1000ms for monitoring.
--num_tracing_attempts=3 int32 Automatically retry N times when no trace event is collected. Default is 3.
--include_dataset_ops=true bool Set to false to profile longer TPU device traces.
--monitoring_level=0 int32 Choose a monitoring level between 1 and 2 to monitor your TPU job continuously. Level 2 is more verbose than level 1 and shows more metrics.
--timestamp=false bool Set to true to display timestamp in monitoring results.
--num_queries=100 int32 This script will run monitoring for num_queries before it stops.
Then, I run tensorboard, passing the same MODEL_DIR to --logdir.
TensorBoard is running and loss and accuracy are being captured, but there is nothing on the profile page. I would like to use this tool to check for TPU underutilization, etc. Any ideas what could be going on?
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
Yes, as @stephanwlee said, this is a version problem. From the logs, you have tensorflow 1.13.1 and capture_tpu_profile 1.14.0. What’s your TPU version? Please make sure the tensorflow version, capture_tpu_profile version and the TPU version are the same, 1.13 or 1.14 for all.
After
pip uninstall cloud-tpu-profiler
and installing the version that corresponds to my TPU with
pip install cloud-tpu-profiler==1.13
everything is up and running. Thanks for your help @qiuminxu and @stephanwlee!
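
For anyone hitting the same symptom, here is a minimal sketch of the version check described above, comparing the major.minor part of each version string. The helper names (major_minor, find_mismatches, installed_versions) are made up for illustration and are not part of TensorFlow or cloud-tpu-profiler. It assumes Python 3.8+ for importlib.metadata (the environment in this issue was Python 3.5, where this module is not available), and it only looks at locally installed packages. The TPU's own software version still has to be checked separately.

```python
import re
from importlib import metadata  # assumption: Python 3.8 or newer


def major_minor(version):
    """Return (major, minor) from a version string such as '1.13.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def find_mismatches(versions):
    """Return the names whose major.minor differs from the first entry's."""
    names = list(versions)
    if not names:
        return []
    expected = major_minor(versions[names[0]])
    return [n for n in names[1:] if major_minor(versions[n]) != expected]


def installed_versions(packages=("tensorflow", "cloud-tpu-profiler")):
    """Look up installed distribution versions, skipping missing packages."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed in this environment
    return found


if __name__ == "__main__":
    # Versions copied from the log output in this issue.
    reported = {"tensorflow": "1.13.1", "cloud-tpu-profiler": "1.14.0"}
    print(find_mismatches(reported))
```

To check a live environment instead of the hard-coded values, pass the result of installed_versions() to find_mismatches(). A non-empty result means the profiler and TensorFlow are on different minor versions, which is the situation reported here.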