No Profile when using capture_tpu_profile

Consider first reaching out to Stack Overflow for support—they have a larger community with better searchability:

https://stackoverflow.com/questions/tagged/tensorboard

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

--- check: autoidentify
INFO: diagnose_tensorboard.py version 393931f9685bd7e0f3898d7dcdf28819fef54c43

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=5, micro=3, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='d-len', release='4.9.0-8-amd64', version='#1 SMP Debian 4.9.130-2 (2018-10-27)', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '/home/len/venv'

--- check: installed_packages
INFO: installed: tensorboard==1.13.1
INFO: installed: tensorflow==1.13.1
INFO: installed: tensorflow-estimator==1.13.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '1.13.1'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '1.13.1'
INFO: tensorflow.__git_version__: "b'v1.13.1-0-g6612da8951'"

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/len/venv/bin/tensorboard\n'

--- check: readable_fqdn
INFO: socket.getfqdn(): 'd-len.us-central1-b.c.lofty-outcome-860.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16877, st_ino=5636621, st_dev=2049, st_nlink=2, st_uid=1017, st_gid=1018, st_size=4096, st_atime=1568125440, st_mtime=1568135743, st_ctime=1568135743)
INFO: mode: 0o40755

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/len/venv/lib/python3.5/site-packages']; bad_roots (0): []

Steps to reproduce (required)

First, I start training the model.

Then I run capture_tpu_profile --tpu=[TPU_NAME] --logdir=[MODEL_DIR] --duration_ms=50000 and get the following output:

TensorFlow version 1.13.1 detected

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Welcome to the Cloud TPU Profiler v1.14.0
Starting to profile TPU traces for 50000 ms. Remaining attempt(s): 3

usage: /home/len/venv/lib/python3.5/site-packages/cloud_tpu_profiler/data/capture_tpu_profile
Flags:
        --service_addr=""                       string  Address of TPU profiler service e.g. localhost:8466
        --workers_list=""                       string  The list of worker TPUs that we are about to profile in the current session.
        --logdir=""                             string  Path of TensorBoard log directory e.g. /tmp/tb_log, gs://tb_bucket
        --duration_ms=0                         int32   Duration of tracing or monitoring in ms. Default is 2000ms for tracing and 1000ms for monitoring.
        --num_tracing_attempts=3                int32   Automatically retry N times when no trace event is collected. Default is 3.
        --include_dataset_ops=true              bool    Set to false to profile longer TPU device traces.
        --monitoring_level=0                    int32   Choose a monitoring level between 1 and 2 to monitor your TPU job continuously. Level 2 is more verbose than level 1 and shows more metrics.
        --timestamp=false                       bool    Set to true to display timestamp in monitoring results.
        --num_queries=100                       int32   This script will run monitoring for num_queries before it stops.

Then I run TensorBoard, passing the same MODEL_DIR to --logdir.

TensorBoard is running, and loss and accuracy are being captured, but the Profile page is empty. I would like to use this tool to check for TPU underutilization, among other things. Any ideas what could be going on?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

qiuminxu commented, Sep 10, 2019 (1 reaction)

Yes, as @stephanwlee said, this is a version problem. From the logs, you have TensorFlow 1.13.1 and capture_tpu_profile 1.14.0. What's your TPU version? Please make sure the TensorFlow version, the capture_tpu_profile version, and the TPU version all match: either 1.13 for all three or 1.14 for all three.
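The mismatch described here can be spotted directly in the capture_tpu_profile output quoted in the issue. Below is a minimal Python sketch that pulls the two version strings out of those log lines and compares their major.minor parts. The helper names (`major_minor`, `versions_from_log`, `mismatched`) are hypothetical and not part of any TensorFlow or profiler API, and the regular expressions assume the log wording shown in this issue, which may differ in other releases.

```python
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '1.13.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_from_log(log):
    """Extract the TensorFlow and profiler versions from profiler output.

    The patterns mirror the two log lines quoted in this issue; they are
    an assumption about the wording, not a documented format.
    """
    patterns = {
        "tensorflow": r"TensorFlow version (\d+(?:\.\d+)+) detected",
        "cloud-tpu-profiler": r"Cloud TPU Profiler v(\d+(?:\.\d+)+)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log)
        if match:
            found[name] = match.group(1)
    return found


def mismatched(versions):
    """True if the given versions do not all share one major.minor pair."""
    return len({major_minor(v) for v in versions.values()}) > 1


# The two relevant lines from the output pasted in the issue.
LOG = """TensorFlow version 1.13.1 detected
Welcome to the Cloud TPU Profiler v1.14.0
"""

versions = versions_from_log(LOG)
print(versions, "mismatch:", mismatched(versions))
```

The TPU's own software version is not printed in this log, so it would have to be added to the dictionary by hand (for example `versions["tpu"] = "1.13"`) before calling `mismatched`.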

ljstrnadiii commented, Sep 10, 2019 (0 reactions)

After running pip uninstall cloud-tpu-profiler and then installing the version that corresponds to my TPU with pip install cloud-tpu-profiler==1.13, everything is up and running.

Thanks for your help, @qiuminxu and @stephanwlee!
