No Profile when using capture_tpu_profile

Consider first reaching out to Stack Overflow for support—they have a larger community with better searchability:

https://stackoverflow.com/questions/tagged/tensorboard

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

--- check: autoidentify
INFO: diagnose_tensorboard.py version 393931f9685bd7e0f3898d7dcdf28819fef54c43

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=5, micro=3, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='d-len', release='4.9.0-8-amd64', version='#1 SMP Debian 4.9.130-2 (2018-10-27)', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '/home/len/venv'

--- check: installed_packages
INFO: installed: tensorboard==1.13.1
INFO: installed: tensorflow==1.13.1
INFO: installed: tensorflow-estimator==1.13.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '1.13.1'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '1.13.1'
INFO: tensorflow.__git_version__: "b'v1.13.1-0-g6612da8951'"

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/len/venv/bin/tensorboard\n'

--- check: readable_fqdn
INFO: socket.getfqdn(): 'd-len.us-central1-b.c.lofty-outcome-860.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16877, st_ino=5636621, st_dev=2049, st_nlink=2, st_uid=1017, st_gid=1018, st_size=4096, st_atime=1568125440, st_mtime=1568135743, st_ctime=1568135743)
INFO: mode: 0o40755

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/len/venv/lib/python3.5/site-packages']; bad_roots (0): []

Steps to reproduce (required)

First, I start training the model.

Then I run capture_tpu_profile --tpu=[TPU_NAME] --logdir=[MODEL_DIR] --duration_ms=50000 and get the following output:

TensorFlow version 1.13.1 detected

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Welcome to the Cloud TPU Profiler v1.14.0
Starting to profile TPU traces for 50000 ms. Remaining attempt(s): 3

usage: /home/len/venv/lib/python3.5/site-packages/cloud_tpu_profiler/data/capture_tpu_profile
Flags:
        --service_addr=""                       string  Address of TPU profiler service e.g. localhost:8466
        --workers_list=""                       string  The list of worker TPUs that we are about to profile in the current session.
        --logdir=""                             string  Path of TensorBoard log directory e.g. /tmp/tb_log, gs://tb_bucket
        --duration_ms=0                         int32   Duration of tracing or monitoring in ms. Default is 2000ms for tracing and 1000ms for monitoring.
        --num_tracing_attempts=3                int32   Automatically retry N times when no trace event is collected. Default is 3.
        --include_dataset_ops=true              bool    Set to false to profile longer TPU device traces.
        --monitoring_level=0                    int32   Choose a monitoring level between 1 and 2 to monitor your TPU job continuously. Level 2 is more verbose than level 1 and shows more metrics.
        --timestamp=false                       bool    Set to true to display timestamp in monitoring results.
        --num_queries=100                       int32   This script will run monitoring for num_queries before it stops.

Then I run tensorboard, passing the same MODEL_DIR to --logdir.

TensorBoard is running and loss and accuracy are being captured, but nothing appears on the profile page. I would like to use this tool to check for TPU underutilization, among other things. Any ideas what could be going on?
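One quick way to narrow this down is to check whether the capture wrote any profile data at all. The sketch below assumes the TensorBoard profile plugin looks for runs under a plugins/profile subdirectory of the log directory, and it only handles a local path (a gs:// MODEL_DIR would need a GCS-aware file API instead). The function name list_profile_runs is made up for illustration.

```python
import os


def list_profile_runs(logdir):
    """Return the names of profile runs found under <logdir>/plugins/profile.

    Assumption: the profile plugin stores each captured run in its own
    subdirectory of <logdir>/plugins/profile. Local paths only.
    """
    profile_root = os.path.join(logdir, "plugins", "profile")
    if not os.path.isdir(profile_root):
        # No profile directory at all: the capture never wrote anything.
        return []
    return sorted(
        entry
        for entry in os.listdir(profile_root)
        if os.path.isdir(os.path.join(profile_root, entry))
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_profile.py /path/to/MODEL_DIR
    runs = list_profile_runs(sys.argv[1]) if len(sys.argv) > 1 else []
    print("profile runs found:", runs)
```

An empty list here would be consistent with the output above, where capture_tpu_profile printed its usage text instead of reporting a captured trace.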

Issue Analytics

Created 4 years ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

qiuminxu commented, Sep 10, 2019

Yes, as @stephanwlee said, this is a version mismatch. According to the logs, you have tensorflow 1.13.1 but capture_tpu_profile 1.14.0. What’s your TPU version? Please make sure that the tensorflow version, the capture_tpu_profile version, and the TPU version all match: either 1.13 for all three or 1.14 for all three.
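The comparison described here can be sketched in a few lines of Python. This is only an illustration: the helper names major_minor and find_mismatches are invented, the version strings are the ones reported in the logs above, and the TPU's own version is left out because it was not stated in the issue.

```python
import re


def major_minor(version):
    """Return (major, minor) as ints from a version string such as '1.13.1'."""
    match = re.match(r"\s*v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


def find_mismatches(versions):
    """Group component names by (major, minor) release.

    Returns an empty dict when every component is on the same release,
    otherwise a dict mapping each release to the components on it.
    """
    by_release = {}
    for name, version in versions.items():
        by_release.setdefault(major_minor(version), []).append(name)
    return by_release if len(by_release) > 1 else {}


if __name__ == "__main__":
    # Versions taken from the output pasted in the issue.
    reported = {"tensorflow": "1.13.1", "cloud-tpu-profiler": "1.14.0"}
    for release, names in sorted(find_mismatches(reported).items()):
        print("%d.%d: %s" % (release[0], release[1], ", ".join(names)))
```

Only the major and minor numbers are compared, following the "1.13 or 1.14 for all" wording; patch-level differences are ignored.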

ljstrnadiii commented, Sep 10, 2019

After running pip uninstall cloud-tpu-profiler and installing the version that matches my TPU with pip install cloud-tpu-profiler==1.13, everything is up and running.

Thanks for your help, @qiuminxu and @stephanwlee!
