TensorBoard logs: "Unexpected error: corrupted record at 0"
/kind bug
What steps did you take and what happened:
I am trying to construct a simple MNIST example that uses Keras with TFJob and logs metrics via the TensorBoard callback. However, the TensorFlowEvent collector is unable to pick up the logs.
The workers reach the Completed stage. Calling kubectl -n kubeflow logs pod/tfjob-example-tf-events-xxxxxxx-worker-0 metrics-collector then yields the following:
/tensorboard/train/events.out.tfevents.1574336795.tfjob-example-tf-events-xxxxxxx-worker-0.8.140821.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1574336799.tfjob-example-tf-events-xxxxxxx-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-events-xxxxxxx 0 metrics will be reported.
Below follows the full code and details of what I am doing.
1. model.py
This sits within a Docker image (let's just call it my_images/keras_mnist for simplicity's sake) running TensorFlow 2.0 (it's based on tensorflow/tensorflow:latest-gpu-py3).
It is based on the official tutorial for running Keras in a distributed manner: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras (also available as a Colab notebook: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb#scrollTo=xIY9vKnUU82o).
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

import tensorflow as tf
import argparse

FLAGS = None
BUFFER_SIZE = 10000


class StdOutCallback(tf.keras.callbacks.ProgbarLogger):
    # A simple callback that piggy-backs on the progress bar callback. It prints metrics to StdOut.
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        for k in self.params['metrics']:
            if k in logs:
                print("{}={}".format(k, logs[k]))


def build_and_compile_cnn_model(dropout_rate, lr):
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(rate=dropout_rate),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
        metrics=['accuracy']
    )
    return model


def make_datasets_unbatched(dataset_name):
    # Scale MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name='mnist',
                               with_info=True,
                               as_supervised=True)
    return datasets[dataset_name].map(scale).cache().shuffle(BUFFER_SIZE)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_size_x', type=int, default=10)
    parser.add_argument('--input_size_y', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--dropout_rate', type=float, default=0.4)
    parser.add_argument('--log_dir', type=str, default="./tensorboard/metrics")
    parser.add_argument('--number_of_workers', type=int, default=1)
    args, unparsed = parser.parse_known_args()
    return args, unparsed


def main():
    FLAGS, unparsed = parse_arguments()
    GLOBAL_BATCH_SIZE = 512 * FLAGS.number_of_workers
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=FLAGS.log_dir, update_freq="batch")
    std_out = StdOutCallback()
    with strategy.scope():
        train_datasets = make_datasets_unbatched("train").batch(GLOBAL_BATCH_SIZE)
        multi_worker_model = build_and_compile_cnn_model(FLAGS.dropout_rate, FLAGS.lr)
        multi_worker_model.fit(x=train_datasets,
                               epochs=3,
                               callbacks=[tensorboard, std_out],
                               verbose=0)  # verbose=0 to disable the progress bar


if __name__ == '__main__':
    main()
The above code has two callbacks. The first is the vanilla TensorBoard callback, which writes to FLAGS.log_dir and is meant to be picked up by a collector of kind TensorFlowEvent. The second is a small custom StdOutCallback, used for testing purposes only, which writes metrics to standard out in the format acc=0.71; it is meant to be picked up by a collector of kind StdOut.
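To double-check what the TensorBoard callback actually writes under log_dir before involving Katib at all, something like the following local smoke test can be run. This is only a rough sketch on synthetic data; the tiny model, batch sizes, and ./tensorboard path are placeholders and not the real training setup.

import numpy as np
import tensorflow as tf

# Tiny synthetic run just to inspect which files the TensorBoard callback creates.
x = np.random.rand(64, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(64,))
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
tb = tf.keras.callbacks.TensorBoard(log_dir="./tensorboard", update_freq="batch")
model.fit(x, y, batch_size=32, epochs=1, callbacks=[tb], verbose=0)

# Walk the log directory to see the resulting layout (e.g. a train/ subdirectory
# and any profiler output).
for dirname, _, files in tf.io.gfile.walk("./tensorboard"):
    for f in files:
        print(dirname + "/" + f)

The collector logs at the top of this post show the same kind of layout: a train/ subdirectory with events.out.tfevents.* files plus a plugins/profile/ trace.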
2. YAML files
I use two different YAML files. The first uses a StdOut collector and runs through without any problems. It looks like this:
std_out.yaml:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-std-out
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
The above is only really used for testing (and out of curiosity about how the different kinds of collectors work). What I would prefer is to catch the TensorBoard logs. For that I've set up the following YAML:
tf_events.yaml:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-tf-event
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /tensorboard
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
Running kubectl apply -f tf_events.yaml results in the error shown at the top of this post being logged in the metrics-collector sidecar.
What did you expect to happen:
The metrics-collector's logs seem to suggest that it was able to find the TensorBoard logs and that it will attempt to parse them. I would expect the parsing to work (or at the very least to receive a message explaining why it doesn't).
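To narrow down which of the listed files actually fails, the event files can be copied out of the worker pod and read back locally. The sketch below is only a rough diagnostic, assuming the /tensorboard directory has been copied to ./tensorboard on the local machine (paths are placeholders); records that cannot be read as event records typically surface as tf.errors.DataLossError.

import glob
import tensorflow as tf

# Try to read every file under the copied log directory as a TFRecord-based
# event file and report which ones fail to parse.
for path in sorted(glob.glob("./tensorboard/**/*", recursive=True)):
    if tf.io.gfile.isdir(path):
        continue
    print("checking", path)
    try:
        for event in tf.compat.v1.train.summary_iterator(path):
            for value in event.summary.value:
                print("  step", event.step, "tag", value.tag)
    except tf.errors.DataLossError as err:
        print("  failed to parse:", err)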
Environment:
- TensorFlow version: 2.0
- Kubeflow version: 0.7.0
- Kubernetes version (use kubectl version): 1.12
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-27T14:42:18Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-afa464", GitCommit:"afa464ce9760bd08a53a207f505b133b93366ea3", GitTreeState:"clean", BuildDate:"2019-10-22T21:42:57Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Top GitHub Comments
@eriklincoln I haven’t really had a chance to try it yet. It’s also unlikely that I’ll get to it before January.
This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.