
TensorBoard logs: "Unexpected error: corrupted record at 0"

See original GitHub issue

/kind bug

What steps did you take and what happened:

I am trying to construct a simple MNIST example that uses Keras with TFJob and logs metrics using the TensorBoard callback. However, the TensorFlowEvent collector is unable to pick up the logs.

The workers reach the Completed stage. Calling kubectl -n kubeflow logs pod/tfjob-example-tf-events-xxxxxxx-worker-0 metrics-collector then yields the following:

/tensorboard/train/events.out.tfevents.1574336795.tfjob-example-tf-events-xxxxxxx-worker-0.8.140821.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1574336799.tfjob-example-tf-events-xxxxxxx-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-events-xxxxxxx 0 metrics will be reported.
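
As a quick sanity check (not part of the original report), the same kind of failure can be reproduced outside the collector by pointing a TFRecord reader at the files it lists. Here is a rough sketch, using paths from the log above; the trace/profile files are presumably not TFRecords, so they would be expected to fail:

import tensorflow as tf

# Hedged diagnostic sketch: iterate each file the collector found as a TFRecord
# stream. Real event files yield records; non-TFRecord files such as the
# .trace file fail with a DataLossError like "corrupted record at 0".
paths = [
    "/tensorboard/train/events.out.tfevents.1574336795.tfjob-example-tf-events-xxxxxxx-worker-0.8.140821.v2",
    "/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace",
]

for path in paths:
    try:
        count = sum(1 for _ in tf.data.TFRecordDataset([path]))
        print("{}: {} records".format(path, count))
    except tf.errors.DataLossError as err:
        print("{}: {}".format(path, err))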

The full code and details of what I am doing follow below.

1. model.py

This sits within a Docker image (let’s just call it my_images/keras_mnist for simplicity’s sake) running TensorFlow 2.0 (it is based on tensorflow/tensorflow:latest-gpu-py3).

It is based on the official tutorial for running Keras in a distributed manner, found at https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras (notebook version: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb#scrollTo=xIY9vKnUU82o).
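
For background (this is context, not code from the issue): MultiWorkerMirroredStrategy configures itself from the TF_CONFIG environment variable, which the TFJob operator injects into each worker pod. A minimal, illustrative sketch of that variable for a single worker (host name and port are made up):

import json
import os

# Illustrative only: TFJob normally sets this for each replica; shown here to
# make explicit what MultiWorkerMirroredStrategy reads at start-up.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["tfjob-example-tf-events-worker-0:2222"]},
    "task": {"type": "worker", "index": 0},
})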

import tensorflow_datasets as tfds
tfds.disable_progress_bar()
import tensorflow as tf
import argparse


FLAGS = None

BUFFER_SIZE = 10000

class StdOutCallback(tf.keras.callbacks.ProgbarLogger):
    # A simple callback that piggybacks on the progress bar callback. It prints metrics to stdout as name=value lines.
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        for k in self.params['metrics']:
            if k in logs:
                print("{}={}".format(k,logs[k]))

def build_and_compile_cnn_model(dropout_rate, lr):
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dropout(rate=dropout_rate),
      tf.keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
      metrics=['accuracy']
  )
  return model

def make_datasets_unbatched(dataset_name):
  # Scaling MNIST data from (0, 255] to (0., 1.]
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  datasets, info = tfds.load(name='mnist',
                            with_info=True,
                            as_supervised=True)

  return datasets[dataset_name].map(scale).cache().shuffle(BUFFER_SIZE)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_size_x', type=int, default=10)
    parser.add_argument('--input_size_y', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--dropout_rate', type=float, default=0.4)
    parser.add_argument('--log_dir', type=str, default="./tensorboard/metrics")
    parser.add_argument('--number_of_workers', type=int, default=1)
    args, unparsed = parser.parse_known_args()

    return args, unparsed

def main():
    FLAGS, unparsed = parse_arguments()
    GLOBAL_BATCH_SIZE = 512 * FLAGS.number_of_workers
    
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=FLAGS.log_dir, update_freq="batch")      
    std_out = StdOutCallback()

    with strategy.scope():
        train_datasets = make_datasets_unbatched("train").batch(GLOBAL_BATCH_SIZE)
        multi_worker_model = build_and_compile_cnn_model(FLAGS.dropout_rate, FLAGS.lr)
        
    multi_worker_model.fit(x=train_datasets,
                           epochs=3,
                           callbacks=[tensorboard, std_out],
                           verbose=0  # to disable progress bar
                           )

if __name__ == '__main__':
    main()

The above code uses two callbacks. The first is the vanilla TensorBoard callback, which writes to FLAGS.log_dir and is meant to be picked up by a collector of kind TensorFlowEvent. The second is a small custom StdOutCallback, used for testing purposes only, which writes metrics to standard out in the format acc=0.71; it is meant to be picked up by a collector of kind StdOut.
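
For reference, here is a minimal, Keras-independent sketch (with an illustrative path) of the kind of event files the TensorBoard callback writes under FLAGS.log_dir, and of reading them back as ordinary TFRecords; files like these are what a TensorFlowEvent collector would be expected to consume:

import glob

import tensorflow as tf

# Write a few scalar events the same way the TensorBoard callback does.
# The output directory is illustrative, not the one used in the experiment.
writer = tf.summary.create_file_writer("/tmp/tensorboard-sketch/train")
with writer.as_default():
    for step in range(3):
        tf.summary.scalar("loss", 1.0 / (step + 1), step=step)
writer.flush()

# Read them back: event files are plain TFRecords of Event protos.
for event_file in glob.glob("/tmp/tensorboard-sketch/train/events.out.tfevents.*"):
    for raw in tf.data.TFRecordDataset([event_file]):
        event = tf.compat.v1.Event.FromString(raw.numpy())
        print(event.step, [v.tag for v in event.summary.value])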

2. YAML files

I use two different YAML files. The first uses a StdOut collector and runs through without any problems. It looks like this:

std_out.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-std-out
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
           tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

The above is really only used for testing (and out of curiosity about how the different kinds of collectors work). What I would prefer is to capture the TensorBoard logs. For that I’ve set up the following YAML:

tf_events.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-tf-event
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /tensorboard
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
           tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

Running kubectl apply -f tf_events.yaml results in the error shown at the top of this post being logged by the metrics-collector sidecar.

What did you expect to happen:

The metrics-collector’s logs suggest that it was able to find the TensorBoard logs and that it would attempt to parse them. I would expect the parsing to work (or, at the very least, to receive a message explaining why it doesn’t).
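
To illustrate the kind of behaviour I would expect, here is a rough sketch (illustrative path, not the collector’s actual code) of best-effort parsing that reports where the stream breaks instead of giving up entirely:

import tensorflow as tf

# Rough sketch, not the collector's actual implementation: collect whatever
# events parse, and report the DataLossError together with the offending file.
def read_events_best_effort(path):
    events = []
    try:
        for raw in tf.data.TFRecordDataset([path]):
            events.append(tf.compat.v1.Event.FromString(raw.numpy()))
    except tf.errors.DataLossError as err:
        print("stopped parsing {}: {}".format(path, err))
    return events

events = read_events_best_effort("/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace")
print("{} events parsed".format(len(events)))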

Environment:

  • TensorFlow version: 2.0
  • Kubeflow version: 0.7.0
  • Kubernetes version (kubectl version): 1.12
    Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-27T14:42:18Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-afa464", GitCommit:"afa464ce9760bd08a53a207f505b133b93366ea3", GitTreeState:"clean", BuildDate:"2019-10-22T21:42:57Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
karlschriek commented, Dec 4, 2019

@eriklincoln I haven’t really had a chance to try it yet. It’s also unlikely that I’ll get to it before January.

0 reactions
stale[bot] commented, Dec 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.

Read more comments on GitHub >

Top Results From Across the Web

  • DataLossError: corrupted record at 0 when using TFRecords...
    My dataset does contain “background” images, where no objects to be detected are shown.
  • How to fix truncated tfrecords for tensorflow? - Stack Overflow
    The message means what it says --- the TFRecord file seems to end unexpectedly part way through a record. If you want to...
  • Troubleshoot Dataflow errors - Google Cloud
    Some of these errors are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.
  • Tensorboard quick start in 5 minutes. - Anthony Sarkis - Medium
    Start Tensorboard server (< 1 min). Open a terminal window in your root project directory. Run: tensorboard --logdir logs/1. Go to the URL...
  • TensorFlow troubleshooting: DataLossError: corrupted record at XXX
    logging.info("skip data loss error!"): skip the TFRecord records that fail to read in TensorFlow.
