TensorBoard logs: "Unexpected error: corrupted record at 0"
/kind bug
What steps did you take and what happened:
I am trying to construct a simple MNIST example that uses Keras with TFJob and logs metrics via the TensorBoard callback. However, the TensorFlowEvent collector is unable to pick up the logs.
The workers reach the Completed stage. Calling kubectl -n kubeflow logs pod/tfjob-example-tf-events-xxxxxxx-worker-0 metrics-collector then yields the following:
/tensorboard/train/events.out.tfevents.1574336795.tfjob-example-tf-events-xxxxxxx-worker-0.8.140821.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1574336799.tfjob-example-tf-events-xxxxxxx-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-events-xxxxxxx 0 metrics will be reported.
Below follows the full code and details of what I am doing.
1. model.py
This sits within a Docker image (let's just call it my_images/keras_mnist for simplicity's sake) running TensorFlow 2.0 (it's based on tensorflow/tensorflow:latest-gpu-py3).
It is based on the official tutorial for running Keras in a distributed manner: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras (also available as a Colab notebook: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb#scrollTo=xIY9vKnUU82o).
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

import tensorflow as tf
import argparse

FLAGS = None
BUFFER_SIZE = 10000


class StdOutCallback(tf.keras.callbacks.ProgbarLogger):
    # A simple callback that piggy-backs on the progress bar callback. It prints metrics to StdOut.
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        for k in self.params['metrics']:
            if k in logs:
                print("{}={}".format(k, logs[k]))


def build_and_compile_cnn_model(dropout_rate, lr):
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(rate=dropout_rate),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
        metrics=['accuracy']
    )
    return model


def make_datasets_unbatched(dataset_name):
    # Scale MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name='mnist',
                               with_info=True,
                               as_supervised=True)
    return datasets[dataset_name].map(scale).cache().shuffle(BUFFER_SIZE)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_size_x', type=int, default=10)
    parser.add_argument('--input_size_y', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--dropout_rate', type=float, default=0.4)
    parser.add_argument('--log_dir', type=str, default="./tensorboard/metrics")
    parser.add_argument('--number_of_workers', type=int, default=1)
    args, unparsed = parser.parse_known_args()
    return args, unparsed


def main():
    FLAGS, unparsed = parse_arguments()
    GLOBAL_BATCH_SIZE = 512 * FLAGS.number_of_workers
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=FLAGS.log_dir, update_freq="batch")
    std_out = StdOutCallback()
    with strategy.scope():
        train_datasets = make_datasets_unbatched("train").batch(GLOBAL_BATCH_SIZE)
        multi_worker_model = build_and_compile_cnn_model(FLAGS.dropout_rate, FLAGS.lr)
        multi_worker_model.fit(x=train_datasets,
                               epochs=3,
                               callbacks=[tensorboard, std_out],
                               verbose=0)  # verbose=0 to disable the progress bar


if __name__ == '__main__':
    main()
The above code has two callbacks. The first is the vanilla TensorBoard callback, which writes to FLAGS.log_dir and is meant to be picked up by a collector of kind TensorFlowEvent. The second is a small custom StdOutCallback, used for testing purposes only, which writes metrics to standard out in the format acc=0.71; it is meant to be picked up by a collector of kind StdOut.
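To double-check what the TensorBoard callback actually writes under log_dir before involving Katib at all, something like the following local smoke test can be run. This is only a rough sketch on synthetic data; the tiny model, batch sizes, and ./tensorboard path are placeholders and not the real training setup.

import numpy as np
import tensorflow as tf

# Tiny synthetic run just to inspect which files the TensorBoard callback creates.
x = np.random.rand(64, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(64,))
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
tb = tf.keras.callbacks.TensorBoard(log_dir="./tensorboard", update_freq="batch")
model.fit(x, y, batch_size=32, epochs=1, callbacks=[tb], verbose=0)

# Walk the log directory to see the resulting layout (e.g. a train/ subdirectory
# and any profiler output).
for dirname, _, files in tf.io.gfile.walk("./tensorboard"):
    for f in files:
        print(dirname + "/" + f)

The collector logs at the top of this post show the same kind of layout: a train/ subdirectory with events.out.tfevents.* files plus a plugins/profile/ trace.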
2. YAML files
I use two different YAML files. The first uses a StdOut collector and runs through without any problems. It looks like this:
std_out.yaml:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-std-out
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
The above is only really used for testing (and out of curiosity about how the different kinds of collectors work). What I would prefer is to catch the TensorBoard logs. For that I've set up the following YAML:
tf_events.yaml:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-tf-event
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /tensorboard
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
Running kubectl apply -f tf_events.yaml results in the error shown at the top of this post being logged in the metrics-collector sidecar.
What did you expect to happen:
The metrics-collector's logs seem to suggest that it was able to find the TensorBoard logs and that it will attempt to parse them. I would expect the parsing to work (or at the very least to receive a message explaining why it doesn't).
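To narrow down which of the listed files actually fails, the event files can be copied out of the worker pod and read back locally. The sketch below is only a rough diagnostic, assuming the /tensorboard directory has been copied to ./tensorboard on the local machine (paths are placeholders); records that cannot be read as event records typically surface as tf.errors.DataLossError.

import glob
import tensorflow as tf

# Try to read every file under the copied log directory as a TFRecord-based
# event file and report which ones fail to parse.
for path in sorted(glob.glob("./tensorboard/**/*", recursive=True)):
    if tf.io.gfile.isdir(path):
        continue
    print("checking", path)
    try:
        for event in tf.compat.v1.train.summary_iterator(path):
            for value in event.summary.value:
                print("  step", event.step, "tag", value.tag)
    except tf.errors.DataLossError as err:
        print("  failed to parse:", err)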
Environment:
- TensorFlow version: 2.0
- Kubeflow version: 0.7.0
- Kubernetes version (use kubectl version): 1.12
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-27T14:42:18Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-afa464", GitCommit:"afa464ce9760bd08a53a207f505b133b93366ea3", GitTreeState:"clean", BuildDate:"2019-10-22T21:42:57Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Top GitHub Comments
@eriklincoln I haven’t really had a chance to try it yet. It’s also unlikely that I’ll get to it before January.
This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.