ParameterServerStrategy stuck in "Feeding partition"
Environment:
- Python version [2.7]
- Spark version [2.3.2]
- TensorFlow version [1.13.1]
- TensorFlowOnSpark version [1.4.3]
- Cluster version [Hadoop 3.x]
Describe the bug: Using the provided examples, I have been attempting to port one of our existing spark-ml jobs to TensorFlowOnSpark. These jobs all run in a dedicated YARN cluster (CPU-only).
As part of my POC, I am attempting to create a 2-worker, 1-ps, 1-master TensorFlow cluster that trains a simple Keras model (converted to a TF estimator) using ParameterServerStrategy as the distribution strategy.
When I start this up, the cluster establishes itself, but the master gets stuck trying to feed the input queue. I have run the same example in non-distributed mode (multiple instances doing the same thing independently), and it worked fine.
Thanks in advance for any help.
Here is the relevant snippet of the code being run:
```python
model = Sequential()
model.add(Dense(64, input_dim=num_features, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=tf.train.AdamOptimizer(), metrics=['accuracy'])
model.summary()

distribution_strategy = tf.contrib.distribute.ParameterServerStrategy()
config = tf.estimator.RunConfig(
    train_distribute=distribution_strategy, eval_distribute=distribution_strategy)
estimator = tf.keras.estimator.model_to_estimator(model, model_dir=model_dir, config=config)

def generate_rdd_data(tf_feed):
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(1)
        if len(batch) > 0:
            record = batch[0]
            features = numpy.array(record[0]).astype(numpy.float32)
            label = numpy.array([record[1]]).astype(numpy.float32)
            yield (features, label)
        else:
            return

def train_input_fn():
    # wrap the TFoS data feed in a generator for tf.data
    ds = tf.data.Dataset.from_generator(lambda: generate_rdd_data(tf_feed),
                                        (tf.float32, tf.float32),
                                        (tf.TensorShape([num_features]), tf.TensorShape([1])))
    ds = ds.batch(args.batch_size)
    return ds

# add a hook to terminate the RDD data feed when the session ends
hooks = [StopFeedHook(tf_feed)]

# train model
estimator.train(input_fn=train_input_fn, max_steps=steps_per_epoch, hooks=hooks)
```
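For reference, `StopFeedHook` here is not a library class; it follows the pattern used in the TensorFlowOnSpark example code. A minimal sketch of what it does (assuming `tf_feed` is the `TFNode.DataFeed` handed to the Spark map function) looks roughly like:

```python
import tensorflow as tf

class StopFeedHook(tf.train.SessionRunHook):
    """Terminate the Spark RDD feed when the training session ends."""

    def __init__(self, tf_feed):
        self.tf_feed = tf_feed

    def end(self, session):
        # Tell TFoS that no more batches will be consumed, then drain one
        # batch so the Spark task feeding the queue can unblock and finish.
        self.tf_feed.terminate()
        self.tf_feed.next_batch(1)
```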
Logs:
Master Logs:
```
2019-05-23 18:49:54,434 INFO (MainThread-53285) 1: ======== master:0 ========
2019-05-23 18:49:54,434 INFO (MainThread-53285) 1: Cluster spec: {'worker': ['10.90.28.232:34493', '10.90.28.252:38762'], 'ps': ['10.90.28.222:45450'], 'master': ['10.90.28.230:42065']}
2019-05-23 18:49:54,435 INFO (MainThread-53285) 1: Using CPU
19/05/23 18:49:54 INFO TorrentBroadcast: Started reading broadcast variable 110
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110_piece0 stored as bytes in memory (estimated size 658.9 KB, free 1970.6 MB)
19/05/23 18:49:54 INFO TorrentBroadcast: Reading broadcast variable 110 took 28 ms
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110 stored as values in memory (estimated size 1434.9 KB, free 1969.2 MB)
2019-05-23 18:49:54.479917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300070000 Hz
2019-05-23 18:49:54.481187: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x44295a0 executing computations on platform Host. Devices:
2019-05-23 18:49:54.481225: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-05-23 18:49:54.483937: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job master -> {0 -> localhost:42065}
2019-05-23 18:49:54.483963: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> 10.90.28.222:45450}
2019-05-23 18:49:54.483990: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> 10.90.28.232:34493, 1 -> 10.90.28.252:38762}
2019-05-23 18:49:54.486650: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:42065
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_3 locally
19/05/23 18:49:54 INFO CodeGenerator: Code generated in 21.755654 ms
19/05/23 18:49:54 INFO Executor: Finished task 3.0 in stage 73.0 (TID 777). 2805 bytes result sent to driver
19/05/23 18:49:54 INFO CoarseGrainedExecutorBackend: Got assigned task 780
19/05/23 18:49:54 INFO Executor: Running task 6.0 in stage 73.0 (TID 780)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_6 locally
19/05/23 18:49:55 INFO Executor: Finished task 6.0 in stage 73.0 (TID 780). 2762 bytes result sent to driver
19/05/23 18:49:55 INFO CoarseGrainedExecutorBackend: Got assigned task 782
19/05/23 18:49:55 INFO Executor: Running task 10.0 in stage 73.0 (TID 782)
19/05/23 18:49:55 INFO BlockManager: Found block rdd_412_10 locally
19/05/23 18:49:55 INFO Executor: Finished task 10.0 in stage 73.0 (TID 782). 2762 bytes result sent to driver
19/05/23 18:50:01 INFO CoarseGrainedExecutorBackend: Got assigned task 784
19/05/23 18:50:01 INFO Executor: Running task 1.0 in stage 73.0 (TID 784)
19/05/23 18:50:01 INFO BlockManager: Found block rdd_412_1 remotely
19/05/23 18:50:01 INFO Executor: Finished task 1.0 in stage 73.0 (TID 784). 2762 bytes result sent to driver
19/05/23 18:50:02 INFO CoarseGrainedExecutorBackend: Got assigned task 787
19/05/23 18:50:02 INFO Executor: Running task 0.0 in stage 74.0 (TID 787)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Updating epoch to 37 and clearing cache
19/05/23 18:50:02 INFO TorrentBroadcast: Started reading broadcast variable 111
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111_piece0 stored as bytes in memory (estimated size 10.5 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO TorrentBroadcast: Reading broadcast variable 111 took 5 ms
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111 stored as values in memory (estimated size 23.4 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 36, fetching them
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@srv-01-11-b09.iad1.trmr.io:33383)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Got the output locations
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Started 2 remote fetches in 1 ms
2019-05-23 18:50:02,187 INFO (MainThread-52742) Connected to TFSparkNode.mgr on 10.90.28.230, executor=1, state='running'
2019-05-23 18:50:02,194 INFO (MainThread-52742) mgr.state='running'
2019-05-23 18:50:02,194 INFO (MainThread-52742) Feeding partition <itertools.chain object at 0x7f97ea46c2d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f97dae9d2d0>
19/05/23 19:00:03 ERROR Executor: Exception in task 0.0 in stage 74.0 (TID 787)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/pipeline/appcache/application_1557769783296_1143/container_e21_1557769783296_1143_01_000005/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/hadoop/yarn/local/usercache/pipeline/appcache/application_1557769783296_1143/container_e21_1557769783296_1143_01_000005/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 350, in func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 799, in func
  File "/usr/lib/python2.7/site-packages/tensorflowonspark/TFSparkNode.py", line 420, in _train
    raise Exception("Timeout while feeding partition")
Exception: Timeout while feeding partition
```
Worker Logs:
```
2019-05-23 18:49:53,977 INFO (MainThread-118610) Starting TensorFlow worker:0 as worker on cluster node 2 on background process
19/05/23 18:49:53 INFO PythonRunner: Times: total = 7808, boot = -36392, init = 43165, finish = 1035
19/05/23 18:49:53 INFO Executor: Finished task 2.0 in stage 72.0 (TID 773). 1418 bytes result sent to driver
2019-05-23 18:49:53,985 INFO (MainThread-121583) 2: ======== worker:0 ========
2019-05-23 18:49:53,986 INFO (MainThread-121583) 2: Cluster spec: {'worker': ['10.90.28.232:34493', '10.90.28.252:38762'], 'ps': ['10.90.28.222:45450'], 'master': ['10.90.28.230:42065']}
2019-05-23 18:49:53,986 INFO (MainThread-121583) 2: Using CPU
19/05/23 18:49:53 INFO CoarseGrainedExecutorBackend: Got assigned task 775
19/05/23 18:49:53 INFO Executor: Running task 0.0 in stage 73.0 (TID 775)
19/05/23 18:49:53 INFO TorrentBroadcast: Started reading broadcast variable 110
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110_piece0 stored as bytes in memory (estimated size 658.9 KB, free 1970.6 MB)
19/05/23 18:49:54 INFO TorrentBroadcast: Reading broadcast variable 110 took 50 ms
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110 stored as values in memory (estimated size 1434.9 KB, free 1969.2 MB)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_0 locally
2019-05-23 18:49:54.084395: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599985000 Hz
2019-05-23 18:49:54.085463: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4554d90 executing computations on platform Host. Devices:
2019-05-23 18:49:54.085503: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
19/05/23 18:49:54 INFO CodeGenerator: Code generated in 24.400731 ms
2019-05-23 18:49:54.095435: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job master -> {0 -> 10.90.28.230:42065}
2019-05-23 18:49:54.095469: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> 10.90.28.222:45450}
2019-05-23 18:49:54.095481: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:34493, 1 -> 10.90.28.252:38762}
2019-05-23 18:49:54.097460: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:34493
2019-05-23 18:49:54,175 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-23 18:49:54,238 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
19/05/23 18:49:54 INFO Executor: Finished task 0.0 in stage 73.0 (TID 775). 2805 bytes result sent to driver
19/05/23 18:49:54 INFO CoarseGrainedExecutorBackend: Got assigned task 776
19/05/23 18:49:54 INFO Executor: Running task 4.0 in stage 73.0 (TID 776)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_4 locally
19/05/23 18:49:54 INFO Executor: Finished task 4.0 in stage 73.0 (TID 776). 2762 bytes result sent to driver
19/05/23 18:49:54 INFO CoarseGrainedExecutorBackend: Got assigned task 778
19/05/23 18:49:54 INFO Executor: Running task 8.0 in stage 73.0 (TID 778)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_8 locally
19/05/23 18:49:54 INFO Executor: Finished task 8.0 in stage 73.0 (TID 778). 2805 bytes result sent to driver
Total params: 15,425
Trainable params: 15,425
Non-trainable params: 0
num_features: 174 num_records: 240000 batch_size: 1953 epochs: 3 steps_per_epoch: 128
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:
- https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
- https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
2019-05-23 18:49:57,229 INFO (MainThread-121583) ParameterServerStrategy with compute_devices = ('/replica:0/task:0/device:CPU:0',), variable_device = '/device:CPU:0'
2019-05-23 18:49:57,229 INFO (MainThread-121583) TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'ps': [u'10.90.28.222:45450'], u'worker': [u'10.90.28.232:34493', u'10.90.28.252:38762'], u'master': [u'10.90.28.230:42065']}, u'task': {u'index': 0, u'type': u'worker'}}
2019-05-23 18:49:57,229 INFO (MainThread-121583) Initializing RunConfig with distribution strategies.
2019-05-23 18:49:57,230 INFO (MainThread-121583) Not using Distribute Coordinator.
2019-05-23 18:49:57,230 INFO (MainThread-121583) Using the Keras model provided.
2019-05-23 18:49:57,754 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-23 18:49:58,405 INFO (MainThread-121583) Using config: {'_save_checkpoints_secs': 600, '_session_config': device_filters: "/job:ps"
device_filters: "/job:worker/task:0"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_task_type': u'worker', '_train_distribute': <tensorflow.contrib.distribute.python.parameter_server_strategy.ParameterServerStrategy object at 0x7ff44bbdf1d0>, '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff44bbdf050>, '_model_dir': '/tmp/model-20190523184945', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 1, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 3, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': <tensorflow.contrib.distribute.python.parameter_server_strategy.ParameterServerStrategy object at 0x7ff44bbdf1d0>, '_global_id_in_cluster': 1, '_master': u'grpc://10.90.28.232:34493', '_distribute_coordinator_mode': None}
2019-05-23 18:49:58,413 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
tf.py_function, which takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
2019-05-23 18:49:58,545 INFO (Thread-1-121583) Calling model_fn.
2019-05-23 18:49:59,510 INFO (Thread-1-121583) Done calling model_fn.
2019-05-23 18:49:59,554 INFO (MainThread-121583) Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='/tmp/model-20190523184945/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
2019-05-23 18:49:59,554 INFO (MainThread-121583) Warm-starting from: ('/tmp/model-20190523184945/keras/keras_model.ckpt',)
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_2/kernel; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense/bias; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_2/bias; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_1/kernel; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_1/bias; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense/kernel; prev_var_name: Unchanged
2019-05-23 18:49:59,586 INFO (MainThread-121583) Create CheckpointSaverHook.
2019-05-23 18:49:59,856 INFO (MainThread-121583) Graph was finalized.
2019-05-23 18:49:59.897664: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session c7382e2c43bf9a42 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:50:00,001 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
19/05/23 18:50:01 INFO CoarseGrainedExecutorBackend: Got assigned task 785
19/05/23 18:50:01 INFO Executor: Running task 5.0 in stage 73.0 (TID 785)
19/05/23 18:50:01 INFO BlockManager: Found block rdd_412_5 remotely
19/05/23 18:50:01 INFO Executor: Finished task 5.0 in stage 73.0 (TID 785). 2762 bytes result sent to driver
19/05/23 18:50:02 INFO CoarseGrainedExecutorBackend: Got assigned task 789
19/05/23 18:50:02 INFO Executor: Running task 2.0 in stage 74.0 (TID 789)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Updating epoch to 37 and clearing cache
19/05/23 18:50:02 INFO TorrentBroadcast: Started reading broadcast variable 111
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111_piece0 stored as bytes in memory (estimated size 10.5 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO TorrentBroadcast: Reading broadcast variable 111 took 5 ms
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111 stored as values in memory (estimated size 23.4 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 36, fetching them
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@srv-01-11-b09.iad1.trmr.io:33383)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Got the output locations
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Started 2 remote fetches in 1 ms
2019-05-23 18:50:02,149 INFO (MainThread-118615) Connected to TFSparkNode.mgr on 10.90.28.232, executor=2, state='running'
2019-05-23 18:50:02,160 INFO (MainThread-118615) mgr.state='running'
2019-05-23 18:50:02,160 INFO (MainThread-118615) Feeding partition <itertools.chain object at 0x7ff49de182d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x7ff48e5f22d0>
2019-05-23 18:50:30.041061: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 90084f8fef3f0cd9 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:50:30,098 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
2019-05-23 18:51:00.121986: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 55f754381ba9f6c4 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:51:00,176 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
2019-05-23 18:51:30.194164: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 920fab950aa69a50 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:51:30,244 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
```
PS Logs:
```
2019-05-23 18:49:55,059 INFO (MainThread-39243) 0: ======== ps:0 ========
2019-05-23 18:49:55,060 INFO (MainThread-39243) 0: Cluster spec: {'worker': ['10.90.28.232:34493', '10.90.28.252:38762'], 'ps': ['10.90.28.222:45450'], 'master': ['10.90.28.230:42065']}
2019-05-23 18:49:55,060 INFO (MainThread-39243) 0: Using CPU
2019-05-23 18:49:55.163831: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200225000 Hz
2019-05-23 18:49:55.167500: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7529890 executing computations on platform Host. Devices:
2019-05-23 18:49:55.167542: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-05-23 18:49:55.185042: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job master -> {0 -> 10.90.28.230:42065}
2019-05-23 18:49:55.185069: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> localhost:45450}
2019-05-23 18:49:55.185084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> 10.90.28.232:34493, 1 -> 10.90.28.252:38762}
2019-05-23 18:49:55.193529: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:45450
```
Spark Submit Command Line:
```
--master yarn --queue default
--conf spark.sql.autoBrodcastJoinThreshold=-1
--conf spark.yarn.executor.memoryOverhead=1g
--conf spark.storage.memoryFraction=0.2
--conf spark.executor.memory=4g
--conf spark.driver.memory=2g
--conf spark.executor.instances=4
--conf spark.executor.cores=4
--conf spark.dynamicAllocation.enabled=false
--conf spark.yarn.maxAppAttempts=1
--conf spark.task.cpus=4
```
Top GitHub Comments
Thanks for posting. I am facing the same issue.
@markromedia The model checkpoint file looks suspicious: /tmp/model-20190523184945/keras/keras_model.ckpt. This needs to be a path on a distributed filesystem, e.g. HDFS. Also, you will need to set up the LD_LIBRARY_PATH similar to this example to include the paths to the libhdfs.so and libjvm.so.
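To make that suggestion concrete: the first part is usually just pointing the estimator's model_dir at an HDFS URI instead of a node-local /tmp path, and the second part is typically handled by passing something like --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_HDFS:$LIB_JVM on the spark-submit line. A rough sketch of the driver-side change, with a purely illustrative HDFS path, might look like:

```python
import time
import tensorflow as tf

# Hypothetical HDFS location; the actual namenode/path depends on the cluster.
# Keeping model_dir on a distributed filesystem lets the chief, workers, and
# evaluator all see the same checkpoints.
model_dir = "hdfs:///user/pipeline/models/model-{}".format(time.strftime("%Y%m%d%H%M%S"))

estimator = tf.keras.estimator.model_to_estimator(model, model_dir=model_dir, config=config)
```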