
ParameterServerStrategy stuck in "Feeding partition"

See original GitHub issue

Environment:

  • Python version [2.7]
  • Spark version [2.3.2]
  • TensorFlow version [1.13.1]
  • TensorFlowOnSpark version [1.4.3]
  • Cluster version [Hadoop 3.x]

Describe the bug: Using the provided examples, I have been attempting to port one of our existing spark-ml jobs to TensorFlowOnSpark. These jobs all run in a dedicated YARN cluster (CPU-only).

As part of my POC, I am attempting to create a 2-worker, 1-ps, 1-master TensorFlow cluster that trains a simple Keras model (converted to a tf.estimator), using ParameterServerStrategy as the distribution strategy.

When I start this up, the cluster establishes itself, but the master gets stuck trying to feed the input queue. I have run the same example in non-distributed mode (multiple instances doing the same thing), and it worked fine.

Thanks in advance for any help.

Here is the relevant snippet of the code being run:

    model = Sequential()
    model.add(Dense(64, input_dim=num_features, activation='sigmoid'))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='sigmoid'))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=tf.train.AdamOptimizer(), metrics=['accuracy'])
    model.summary()

    distribution_strategy = tf.contrib.distribute.ParameterServerStrategy()
    config = tf.estimator.RunConfig(
        train_distribute=distribution_strategy, eval_distribute=distribution_strategy)
    estimator = tf.keras.estimator.model_to_estimator(model, model_dir=model_dir, config=config)

    def generate_rdd_data(tf_feed):
        while not tf_feed.should_stop():
            batch = tf_feed.next_batch(1)
            if len(batch) > 0:
                record = batch[0]
                features = numpy.array(record[0]).astype(numpy.float32)
                label = numpy.array([record[1]]).astype(numpy.float32)
                yield (features, label)
            else:
                return

    def train_input_fn():
        ds = tf.data.Dataset.from_generator(generator,
                                            (tf.float32, tf.float32),
                                            (tf.TensorShape([num_features]), tf.TensorShape([1])))
        ds = ds.batch(args.batch_size)
        return ds

    # add a hook to terminate the RDD data feed when the session ends
    hooks = [StopFeedHook(tf_feed)]

    # train model
    estimator.train(input_fn=train_input_fn, max_steps=steps_per_epoch, hooks=hooks)
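
One detail worth flagging in the snippet as pasted: train_input_fn passes a name called generator to tf.data.Dataset.from_generator, while the function defined above it is generate_rdd_data, and from_generator expects a callable rather than an already-created generator object. In the TensorFlowOnSpark examples this is typically wired up with a lambda that closes over the tf_feed. A minimal sketch of that wiring, assuming tf_feed, num_features, and args are in scope as in the snippet above (an illustration, not necessarily what the reporter's full script does):

    # Sketch only: bind the TFoS feed to the dataset through a callable.
    # Assumes tf_feed, num_features and args.batch_size exist as in the snippet above.
    def train_input_fn():
        ds = tf.data.Dataset.from_generator(
            lambda: generate_rdd_data(tf_feed),  # from_generator expects a callable
            (tf.float32, tf.float32),
            (tf.TensorShape([num_features]), tf.TensorShape([1])))
        ds = ds.batch(args.batch_size)
        return ds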

Logs:

Master logs:

2019-05-23 18:49:54,434 INFO (MainThread-53285) 1: ======== master:0 ========
2019-05-23 18:49:54,434 INFO (MainThread-53285) 1: Cluster spec: {'worker': ['10.90.28.232:34493', '10.90.28.252:38762'], 'ps': ['10.90.28.222:45450'], 'master': ['10.90.28.230:42065']}
2019-05-23 18:49:54,435 INFO (MainThread-53285) 1: Using CPU
19/05/23 18:49:54 INFO TorrentBroadcast: Started reading broadcast variable 110
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110_piece0 stored as bytes in memory (estimated size 658.9 KB, free 1970.6 MB)
19/05/23 18:49:54 INFO TorrentBroadcast: Reading broadcast variable 110 took 28 ms
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110 stored as values in memory (estimated size 1434.9 KB, free 1969.2 MB)
2019-05-23 18:49:54.479917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300070000 Hz
2019-05-23 18:49:54.481187: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x44295a0 executing computations on platform Host. Devices:
2019-05-23 18:49:54.481225: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-05-23 18:49:54.483937: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job master -> {0 -> localhost:42065}
2019-05-23 18:49:54.483963: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> 10.90.28.222:45450}
2019-05-23 18:49:54.483990: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> 10.90.28.232:34493, 1 -> 10.90.28.252:38762}
2019-05-23 18:49:54.486650: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:42065
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_3 locally
19/05/23 18:49:54 INFO CodeGenerator: Code generated in 21.755654 ms
19/05/23 18:49:54 INFO Executor: Finished task 3.0 in stage 73.0 (TID 777). 2805 bytes result sent to driver
19/05/23 18:49:54 INFO CoarseGrainedExecutorBackend: Got assigned task 780
19/05/23 18:49:54 INFO Executor: Running task 6.0 in stage 73.0 (TID 780)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_6 locally
19/05/23 18:49:55 INFO Executor: Finished task 6.0 in stage 73.0 (TID 780). 2762 bytes result sent to driver
19/05/23 18:49:55 INFO CoarseGrainedExecutorBackend: Got assigned task 782
19/05/23 18:49:55 INFO Executor: Running task 10.0 in stage 73.0 (TID 782)
19/05/23 18:49:55 INFO BlockManager: Found block rdd_412_10 locally
19/05/23 18:49:55 INFO Executor: Finished task 10.0 in stage 73.0 (TID 782). 2762 bytes result sent to driver
19/05/23 18:50:01 INFO CoarseGrainedExecutorBackend: Got assigned task 784
19/05/23 18:50:01 INFO Executor: Running task 1.0 in stage 73.0 (TID 784)
19/05/23 18:50:01 INFO BlockManager: Found block rdd_412_1 remotely
19/05/23 18:50:01 INFO Executor: Finished task 1.0 in stage 73.0 (TID 784). 2762 bytes result sent to driver
19/05/23 18:50:02 INFO CoarseGrainedExecutorBackend: Got assigned task 787
19/05/23 18:50:02 INFO Executor: Running task 0.0 in stage 74.0 (TID 787)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Updating epoch to 37 and clearing cache
19/05/23 18:50:02 INFO TorrentBroadcast: Started reading broadcast variable 111
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111_piece0 stored as bytes in memory (estimated size 10.5 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO TorrentBroadcast: Reading broadcast variable 111 took 5 ms
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111 stored as values in memory (estimated size 23.4 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 36, fetching them
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@srv-01-11-b09.iad1.trmr.io:33383)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Got the output locations
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Started 2 remote fetches in 1 ms
2019-05-23 18:50:02,187 INFO (MainThread-52742) Connected to TFSparkNode.mgr on 10.90.28.230, executor=1, state='running'
2019-05-23 18:50:02,194 INFO (MainThread-52742) mgr.state='running'
2019-05-23 18:50:02,194 INFO (MainThread-52742) Feeding partition <itertools.chain object at 0x7f97ea46c2d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f97dae9d2d0>
19/05/23 19:00:03 ERROR Executor: Exception in task 0.0 in stage 74.0 (TID 787)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/pipeline/appcache/application_1557769783296_1143/container_e21_1557769783296_1143_01_000005/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/hadoop/yarn/local/usercache/pipeline/appcache/application_1557769783296_1143/container_e21_1557769783296_1143_01_000005/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 350, in func
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 799, in func
  File "/usr/lib/python2.7/site-packages/tensorflowonspark/TFSparkNode.py", line 420, in _train
    raise Exception("Timeout while feeding partition")
Exception: Timeout while feeding partition

Worker logs:

2019-05-23 18:49:53,977 INFO (MainThread-118610) Starting TensorFlow worker:0 as worker on cluster node 2 on background process
19/05/23 18:49:53 INFO PythonRunner: Times: total = 7808, boot = -36392, init = 43165, finish = 1035
19/05/23 18:49:53 INFO Executor: Finished task 2.0 in stage 72.0 (TID 773). 1418 bytes result sent to driver
2019-05-23 18:49:53,985 INFO (MainThread-121583) 2: ======== worker:0 ========
2019-05-23 18:49:53,986 INFO (MainThread-121583) 2: Cluster spec: {'worker': ['10.90.28.232:34493', '10.90.28.252:38762'], 'ps': ['10.90.28.222:45450'], 'master': ['10.90.28.230:42065']}
2019-05-23 18:49:53,986 INFO (MainThread-121583) 2: Using CPU
19/05/23 18:49:53 INFO CoarseGrainedExecutorBackend: Got assigned task 775
19/05/23 18:49:53 INFO Executor: Running task 0.0 in stage 73.0 (TID 775)
19/05/23 18:49:53 INFO TorrentBroadcast: Started reading broadcast variable 110
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110_piece0 stored as bytes in memory (estimated size 658.9 KB, free 1970.6 MB)
19/05/23 18:49:54 INFO TorrentBroadcast: Reading broadcast variable 110 took 50 ms
19/05/23 18:49:54 INFO MemoryStore: Block broadcast_110 stored as values in memory (estimated size 1434.9 KB, free 1969.2 MB)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_0 locally
2019-05-23 18:49:54.084395: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599985000 Hz
2019-05-23 18:49:54.085463: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4554d90 executing computations on platform Host. Devices:
2019-05-23 18:49:54.085503: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
19/05/23 18:49:54 INFO CodeGenerator: Code generated in 24.400731 ms
2019-05-23 18:49:54.095435: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job master -> {0 -> 10.90.28.230:42065}
2019-05-23 18:49:54.095469: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> 10.90.28.222:45450}
2019-05-23 18:49:54.095481: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:34493, 1 -> 10.90.28.252:38762}
2019-05-23 18:49:54.097460: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:34493
2019-05-23 18:49:54,175 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
2019-05-23 18:49:54,238 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
19/05/23 18:49:54 INFO Executor: Finished task 0.0 in stage 73.0 (TID 775). 2805 bytes result sent to driver
19/05/23 18:49:54 INFO CoarseGrainedExecutorBackend: Got assigned task 776
19/05/23 18:49:54 INFO Executor: Running task 4.0 in stage 73.0 (TID 776)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_4 locally
19/05/23 18:49:54 INFO Executor: Finished task 4.0 in stage 73.0 (TID 776). 2762 bytes result sent to driver
19/05/23 18:49:54 INFO CoarseGrainedExecutorBackend: Got assigned task 778
19/05/23 18:49:54 INFO Executor: Running task 8.0 in stage 73.0 (TID 778)
19/05/23 18:49:54 INFO BlockManager: Found block rdd_412_8 locally
19/05/23 18:49:54 INFO Executor: Finished task 8.0 in stage 73.0 (TID 778). 2805 bytes result sent to driver


Total params: 15,425
Trainable params: 15,425
Non-trainable params: 0


num_features: 174
num_records: 240000
batch_size: 1953
epochs: 3
steps_per_epoch: 128

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

2019-05-23 18:49:57,229 INFO (MainThread-121583) ParameterServerStrategy with compute_devices = ('/replica:0/task:0/device:CPU:0',), variable_device = '/device:CPU:0'
2019-05-23 18:49:57,229 INFO (MainThread-121583) TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'ps': [u'10.90.28.222:45450'], u'worker': [u'10.90.28.232:34493', u'10.90.28.252:38762'], u'master': [u'10.90.28.230:42065']}, u'task': {u'index': 0, u'type': u'worker'}}
2019-05-23 18:49:57,229 INFO (MainThread-121583) Initializing RunConfig with distribution strategies.
2019-05-23 18:49:57,230 INFO (MainThread-121583) Not using Distribute Coordinator.
2019-05-23 18:49:57,230 INFO (MainThread-121583) Using the Keras model provided.
2019-05-23 18:49:57,754 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
2019-05-23 18:49:58,405 INFO (MainThread-121583) Using config: {'_save_checkpoints_secs': 600, '_session_config': device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_task_type': u'worker', '_train_distribute': <tensorflow.contrib.distribute.python.parameter_server_strategy.ParameterServerStrategy object at 0x7ff44bbdf1d0>, '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff44bbdf050>, '_model_dir': '/tmp/model-20190523184945', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 1, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 3, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': <tensorflow.contrib.distribute.python.parameter_server_strategy.ParameterServerStrategy object at 0x7ff44bbdf1d0>, '_global_id_in_cluster': 1, '_master': u'grpc://10.90.28.232:34493', '_distribute_coordinator_mode': None}
2019-05-23 18:49:58,413 WARNING (MainThread-121583) From /usr/lib/python2.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means tf.py_functions can use accelerators such as GPUs as well as being differentiable using a gradient tape.

2019-05-23 18:49:58,545 INFO (Thread-1-121583) Calling model_fn.
2019-05-23 18:49:59,510 INFO (Thread-1-121583) Done calling model_fn.
2019-05-23 18:49:59,554 INFO (MainThread-121583) Warm-starting with WarmStartSettings: WarmStartSettings(ckpt_to_initialize_from='/tmp/model-20190523184945/keras/keras_model.ckpt', vars_to_warm_start='.*', var_name_to_vocab_info={}, var_name_to_prev_var_name={})
2019-05-23 18:49:59,554 INFO (MainThread-121583) Warm-starting from: ('/tmp/model-20190523184945/keras/keras_model.ckpt',)
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_2/kernel; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense/bias; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_2/bias; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_1/kernel; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense_1/bias; prev_var_name: Unchanged
2019-05-23 18:49:59,555 INFO (MainThread-121583) Warm-starting variable: dense/kernel; prev_var_name: Unchanged
2019-05-23 18:49:59,586 INFO (MainThread-121583) Create CheckpointSaverHook.
2019-05-23 18:49:59,856 INFO (MainThread-121583) Graph was finalized.
2019-05-23 18:49:59.897664: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session c7382e2c43bf9a42 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:50:00,001 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
19/05/23 18:50:01 INFO CoarseGrainedExecutorBackend: Got assigned task 785
19/05/23 18:50:01 INFO Executor: Running task 5.0 in stage 73.0 (TID 785)
19/05/23 18:50:01 INFO BlockManager: Found block rdd_412_5 remotely
19/05/23 18:50:01 INFO Executor: Finished task 5.0 in stage 73.0 (TID 785). 2762 bytes result sent to driver
19/05/23 18:50:02 INFO CoarseGrainedExecutorBackend: Got assigned task 789
19/05/23 18:50:02 INFO Executor: Running task 2.0 in stage 74.0 (TID 789)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Updating epoch to 37 and clearing cache
19/05/23 18:50:02 INFO TorrentBroadcast: Started reading broadcast variable 111
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111_piece0 stored as bytes in memory (estimated size 10.5 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO TorrentBroadcast: Reading broadcast variable 111 took 5 ms
19/05/23 18:50:02 INFO MemoryStore: Block broadcast_111 stored as values in memory (estimated size 23.4 KB, free 1969.2 MB)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 36, fetching them
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@srv-01-11-b09.iad1.trmr.io:33383)
19/05/23 18:50:02 INFO MapOutputTrackerWorker: Got the output locations
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Getting 12 non-empty blocks out of 12 blocks
19/05/23 18:50:02 INFO ShuffleBlockFetcherIterator: Started 2 remote fetches in 1 ms
2019-05-23 18:50:02,149 INFO (MainThread-118615) Connected to TFSparkNode.mgr on 10.90.28.232, executor=2, state='running'
2019-05-23 18:50:02,160 INFO (MainThread-118615) mgr.state='running'
2019-05-23 18:50:02,160 INFO (MainThread-118615) Feeding partition <itertools.chain object at 0x7ff49de182d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x7ff48e5f22d0>
2019-05-23 18:50:30.041061: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 90084f8fef3f0cd9 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:50:30,098 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
2019-05-23 18:51:00.121986: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 55f754381ba9f6c4 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:51:00,176 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None
2019-05-23 18:51:30.194164: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 920fab950aa69a50 with config: device_filters: "/job:ps" device_filters: "/job:worker/task:0" allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } isolate_session_state: true
2019-05-23 18:51:30,244 INFO (MainThread-121583) Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, dense/kernel, dense/bias, dense_1/kernel, dense_1/bias, dense_2/kernel, dense_2/bias, training/TFOptimizer/beta1_power, training/TFOptimizer/beta2_power, dense/kernel/Adam, dense/kernel/Adam_1, dense/bias/Adam, dense/bias/Adam_1, dense_1/kernel/Adam, dense_1/kernel/Adam_1, dense_1/bias/Adam, dense_1/bias/Adam_1, dense_2/kernel/Adam, dense_2/kernel/Adam_1, dense_2/bias/Adam, dense_2/bias/Adam_1, ready: None

PS logs:

2019-05-23 18:49:55,059 INFO (MainThread-39243) 0: ======== ps:0 ========
2019-05-23 18:49:55,060 INFO (MainThread-39243) 0: Cluster spec: {'worker': ['10.90.28.232:34493', '10.90.28.252:38762'], 'ps': ['10.90.28.222:45450'], 'master': ['10.90.28.230:42065']}
2019-05-23 18:49:55,060 INFO (MainThread-39243) 0: Using CPU
2019-05-23 18:49:55.163831: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200225000 Hz
2019-05-23 18:49:55.167500: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7529890 executing computations on platform Host. Devices:
2019-05-23 18:49:55.167542: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-05-23 18:49:55.185042: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job master -> {0 -> 10.90.28.230:42065}
2019-05-23 18:49:55.185069: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> localhost:45450}
2019-05-23 18:49:55.185084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> 10.90.28.232:34493, 1 -> 10.90.28.252:38762}
2019-05-23 18:49:55.193529: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:45450

Spark submit command line:

    --master yarn --queue default
    --conf spark.sql.autoBrodcastJoinThreshold=-1
    --conf spark.yarn.executor.memoryOverhead=1g
    --conf spark.storage.memoryFraction=0.2
    --conf spark.executor.memory=4g
    --conf spark.driver.memory=2g
    --conf spark.executor.instances=4
    --conf spark.executor.cores=4
    --conf spark.dynamicAllocation.enabled=false
    --conf spark.yarn.maxAppAttempts=1
    --conf spark.task.cpus=4

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

2 reactions
hdarabi commented, May 23, 2019

Thanks for posting. I am facing the same issue.

1 reaction
leewyang commented, May 23, 2019

@markromedia The model checkpoint path looks suspicious: /tmp/model-20190523184945/keras/keras_model.ckpt. This needs to be a path on a distributed filesystem, e.g. HDFS. Also, you will need to set up LD_LIBRARY_PATH similar to this example to include the paths to libhdfs.so and libjvm.so.
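
A minimal sketch of what the first part of that advice could look like when applied to the snippet above: point model_dir at a distributed filesystem instead of a node-local /tmp directory. The hdfs:// path below is a hypothetical placeholder, and TensorFlow can only write to HDFS once the LD_LIBRARY_PATH/CLASSPATH environment is set up as described (for example via spark.executorEnv.LD_LIBRARY_PATH on spark-submit); that environment setup is not shown here.

    import time

    # Sketch only: checkpoint to a distributed filesystem rather than local /tmp.
    # The HDFS URI below is a made-up placeholder, not a path taken from this issue.
    model_dir = 'hdfs://default/user/pipeline/models/model-{0}'.format(int(time.time()))

    estimator = tf.keras.estimator.model_to_estimator(
        model, model_dir=model_dir, config=config)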

Read more comments on GitHub >

