Job failing on CDH 5.8.2 - Executor Heartbeat timing out
I’m running the command below:
spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
After waiting for a while, the job fails with this error: “Removing executor 2 with no recent heartbeats: 172657 ms exceeds timeout 120000 ms”
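For context (an editorial note, not from the original issue): the 120000 ms in that message is Spark’s default `spark.network.timeout` of 120s, and `spark.executor.heartbeatInterval` is required to be smaller than that timeout, so raising only the interval (to 1200s, as in the command above) does not extend the 120s limit. A sketch of raising both consistently:

```shell
# Sketch, not the issue author's command: the timeout that fires here is
# spark.network.timeout (default 120s). Raising the heartbeat interval
# alone leaves the 120s limit in place; if longer timeouts are really
# needed, raise both together, keeping the interval below the timeout.
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=60s \
  ... # remaining arguments unchanged
```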
I have tried running this in a couple of ways:
Option 1: client mode
Variables: $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/, $LIB_JVM = /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server
spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
Error log file: spark_client_mode.txt (“Executor Heartbeat timing out”)
Option 2: cluster mode
Variables: $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/, $LIB_JVM = /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server
Command:
spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 1 --executor-memory 2G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
Error: the job just hangs and then fails after running for a while.
Option 3: with additional Cloudera configs
Variables: $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/, $LIB_JVM = /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server
Command:
spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
Error: same error as Option 1.
To validate that other spark-submit jobs run fine, I ran a Spark word-count example, and it completed successfully.
Appreciate any help.
Issue Analytics
- Created: 5 years ago
- Comments: 16 (8 by maintainers)
Top GitHub Comments
Just to summarize the issue and resolution for future reference. I removed the following configurations from the command, since --num-executors 4 implicitly sets dynamic allocation to false:
--conf spark.dynamicAllocation.enabled=false
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.dynamicAllocation.minExecutors=4
Also, I removed the absolute NameNode pathname from the train, test, model, and output HDFS directories, since relative path names resolve successfully.
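For anyone reading later (my illustration, assuming the HDFS default of home directories under /user/&lt;username&gt;): relative HDFS paths resolve against the submitting user’s home directory, so a relative path like the one in the commands below expands as sketched here.

```shell
# Relative HDFS paths resolve against the user's home directory,
# /user/<username> by default, so for the user "mayub":
USER_NAME=mayub
REL_PATH=mnist/csv/train/images
ABS_PATH="/user/${USER_NAME}/${REL_PATH}"
echo "$ABS_PATH"   # /user/mayub/mnist/csv/train/images
```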
I had to tune the following parameters according to the settings in my Cloudera cluster (please don’t blindly use the ones from the example):
--executor-memory 8G
--driver-memory 4G
--conf spark.yarn.executor.memoryOverhead=1600
--conf spark.yarn.driver.memoryOverhead=720
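To see why the overhead settings above matter, here is a sketch of the per-executor YARN container size, using the default memoryOverhead documented for Spark on YARN, max(384 MB, 10% of executor memory):

```shell
# Per-executor YARN container size = executor memory + memoryOverhead.
# Default overhead is max(384 MB, 10% of executor memory); the settings
# above raise it to 1600 MB explicitly.
EXECUTOR_MEM_MB=8192
DEFAULT_OVERHEAD=$(( EXECUTOR_MEM_MB / 10 ))
if [ "$DEFAULT_OVERHEAD" -lt 384 ]; then DEFAULT_OVERHEAD=384; fi
echo "default container: $(( EXECUTOR_MEM_MB + DEFAULT_OVERHEAD )) MB"   # 9011 MB
echo "tuned container:   $(( EXECUTOR_MEM_MB + 1600 )) MB"               # 9792 MB
```

YARN must be able to grant containers of the tuned size, which is why these values have to match the cluster’s own limits rather than be copied from the example.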
Add permissions to the model output folder:
$ hdfs dfs -chmod 777 /user/mayub/mnist/mnist_model
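A side note (my suggestion, not from the original resolution): chmod 777 makes the directory world-writable. A narrower sketch, assuming the executors run as the ‘yarn’ user and you have rights to change group ownership (changing groups often requires an HDFS admin):

```shell
# Hypothetical narrower alternative to chmod 777: grant group access
# to the yarn user instead of world-writable permissions.
hdfs dfs -chgrp yarn /user/mayub/mnist/mnist_model
hdfs dfs -chmod 775 /user/mayub/mnist/mnist_model
```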
Here is the final command that worked in both ‘client’ and ‘cluster’ mode.
Training:
spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 4 --executor-memory 8G --driver-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images /user/mayub/mnist/csv/test/images --labels /user/mayub/mnist/csv/test/labels --mode train --model /user/mayub/mnist/mnist_model
Inference:
spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 4 --executor-memory 8G --driver-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images /user/mayub/mnist/csv/test/images --labels /user/mayub/mnist/csv/test/labels --mode inference --model /user/mayub/mnist/mnist_model --output /user/mayub/mnist/predictions
@leewyang Thanks for your help. I’ll go ahead and close this, but would like your thoughts on my earlier comment on ‘yarn’ user.
That did the trick. Thanks.