Job failing on CDH 5.8.2 - Executor Heartbeat timing out


I’m running the following command:

spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2

After a while, the job fails with the following error: “Removing executor 2 with no recent heartbeats: 172657 ms exceeds timeout 120000 ms”
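One thing worth checking here: the 120000 ms in the error matches Spark's documented default for spark.network.timeout (120s), while the command sets spark.executor.heartbeatInterval=1200s. The Spark configuration docs say the heartbeat interval should be significantly less than spark.network.timeout. The sketch below is only an illustration of that sanity check and may not be the root cause of this particular failure; the helper names (to_ms, heartbeat_is_sane, parse_heartbeat_error) are hypothetical and not part of Spark.

```python
import re

# Millisecond multipliers for the duration suffixes handled by this sketch
# (an assumed subset of the suffixes Spark accepts).
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "min": 60_000, "h": 3_600_000}


def to_ms(value: str) -> int:
    """Convert a duration string such as '1200s' or '120000ms' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", value.lower())
    if not match or match.group(2) not in _UNIT_MS:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def heartbeat_is_sane(heartbeat_interval: str, network_timeout: str = "120s") -> bool:
    """True when the heartbeat interval is shorter than the network timeout.

    The 120s default is an assumption taken from the Spark docs; a cluster
    may override it in spark-defaults.conf.
    """
    return to_ms(heartbeat_interval) < to_ms(network_timeout)


def parse_heartbeat_error(line: str):
    """Extract (elapsed_ms, timeout_ms) from the driver's removal message, or None."""
    match = re.search(r"no recent heartbeats: (\d+) ms exceeds timeout (\d+) ms", line)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    log_line = ("Removing executor 2 with no recent heartbeats: "
                "172657 ms exceeds timeout 120000 ms")
    print(parse_heartbeat_error(log_line))
    print(heartbeat_is_sane("1200s"))  # the value used in the command above
```

If the check fails, either lowering the heartbeat interval or raising spark.network.timeout would restore the expected relationship between the two settings.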

I have tried running this in a couple of ways:

Option 1: client mode

Variables - $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/ $LIB_JVM= /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server

spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2

Error log File: “Executor Heartbeat timing out” spark_client_mode.txt

Option 2: cluster mode

Variables - $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/ $LIB_JVM= /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server

Command: spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 1 --executor-memory 2G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2

Error Log File: “Job just hangs and then fails after running for a while”

spark_hanging_job.txt

Option 3: with additional Cloudera configs

Variables - $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/ $LIB_JVM= /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server

Command: spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2

Error: same error as Option 1.

To confirm that other spark-submit jobs run correctly, I ran a Spark word count example, and it completed without problems.

Appreciate any help.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

mohammedayub44 commented, Jun 27, 2018

To summarize the issue and resolution for future reference: I removed the following configurations from the command, because --num-executors 4 implicitly disables dynamic allocation:

--conf spark.dynamicAllocation.enabled=false --conf spark.dynamicAllocation.maxExecutors=4 --conf spark.dynamicAllocation.minExecutors=4

I also removed the absolute NameNode (NN) path prefix from the train, test, model, and output HDFS directories, since relative path names resolve successfully.

I had to tune the following parameters according to the settings in my Cloudera cluster (please don’t blindly use the ones from the example):

--executor-memory 8G --driver-memory 4G --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720
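To see how these values translate into YARN container requests, here is a minimal sketch. It assumes the container size is the JVM heap plus the memory overhead, and that the default overhead (when none is set) is max(384 MB, 10% of the heap), as documented for Spark on YARN in this era; the function names are hypothetical. YARN may additionally round the request up to a multiple of yarn.scheduler.minimum-allocation-mb, which this sketch does not model.

```python
def default_overhead_mb(heap_mb: int, factor: float = 0.10, minimum_mb: int = 384) -> int:
    """Assumed Spark-on-YARN default overhead: max(minimum, factor * heap)."""
    return max(int(heap_mb * factor), minimum_mb)


def container_request_mb(heap_mb: int, overhead_mb=None) -> int:
    """Approximate memory requested from YARN for one driver or executor container."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(heap_mb)
    return heap_mb + overhead_mb


if __name__ == "__main__":
    # Values from the tuned parameters above: 8G executors, 4G driver.
    print("executor container MB:", container_request_mb(8 * 1024, 1600))
    print("driver container MB:", container_request_mb(4 * 1024, 720))
    # What the executor would request with no explicit overhead setting.
    print("executor with default overhead MB:", container_request_mb(8 * 1024))
```

Each result has to fit within the cluster's yarn.scheduler.maximum-allocation-mb, which is one reason the numbers need to be adapted per cluster.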

I also added permissions to the model output folder:

$ hdfs dfs -chmod 777 /user/mayub/mnist/mnist_model

Here are the final commands that worked in both ‘client’ and ‘cluster’ mode.

Training: spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 4 --executor-memory 8G --driver-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images /user/mayub/mnist/csv/test/images --labels /user/mayub/mnist/csv/test/labels --mode train --model /user/mayub/mnist/mnist_model

Inference: spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 4 --executor-memory 8G --driver-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images /user/mayub/mnist/csv/test/images --labels /user/mayub/mnist/csv/test/labels --mode inference --model /user/mayub/mnist/mnist_model --output /user/mayub/mnist/predictions
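The training and inference commands share almost all of their options, so they can be assembled from one place. The following is a rough sketch of that idea, not part of TensorFlowOnSpark: build_submit_args is a hypothetical helper, and the resource values are simply the ones from the commands above. It omits the spark.yarn.maxAppAttempts=1 setting that appears only in the training command.

```python
import shlex

# Library path copied from the commands above; adjust for your own cluster.
LD_LIBRARY_PATH = (
    "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:"
    "/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server"
)
EXAMPLES = "TensorFlowOnSpark/examples/mnist/spark"


def build_submit_args(mode, images, labels, model, output=None, deploy_mode="cluster"):
    """Return the spark-submit argument list for the MNIST example (hypothetical helper)."""
    args = [
        "spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode,
        "--queue", "cpu", "--num-executors", "4",
        "--executor-memory", "8G", "--driver-memory", "4G",
        "--py-files", f"{EXAMPLES}/mnist_dist.py",
        "--conf", "spark.yarn.executor.memoryOverhead=1600",
        "--conf", "spark.yarn.driver.memoryOverhead=720",
        "--conf", f"spark.executorEnv.LD_LIBRARY_PATH={LD_LIBRARY_PATH}",
        f"{EXAMPLES}/mnist_spark.py",
        "--images", images, "--labels", labels,
        "--mode", mode, "--model", model,
    ]
    if output is not None:
        # Only the inference command above passes --output.
        args += ["--output", output]
    return args


if __name__ == "__main__":
    inference = build_submit_args(
        "inference",
        images="/user/mayub/mnist/csv/test/images",
        labels="/user/mayub/mnist/csv/test/labels",
        model="/user/mayub/mnist/mnist_model",
        output="/user/mayub/mnist/predictions",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(inference))
```

Keeping the arguments as a list also makes it straightforward to pass them to subprocess.run without shell quoting issues.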

@leewyang Thanks for your help. I’ll go ahead and close this, but would like your thoughts on my earlier comment on ‘yarn’ user.

kwontaeheon commented, Dec 28, 2018

$ hdfs dfs -chmod 777 /user/mayub/mnist/mnist_model

That did the trick. Thanks.


