Job failing on CDH 5.8.2 - Executor Heartbeat timing out
I’m running the command below:
spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
After waiting for a while, the job fails with this error: “Removing executor 2 with no recent heartbeats: 172657 ms exceeds timeout 120000 ms”
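For context (an editorial note, not from the original issue): the 120000 ms in that message is Spark’s default `spark.network.timeout` of 120s, and `spark.executor.heartbeatInterval` is required to be smaller than that timeout, so raising only the interval (to 1200s, as in the command above) does not extend the 120s limit. A sketch of raising both consistently:

```shell
# Sketch, not the issue author's command: the timeout that fires here is
# spark.network.timeout (default 120s). Raising the heartbeat interval
# alone leaves the 120s limit in place; if longer timeouts are really
# needed, raise both together, keeping the interval below the timeout.
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=60s \
  ... # remaining arguments unchanged
```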
I have tried running this in a couple of ways:
Option 1: client mode
Variables: $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/, $LIB_JVM = /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server
spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
Error log file: spark_client_mode.txt (“Executor Heartbeat timing out”)
Option 2: cluster mode
Variables: $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/, $LIB_JVM = /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server
Command:
spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 1 --executor-memory 2G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
Error: the job just hangs and then fails after running for a while.
Option 3: with additional Cloudera configs
Variables: $LIB_HDFS = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/, $LIB_JVM = /usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server
Command:
spark-submit --master yarn --deploy-mode client --queue cpu --num-executors 2 --executor-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.heartbeatInterval=1200s --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images hdfs:///user/mayub/mnist/csv/train/images --labels hdfs:///user/mayub/mnist/csv/train/labels --mode train --model hdfs:///user/mayub//mnist/mnist_model2
Error: same error as Option 1.
To validate that other spark-submit jobs run fine, I ran a Spark word-count example, and it completed successfully.
Appreciate any help.
Issue Analytics
- Created: 5 years ago
- Comments: 16 (8 by maintainers)
Top GitHub Comments
Just to summarize the issue and resolution for future reference. I removed the following configurations from the command, since --num-executors 4 implicitly sets dynamic allocation to false:
--conf spark.dynamicAllocation.enabled=false
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.dynamicAllocation.minExecutors=4
Also, I removed the absolute NameNode pathname from the train, test, model, and output HDFS directories, since relative path names resolve successfully.
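For anyone reading later (my illustration, assuming the HDFS default of home directories under /user/&lt;username&gt;): relative HDFS paths resolve against the submitting user’s home directory, so a relative path like the one in the commands below expands as sketched here.

```shell
# Relative HDFS paths resolve against the user's home directory,
# /user/<username> by default, so for the user "mayub":
USER_NAME=mayub
REL_PATH=mnist/csv/train/images
ABS_PATH="/user/${USER_NAME}/${REL_PATH}"
echo "$ABS_PATH"   # /user/mayub/mnist/csv/train/images
```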
I had to tune the following parameters according to the settings in my Cloudera cluster (please don’t blindly use the ones from the example):
--executor-memory 8G
--driver-memory 4G
--conf spark.yarn.executor.memoryOverhead=1600
--conf spark.yarn.driver.memoryOverhead=720
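To see why the overhead settings above matter, here is a sketch of the per-executor YARN container size, using the default memoryOverhead documented for Spark on YARN, max(384 MB, 10% of executor memory):

```shell
# Per-executor YARN container size = executor memory + memoryOverhead.
# Default overhead is max(384 MB, 10% of executor memory); the settings
# above raise it to 1600 MB explicitly.
EXECUTOR_MEM_MB=8192
DEFAULT_OVERHEAD=$(( EXECUTOR_MEM_MB / 10 ))
if [ "$DEFAULT_OVERHEAD" -lt 384 ]; then DEFAULT_OVERHEAD=384; fi
echo "default container: $(( EXECUTOR_MEM_MB + DEFAULT_OVERHEAD )) MB"   # 9011 MB
echo "tuned container:   $(( EXECUTOR_MEM_MB + 1600 )) MB"               # 9792 MB
```

YARN must be able to grant containers of the tuned size, which is why these values have to match the cluster’s own limits rather than be copied from the example.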
Add permissions to the model output folder:
$ hdfs dfs -chmod 777 /user/mayub/mnist/mnist_model
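A side note (my suggestion, not from the original resolution): chmod 777 makes the directory world-writable. A narrower sketch, assuming the executors run as the ‘yarn’ user and you have rights to change group ownership (changing groups often requires an HDFS admin):

```shell
# Hypothetical narrower alternative to chmod 777: grant group access
# to the yarn user instead of world-writable permissions.
hdfs dfs -chgrp yarn /user/mayub/mnist/mnist_model
hdfs dfs -chmod 775 /user/mayub/mnist/mnist_model
```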
Here is the final command that worked in both ‘client’ and ‘cluster’ mode.
Training:
spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 4 --executor-memory 8G --driver-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images /user/mayub/mnist/csv/test/images --labels /user/mayub/mnist/csv/test/labels --mode train --model /user/mayub/mnist/mnist_model
Inference:
spark-submit --master yarn --deploy-mode cluster --queue cpu --num-executors 4 --executor-memory 8G --driver-memory 4G --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.yarn.executor.memoryOverhead=1600 --conf spark.yarn.driver.memoryOverhead=720 --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib:/usr/lib/jvm/java-7-oracle-cloudera/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images /user/mayub/mnist/csv/test/images --labels /user/mayub/mnist/csv/test/labels --mode inference --model /user/mayub/mnist/mnist_model --output /user/mayub/mnist/predictions
@leewyang Thanks for your help. I’ll go ahead and close this, but would like your thoughts on my earlier comment on ‘yarn’ user.
That did the trick. Thanks.