
[LightGBM] Train Lambdamart failed with "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1"


Describe the bug

Hi @imatiach-msft, I’m using mmlspark 0.18.1 to train a ranking job. Below is the main training flow.

def run(self):
        train_df = self.load_svm_rank_data()
        df = train_df.repartition(8, 'query_id')
        model = LightGBMRanker(
            parallelism='data_parallel',
            #  parallelism='voting_parallel',
            objective='lambdarank',
            boostingType='gbdt',
            numIterations=500,
            learningRate=0.1,
            # For recall task, 511,8 is enough.
            # numLeaves=511,
            # maxDepth=8,
            numLeaves=1023,
            maxDepth=10,
            earlyStoppingRound=0,
            maxPosition=20,
            #minSumHessianInLeaf=0.0005,
            minSumHessianInLeaf=0.001,
            lambdaL1=0.01,
            lambdaL2=0.01,
            isProvideTrainingMetric=True,
            #  baggingSeed=3,
            #  boostFromAverage=True,
            #  categoricalSlotIndexes=None,
            #  categoricalSlotNames=None,
            defaultListenPort=49650,
            #  defaultListenPort=12400,
            featuresCol='features',
            groupCol='query_id',
            #  initScoreCol=None,
            labelCol='label',
            #  labelGain=[],
            #  modelString='',
            numBatches=0,
            #  predictionCol='prediction',
            timeout=600000.0,
            #  useBarrierExecutionMode=False,
            #  validationIndicatorCol=None,
            verbosity=1,
            #  weightCol=None,
        ).fit(df)
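
Not part of the original report, but a small hedged sketch of the kind of sanity checks that could be run on df before calling fit(); the partition and task layout turns out to be the key point later in this thread. These are standard PySpark calls, and the variable names mirror the snippet above.

# Hedged sketch (not from the original issue): inspect the DataFrame layout before fit().
# df is the repartitioned DataFrame from run() above.
print("partitions:", df.rdd.getNumPartitions())                   # 8 after repartition(8, 'query_id')
print("query groups:", df.select("query_id").distinct().count())  # number of ranking groups
print("rows:", df.count())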

Below is the Spark job config.

/opt/meituan/spark-2.2/bin/spark-submit     --deploy-mode cluster --queue root.zw03_training.hadoop-map.training --executor-cores 40 --num-executors 8 --master yarn --driver-memory 8G --files /opt/meituan/spark-2.2/conf/hive-site.xml --executor-memory 16G --files /opt/tmp/etl/remote_file/session_D5E23EBC14BCEA4F_pysparkjar_00877de2cbfbf624dca5ac527f415c9e/city_province_list --repositories http://pixel.sankuai.com/repository/group-releases,http://pixel.sankuai.com/repository/mtdp --conf spark.yarn.maxAppAttempts=1 --conf spark.task.cpus=40 --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.kryoserializer.buffer.max=1024m --conf spark.driver.maxResultSize=10G --conf spark.executor.instances=8 --conf spark.hadoop.parquet.enable.summary-metadata=false --conf spark.executor.heartbeatInterval=30s --conf spark.default.parallelism=1024 --conf spark.sql.hive.metastorePartitionPruning=true --conf spark.yarn.driver.memoryOverhead=8096 --conf spark.sql.orc.filterPushdown=true --conf spark.sql.parquet.filterPushdown=true --conf spark.sql.shuffle.partitions=1024 --conf spark.sql.orc.splits.include.file.footer=true --conf spark.jars.packages=com.microsoft.ml.spark:mmlspark_2.11:0.18.1 --conf spark.sql.orc.cache.stripe.details.size=10000 --conf spark.sql.parquet.mergeSchema=false --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.yarn.executor.memoryOverhead=60G --conf spark.yarn.am.extraJavaOptions="-DappIdentify=hope_3375504 -Dport=AppMaster " --conf spark.driver.extraJavaOptions="-DappIdentify=hope_3375504 -Dport=Driver -XX:PermSize=128M -XX:MaxPermSize=256M " --conf spark.executor.extraJavaOptions="-DappIdentify=hope_3375504 -Dport=Executor "          --name huobaochong:/opt/meituan/20200616/topk_train_v2/shanghai/topk_train/topk_train.hope     --conf spark.job.owner=huobaochong     --conf spark.client.host=zw02-data-msp-launcher13.mt     --conf spark.job.type=mtmsp     --conf spark.flowid=D5E23EBC14BCEA4F     --conf spark.yarn.app.tags.flowid=D5E23EBC14BCEA4F     --conf spark.yarn.app.tags.schedulejobid=cantor-6177712     --conf spark.yarn.app.tags.scheduleinstanceid=     --conf spark.yarn.app.tags.scheduleplanid=     --conf spark.yarn.app.tags.onceexecid=once-exec-6163959     --conf spark.yarn.app.tags.rm.taskcode=hope:huobaochong:/opt/meituan/20200616/topk_train_v2/shanghai/topk_train/topk_train.hope     --conf spark.yarn.app.tags.rm.taskname=huobaochong:/opt/meituan/20200616/topk_train_v2/shanghai/topk_train/topk_train.hope     --conf spark.yarn.app.tags.rm.tasktype=hope     --conf spark.yarn.app.tags.mtmspCompileVersion=0     --conf spark.yarn.job.priority=1     --conf spark.hive.mt.metastore.audit.id=SPARK-MTMSP-D5E23EBC14BCEA4F     --conf spark.hadoop.hive.mt.metastore.audit.id=SPARK-MTMSP-D5E23EBC14BCEA4F     --conf spark.hbo.enabled=true     --conf spark.executor.cantorEtlIncreaseMemory.enabled=true     /opt/tmp/etl/remote_file/session_D5E23EBC14BCEA4F_pysparkjar_00877de2cbfbf624dca5ac527f415c9e/topk_train.py     20200615190316-v0.0.3_china-20200505-20200520-common-staging shangha
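
Not something stated in the original report, but a back-of-the-envelope check on the submit options above (assuming standard Spark scheduling), since the number of concurrent tasks is what the rest of the thread turns on:

# Hedged sketch: concurrent task slots implied by the spark-submit flags above.
executor_cores = 40   # --executor-cores 40
task_cpus = 40        # --conf spark.task.cpus=40
num_executors = 8     # --num-executors 8 / spark.executor.instances=8
slots = num_executors * (executor_cores // task_cpus)
print(slots)  # 8 concurrent tasks, matching df.repartition(8, 'query_id') above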

Below is the error info (stdout from the driver node):

py4j.protocol.Py4JJavaError: An error occurred while calling o149.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.1 failed 4 times, most recent failure: Lost task 3.3 in stage 1.1 (TID 13502, zw03-data-hdp-dn-cpu0244.mt, executor 9): java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at java.net.Socket.<init>(Socket.java:434)
    at java.net.Socket.<init>(Socket.java:211)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:324)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:398)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:393)
    at com.microsoft.ml.spark.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:392)
    at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
    at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
    at org.apache.spark.scheduler.Task.run(Task.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:363)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1576)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1564)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1563)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1563)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:822)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:822)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:822)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1794)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1746)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1735)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:634)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2060)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2157)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1033)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1015)
    at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1460)
    at com.microsoft.ml.spark.lightgbm.LightGBMBase$class.innerTrain(LightGBMBase.scala:90)
    at com.microsoft.ml.spark.lightgbm.LightGBMRanker.innerTrain(LightGBMRanker.scala:25)
    at com.microsoft.ml.spark.lightgbm.LightGBMBase$class.train(LightGBMBase.scala:38)
    at com.microsoft.ml.spark.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:25)
    at com.microsoft.ml.spark.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:25)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at java.net.Socket.<init>(Socket.java:434)
    at java.net.Socket.<init>(Socket.java:211)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:324)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:398)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:393)
    at com.microsoft.ml.spark.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:392)
    at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
    at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
    at org.apache.spark.scheduler.Task.run(Task.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:363)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    … 1 more

The executors reported similar errors (attached as a screenshot in the original issue).

Info (please complete the following information):

  • MMLSpark Version: mmlspark_2.11:0.18.1
  • Spark Version: 2.2
  • Spark Platform: Spark on YARN

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 37 (22 by maintainers)

Top GitHub Comments

2 reactions
ce39906 commented, Jun 26, 2020

> @ce39906 also, did you try the numTasks parameter I added in the new PR I sent you?
>
> #881
>
> Did that change the number of tasks?

@imatiach-msft, sorry for the late reply; these days are the Dragon Boat Festival holidays. Here is my main Spark conf:

master = yarn-cluster
driver-memory = 8G
driver-cores = 4
executor-memory = 100G
executor-cores = 40
is_dynamic_allocation = false
num-executors = 4

[option_env_args]
spark.executor.instances = 4
spark.task.cpus = 1
spark.jars.packages = com.microsoft.ml.lightgbm:lightgbmlib:2.3.180,com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-91-a13271d7-SNAPSHOT

spark.yarn.maxAppAttempts = 1
spark.driver.maxResultSize = 10G
spark.yarn.driver.memoryOverhead = 4096
spark.yarn.executor.memoryOverhead = 120G
spark.sql.orc.filterPushdown = true
spark.sql.orc.splits.include.file.footer = true
spark.sql.orc.cache.stripe.details.size = 10000
spark.sql.hive.metastorePartitionPruning = true

spark.hadoop.parquet.enable.summary-metadata = false
spark.sql.parquet.mergeSchema = false
spark.sql.parquet.filterPushdown = true
spark.sql.hive.metastorePartitionPruning = true

spark.sql.autoBroadcastJoinThreshold = -1

spark.sql.shuffle.partitions = 2048
spark.default.parallelism = 2048
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max = 1024m
spark.executor.heartbeatInterval = 30s
spark.network.timeout = 800s
spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:-UseGCOverheadLimit"

> did you try the numTasks parameter I added in the new PR I sent you?

Yes, I set the numTasks parameter to 40 (cores per executor) * 4 (number of executors) = 160, and the training stage had 160 tasks. The job succeeds with these parameters. I’m now trying other memory configs for executor memory and executor memory overhead.
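
For reference, a minimal hedged sketch of how the numTasks setting described above might look on the estimator; the parameter name comes from PR #881 and the value from the 40 * 4 = 160 calculation in this comment, so verify both against the exact mmlspark build in use.

# Hedged sketch, not the author's exact code: pin LightGBM's task count to the
# cluster's concurrent slots (4 executors * 40 cores, spark.task.cpus = 1).
num_executors = 4
cores_per_executor = 40
model = LightGBMRanker(
    objective='lambdarank',
    featuresCol='features',
    labelCol='label',
    groupCol='query_id',
    numTasks=num_executors * cores_per_executor,  # 160, matching the task count reported above
).fit(df)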

0 reactions
imatiach-msft commented, Jun 25, 2020

@ce39906 also, did you try the numTasks parameter I added in the new PR I sent you?

https://github.com/Azure/mmlspark/pull/881

Did that change the number of tasks?


