[LightGBM] Train Lambdamart failed with "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1"
Describe the bug
Hi @imatiach-msft, I'm using mmlspark 0.18.1 to train a ranking job. Below is the main training flow.
def run(self):
    train_df = self.load_svm_rank_data()
    df = train_df.repartition(8, 'query_id')
    model = LightGBMRanker(
        parallelism='data_parallel',
        # parallelism='voting_parallel',
        objective='lambdarank',
        boostingType='gbdt',
        numIterations=500,
        learningRate=0.1,
        # For the recall task, numLeaves=511 and maxDepth=8 are enough.
        # numLeaves=511,
        # maxDepth=8,
        numLeaves=1023,
        maxDepth=10,
        earlyStoppingRound=0,
        maxPosition=20,
        # minSumHessianInLeaf=0.0005,
        minSumHessianInLeaf=0.001,
        lambdaL1=0.01,
        lambdaL2=0.01,
        isProvideTrainingMetric=True,
        # baggingSeed=3,
        # boostFromAverage=True,
        # categoricalSlotIndexes=None,
        # categoricalSlotNames=None,
        defaultListenPort=49650,
        # defaultListenPort=12400,
        featuresCol='features',
        groupCol='query_id',
        # initScoreCol=None,
        labelCol='label',
        # labelGain=[],
        # modelString='',
        numBatches=0,
        # predictionCol='prediction',
        timeout=600000.0,
        # useBarrierExecutionMode=False,
        # validationIndicatorCol=None,
        verbosity=1,
        # weightCol=None,
    ).fit(df)
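One thing the snippet leaves implicit: in data_parallel mode each concurrently running training task hosts a LightGBM worker, so the partition count set by repartition(8, 'query_id') and the number of task slots should line up. A rough, hypothetical sanity check under that assumption (the training_slots helper below is illustrative, not part of mmlspark):

# Hypothetical sanity check (not part of the original job): compare the
# DataFrame's partition count against the number of task slots the cluster
# can run at once. A mismatch can leave LightGBM workers stuck or failing
# during network initialization.
def training_slots(spark):
    executors = int(spark.conf.get("spark.executor.instances", "1"))
    cores = int(spark.conf.get("spark.executor.cores", "1"))
    task_cpus = int(spark.conf.get("spark.task.cpus", "1"))
    return executors * cores // task_cpus

# With the job config shown next: 8 executors * 40 cores / 40 cpus-per-task
# = 8 slots, matching repartition(8, 'query_id').
print(df.rdd.getNumPartitions(), training_slots(spark))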
Below is the Spark job config.
/opt/meituan/spark-2.2/bin/spark-submit \
  --deploy-mode cluster \
  --queue root.zw03_training.hadoop-map.training \
  --executor-cores 40 \
  --num-executors 8 \
  --master yarn \
  --driver-memory 8G \
  --files /opt/meituan/spark-2.2/conf/hive-site.xml \
  --executor-memory 16G \
  --files /opt/tmp/etl/remote_file/session_D5E23EBC14BCEA4F_pysparkjar_00877de2cbfbf624dca5ac527f415c9e/city_province_list \
  --repositories http://pixel.sankuai.com/repository/group-releases,http://pixel.sankuai.com/repository/mtdp \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.task.cpus=40 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.driver.maxResultSize=10G \
  --conf spark.executor.instances=8 \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  --conf spark.executor.heartbeatInterval=30s \
  --conf spark.default.parallelism=1024 \
  --conf spark.sql.hive.metastorePartitionPruning=true \
  --conf spark.yarn.driver.memoryOverhead=8096 \
  --conf spark.sql.orc.filterPushdown=true \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.shuffle.partitions=1024 \
  --conf spark.sql.orc.splits.include.file.footer=true \
  --conf spark.jars.packages=com.microsoft.ml.spark:mmlspark_2.11:0.18.1 \
  --conf spark.sql.orc.cache.stripe.details.size=10000 \
  --conf spark.sql.parquet.mergeSchema=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.yarn.executor.memoryOverhead=60G \
  --conf spark.yarn.am.extraJavaOptions="-DappIdentify=hope_3375504 -Dport=AppMaster " \
  --conf spark.driver.extraJavaOptions="-DappIdentify=hope_3375504 -Dport=Driver -XX:PermSize=128M -XX:MaxPermSize=256M " \
  --conf spark.executor.extraJavaOptions="-DappIdentify=hope_3375504 -Dport=Executor " \
  --name huobaochong:/opt/meituan/20200616/topk_train_v2/shanghai/topk_train/topk_train.hope \
  --conf spark.job.owner=huobaochong \
  --conf spark.client.host=zw02-data-msp-launcher13.mt \
  --conf spark.job.type=mtmsp \
  --conf spark.flowid=D5E23EBC14BCEA4F \
  --conf spark.yarn.app.tags.flowid=D5E23EBC14BCEA4F \
  --conf spark.yarn.app.tags.schedulejobid=cantor-6177712 \
  --conf spark.yarn.app.tags.scheduleinstanceid= \
  --conf spark.yarn.app.tags.scheduleplanid= \
  --conf spark.yarn.app.tags.onceexecid=once-exec-6163959 \
  --conf spark.yarn.app.tags.rm.taskcode=hope:huobaochong:/opt/meituan/20200616/topk_train_v2/shanghai/topk_train/topk_train.hope \
  --conf spark.yarn.app.tags.rm.taskname=huobaochong:/opt/meituan/20200616/topk_train_v2/shanghai/topk_train/topk_train.hope \
  --conf spark.yarn.app.tags.rm.tasktype=hope \
  --conf spark.yarn.app.tags.mtmspCompileVersion=0 \
  --conf spark.yarn.job.priority=1 \
  --conf spark.hive.mt.metastore.audit.id=SPARK-MTMSP-D5E23EBC14BCEA4F \
  --conf spark.hadoop.hive.mt.metastore.audit.id=SPARK-MTMSP-D5E23EBC14BCEA4F \
  --conf spark.hbo.enabled=true \
  --conf spark.executor.cantorEtlIncreaseMemory.enabled=true \
  /opt/tmp/etl/remote_file/session_D5E23EBC14BCEA4F_pysparkjar_00877de2cbfbf624dca5ac527f415c9e/topk_train.py \
  20200615190316-v0.0.3_china-20200505-20200520-common-staging shangha
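The settings most relevant to this failure mode are the concurrency and memory ones. A distilled, hypothetical PySpark rendering of just that subset (values copied from the command above; the full job config still belongs in spark-submit):

# Hypothetical distilled subset of the spark-submit flags above. With
# spark.task.cpus equal to the executor cores, each executor runs exactly
# one task, i.e. one LightGBM worker.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("yarn")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "40")
    .config("spark.task.cpus", "40")
    .config("spark.executor.memory", "16g")
    .config("spark.yarn.executor.memoryOverhead", "60g")
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.18.1")
    .getOrCreate())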
Below is the error info, from stdout on the driver node:
py4j.protocol.Py4JJavaError: An error occurred while calling o149.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.1 failed 4 times, most recent failure: Lost task 3.3 in stage 1.1 (TID 13502, zw03-data-hdp-dn-cpu0244.mt, executor 9): java.net.ConnectException: Connection refused (Connection refused)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:589)
  at java.net.Socket.connect(Socket.java:538)
  at java.net.Socket.<init>(Socket.java:434)
  at java.net.Socket.<init>(Socket.java:211)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:324)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:398)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:393)
  at com.microsoft.ml.spark.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:392)
  at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
  at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
  at org.apache.spark.scheduler.Task.run(Task.scala:110)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:363)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1576)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1564)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1563)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1563)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:822)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:822)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:822)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1794)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1746)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1735)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:634)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2060)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2157)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1033)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:1015)
  at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1460)
  at com.microsoft.ml.spark.lightgbm.LightGBMBase$class.innerTrain(LightGBMBase.scala:90)
  at com.microsoft.ml.spark.lightgbm.LightGBMRanker.innerTrain(LightGBMRanker.scala:25)
  at com.microsoft.ml.spark.lightgbm.LightGBMBase$class.train(LightGBMBase.scala:38)
  at com.microsoft.ml.spark.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:25)
  at com.microsoft.ml.spark.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:25)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:280)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:589)
  at java.net.Socket.connect(Socket.java:538)
  at java.net.Socket.<init>(Socket.java:434)
  at java.net.Socket.<init>(Socket.java:211)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:324)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:398)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$15.apply(TrainUtils.scala:393)
  at com.microsoft.ml.spark.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
  at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:392)
  at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
  at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:85)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:834)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:43)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
  at org.apache.spark.scheduler.Task.run(Task.scala:110)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:363)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  … 1 more
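The failing frame is com.microsoft.ml.spark.lightgbm.TrainUtils$.getNetworkInitNodes, i.e. a task opening a plain TCP socket during LightGBM's network initialization, so the ConnectException means that TCP connect was refused. A minimal, hypothetical connectivity probe one could run from the affected hosts (hostname and port are placeholders, not values from this report; substitute the endpoint seen failing in your own logs):

# Hypothetical connectivity probe, mirroring the raw socket connect that
# fails inside TrainUtils.getNetworkInitNodes.
import socket

def can_connect(host, port, timeout_s=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

print(can_connect("some-driver-host.example", 49650))  # placeholder host/port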
The executors show similar errors (executor screenshots omitted).
Info (please complete the following information):
- MMLSpark Version: mmlspark_2.11:0.18.1
- Spark Version: 2.2
- Spark Platform: Spark on YARN
Top GitHub Comments
@imatiach-msft, sorry for the late reply; these days are the Dragon Boat Festival holidays. Here is my main Spark conf.
Did you try the numTasks parameter I added in the new PR I sent you?
Yes, I set the numTasks parameter to 40 (cores per executor) * 4 (number of executors) = 160, and the training stage had 160 tasks. The job succeeds with these parameters. I'm now trying other memory configs for executor.memory and executor.memory.overhead.
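For anyone following along, the resolution amounts to pinning the task count explicitly. A hedged sketch, assuming a build that includes the numTasks parameter from the PR linked below (other parameters abbreviated; see the full parameter list in the issue body above):

# Sketch only: cap the number of LightGBM training tasks at
# executors * cores-per-executor. Assumes the numTasks parameter from
# the PR linked below is present in your mmlspark build.
model = LightGBMRanker(
    objective='lambdarank',
    featuresCol='features',
    labelCol='label',
    groupCol='query_id',
    numTasks=160,  # 40 cores per executor * 4 executors, as reported
).fit(df)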
@ce39906 also, did you try the numTasks parameter I added in the new PR I sent you?
https://github.com/Azure/mmlspark/pull/881
Did that change the number of tasks?