Lightgbm - mysterious OOM problems (Issue #1124)
I am consistently getting errors like this at the reduce step while trying to train a LightGBM model:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(7, 0) finished unsuccessfully.
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1626376382032_0004_01_000002 on host: spark-ml-pipeline2-w-1.c.xxx.internal. Exit status: 137. Diagnostics: [2021-07-15 19:41:35.679]Container killed on request. Exit code is 137
[2021-07-15 19:41:35.680]Container exited with a non-zero exit code 137.
```
Dataset: 208,840,700 rows, 110 features, ~150 GB.
Training code/params:

```python
from pyspark.sql import SparkSession
from mmlspark.lightgbm import LightGBMClassifier

spark = (
    SparkSession.builder
    .appName(f"{args['model_type']}-model-train")
    .getOrCreate()
)

train = spark.read.parquet("gs://blahblah").select("id", "dt", "features", "label")
test = spark.read.parquet("gs://blahblah").select("id", "dt", "features", "label")

model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxDepth=8,
    numLeaves=70,
    learningRate=0.04,
    featureFraction=0.8,
    lambdaL1=3.0,
    lambdaL2=3.0,
    posBaggingFraction=1.0,
    negBaggingFraction=0.5,
    baggingFreq=10,
    numIterations=200,
    maxBin=63,
    useBarrierExecutionMode=True,
)

trained = model.fit(train)
results = trained.transform(test)
```
Cluster config: 3x n2-highmem-16 workers (16 vCPUs + 128 GB memory each)
```
spark:spark.driver.maxResultSize=1920m
spark:spark.driver.memory=3840m
spark:spark.dynamicAllocation.enabled=false
spark:spark.executor.cores=8
spark:spark.executor.instances=2
spark:spark.executor.memory=57215m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark:spark.jars.packages=com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT
spark:spark.jars.repositories=https://mmlspark.azureedge.net/maven
spark:spark.scheduler.mode=FAIR
spark:spark.shuffle.service.enabled=false
spark:spark.sql.cbo.enabled=true
spark:spark.ui.port=0
spark:spark.yarn.am.memory=640m
yarn-env:YARN_NODEMANAGER_HEAPSIZE=4000
yarn-env:YARN_RESOURCEMANAGER_HEAPSIZE=3840
yarn-env:YARN_TIMELINESERVER_HEAPSIZE=3840
yarn:yarn.nodemanager.address=0.0.0.0:8026
yarn:yarn.nodemanager.resource.cpu-vcores=16
yarn:yarn.nodemanager.resource.memory-mb=125872
yarn:yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs=86400
yarn:yarn.scheduler.maximum-allocation-mb=125872
yarn:yarn.scheduler.minimum-allocation-mb=1
```
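Exit status 137 is a SIGKILL (128 + 9), which on YARN usually means the NodeManager or the kernel OOM killer terminated the container for exceeding its memory limit, rather than a JVM `OutOfMemoryError`. A rough back-of-envelope (my own illustrative arithmetic, using the figures above, not anything from the issue) suggests how little headroom there is, since LightGBM additionally materializes the data in native, off-heap memory that `spark.executor.memory` does not account for:

```python
# Back-of-envelope memory check (illustrative arithmetic only).
executors = 2                 # spark.executor.instances
executor_heap_mb = 57_215     # spark.executor.memory
node_mem_mb = 125_872         # yarn.nodemanager.resource.memory-mb
dataset_gb = 150              # approximate dataset size

total_heap_gb = executors * executor_heap_mb / 1024
print(f"total executor heap: {total_heap_gb:.0f} GB for a ~{dataset_gb} GB dataset")
# LightGBM builds its native Dataset outside the JVM heap, so the real
# footprint is the JVM copy *plus* the off-heap copy -- enough to push a
# container past the YARN limit and trigger an exit-137 kill.
```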
**Stacktrace**

```
py4j.protocol.Py4JJavaError: An error occurred while calling o82.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(7, 0) finished unsuccessfully.
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1626376382032_0004_01_000002 on host: spark-ml-pipeline2-w-1.c.xxx. Exit status: 137. Diagnostics: [2021-07-15 19:41:35.679]Container killed on request. Exit code is 137
[2021-07-15 19:41:35.680]Container exited with a non-zero exit code 137.
[2021-07-15 19:41:35.680]Killed by external signal
.
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2297)
	at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1120)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1102)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.innerTrain(LightGBMBase.scala:481)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.innerTrain$(LightGBMBase.scala:440)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.innerTrain(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.$anonfun$train$1(LightGBMBase.scala:63)
	at com.microsoft.ml.spark.logging.BasicLogging.logVerb(BasicLogging.scala:63)
	at com.microsoft.ml.spark.logging.BasicLogging.logVerb$(BasicLogging.scala:60)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.logging.BasicLogging.logTrain(BasicLogging.scala:49)
	at com.microsoft.ml.spark.logging.BasicLogging.logTrain$(BasicLogging.scala:48)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.train(LightGBMBase.scala:44)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.train$(LightGBMBase.scala:43)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:26)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```
When I do:

```python
train = train.sample(withReplacement=False, fraction=0.25)
```

the job runs successfully. I'm guessing I could fix this by throwing more resources at the problem, but I would have thought my current cluster was already overkill given the dataset size.
So far I've tried:
- a few different mmlspark package versions
- turning `useBarrierExecutionMode` on/off
- decreasing `maxBin`
- setting `numTasks` to a small number (3)
- eliminating all pre-processing steps from the job (just read the parquet train/test data and then fit the model)
- dropping all optional arguments from the `LightGBMClassifier` specification
I am on Spark 3.0 and using `com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT`.
Thank you!
Issue Analytics: created 2 years ago, 5 comments (2 by maintainers).
Top GitHub Comments
Thank you both for the suggestions! In my case `useSingleDatasetMode=True` actually led to successful model training. I think you were right that the dataset was right up against the limit of what the cluster could handle. I don't think the input data was imbalanced in this case, and repartitioning alone did nothing. Thanks again - everything is working for me 😃
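For context on why that parameter helps (my own explanation, not stated in the thread): in the default mode each task materializes its own native LightGBM Dataset, while `useSingleDatasetMode=True` builds one shared Dataset per executor, so the number of off-heap copies drops from roughly one per core to one per executor. A toy sketch of that copy count, using the cluster figures from the question:

```python
# Illustrative count of native LightGBM Dataset copies (simplified model
# of useSingleDatasetMode; numbers taken from the cluster config above).
executors = 2
cores_per_executor = 8          # spark.executor.cores

# Default mode: roughly one native Dataset per concurrently running task.
copies_default = executors * cores_per_executor
# useSingleDatasetMode=True: one shared native Dataset per executor.
copies_single = executors

print(f"default: ~{copies_default} copies, single-dataset mode: {copies_single}")
```

Under this (simplified) model the native memory pressure drops by roughly a factor of `cores_per_executor`, which is consistent with the flag turning an exit-137 kill into a successful run.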