
Lightgbm - mysterious OOM problems

See original GitHub issue (#1124)

I am consistently getting errors like this at the reduce step while trying to train a LightGBM model:

org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(7, 0) finished unsuccessfully.
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1626376382032_0004_01_000002 on host: spark-ml-pipeline2-w-1.c.xxx.internal. Exit status: 137. Diagnostics: [2021-07-15 19:41:35.679]Container killed on request. Exit code is 137
[2021-07-15 19:41:35.680]Container exited with a non-zero exit code 137.

Dataset: 208,840,700 rows × 110 features, ~150 GB

Training code/params:

# imports needed to run this snippet (mmlspark 1.0.0-rc3 package)
from pyspark.sql import SparkSession
from mmlspark.lightgbm import LightGBMClassifier

spark = SparkSession \
        .builder \
        .appName(f"{args['model_type']}-model-train") \
        .getOrCreate()

train = spark.read.parquet(f"gs://blahblah").select(*["id", "dt", "features", "label"])
test = spark.read.parquet(f"gs://blahblah").select(*["id", "dt", "features", "label"])

model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxDepth=8,
    numLeaves=70,
    learningRate=0.04,
    featureFraction=0.8,
    lambdaL1=3.0,
    lambdaL2=3.0,
    posBaggingFraction=1.0,
    negBaggingFraction=0.5,
    baggingFreq=10,
    numIterations=200,
    maxBin=63,
    useBarrierExecutionMode=True,
)

trained = model.fit(train)
results = trained.transform(test)

cluster config: 3x n2-highmem-16 workers (16 vCPUs + 128 GB memory each)

spark:spark.driver.maxResultSize=1920m
spark:spark.driver.memory=3840m
spark:spark.dynamicAllocation.enabled=false
spark:spark.executor.cores=8
spark:spark.executor.instances=2
spark:spark.executor.memory=57215m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark:spark.jars.packages=com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT
spark:spark.jars.repositories=https://mmlspark.azureedge.net/maven
spark:spark.scheduler.mode=FAIR
spark:spark.shuffle.service.enabled=false
spark:spark.sql.cbo.enabled=true
spark:spark.ui.port=0
spark:spark.yarn.am.memory=640m
yarn-env:YARN_NODEMANAGER_HEAPSIZE=4000
yarn-env:YARN_RESOURCEMANAGER_HEAPSIZE=3840
yarn-env:YARN_TIMELINESERVER_HEAPSIZE=3840
yarn:yarn.nodemanager.address=0.0.0.0:8026
yarn:yarn.nodemanager.resource.cpu-vcores=16
yarn:yarn.nodemanager.resource.memory-mb=125872
yarn:yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs=86400
yarn:yarn.scheduler.maximum-allocation-mb=125872
yarn:yarn.scheduler.minimum-allocation-mb=1
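
For reference, a minimal sketch (an assumption for illustration, not taken from the original post) of setting the same executor sizing directly on the SparkSession builder instead of through Dataproc cluster properties:

from pyspark.sql import SparkSession

# Hypothetical equivalent of the "spark:" cluster properties above, with the
# Dataproc "spark:" prefix dropped; values are copied from the config dump.
spark = (
    SparkSession.builder
    .appName("lightgbm-oom-repro")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "57215m")
    .config("spark.driver.memory", "3840m")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.jars.packages",
            "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)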

**Stacktrace**

py4j.protocol.Py4JJavaError: An error occurred while calling o82.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(7, 0) finished unsuccessfully.
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1626376382032_0004_01_000002 on host: spark-ml-pipeline2-w-1.c.xxx. Exit status: 137. Diagnostics: [2021-07-15 19:41:35.679]Container killed on request. Exit code is 137
[2021-07-15 19:41:35.680]Container exited with a non-zero exit code 137.
[2021-07-15 19:41:35.680]Killed by external signal
.
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2297)
	at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1120)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1102)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.innerTrain(LightGBMBase.scala:481)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.innerTrain$(LightGBMBase.scala:440)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.innerTrain(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.$anonfun$train$1(LightGBMBase.scala:63)
	at com.microsoft.ml.spark.logging.BasicLogging.logVerb(BasicLogging.scala:63)
	at com.microsoft.ml.spark.logging.BasicLogging.logVerb$(BasicLogging.scala:60)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.logging.BasicLogging.logTrain(BasicLogging.scala:49)
	at com.microsoft.ml.spark.logging.BasicLogging.logTrain$(BasicLogging.scala:48)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.train(LightGBMBase.scala:44)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.train$(LightGBMBase.scala:43)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:26)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

When I downsample with train = train.sample(withReplacement=False, fraction=0.25), the job runs successfully. My guess is that I could fix this by throwing more resources at the problem, but I would expect my current cluster to be more than enough given the dataset size.
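
For what it’s worth, a rough back-of-the-envelope check using the figures above (assuming, as a simplification, that the ~150 GB on-disk Parquet size is comparable to what LightGBM ends up holding in memory; the real footprint depends on encoding and compression):

# Rough capacity check from the question/config above; the in-memory-size
# assumption is a simplification, not a measurement.
dataset_gb = 150                      # ~150 GB of Parquet input
executors = 2                         # spark.executor.instances
executor_memory_gb = 57215 / 1024     # spark.executor.memory = 57215m
total_memory_gb = executors * executor_memory_gb

print(f"total executor memory ≈ {total_memory_gb:.1f} GB "
      f"for ~{dataset_gb} GB of training data")   # ≈ 111.7 GB vs 150 GB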

So far I’ve tried (a short sketch of these settings follows the list):

  • a few different mmlspark package versions
  • turning on/off useBarrierExecutionMode
  • decreasing maxBin
  • setting numTasks to a small number (3)
  • eliminating all pre-processing steps from the job (just read the parquet train/test data, then fit the model)
  • dropping all optional arguments from LightGBMClassifier specification
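
To make the list above concrete, a minimal sketch of how those attempts look on the classifier; the values shown are illustrative examples, not the exact settings from every run:

# Illustrative variant of the failing fit with the knobs mentioned above.
model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxBin=63,                      # tried lowering this
    numTasks=3,                     # tried forcing a small number of tasks
    useBarrierExecutionMode=False,  # tried both True and False
)
trained = model.fit(train)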

I am on Spark 3.0 and using com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT.

Thank you!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
trillville commented, Jul 20, 2021

Thank you both for the suggestions! In my case useSingleDatasetMode=True actually led to successful model training. I think you were right that the dataset was actually right up against the limit that the cluster could handle. I don’t think the input data was imbalanced in this case, and repartitioning alone did nothing.
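
For anyone landing here later, a minimal sketch of the fix described above, assuming the mmlspark snapshot in use exposes the useSingleDatasetMode parameter:

# Same classifier as in the question, with single-dataset mode enabled so each
# executor builds one shared native LightGBM dataset rather than one per task,
# which reduces peak native memory use during training.
model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxBin=63,
    useSingleDatasetMode=True,  # the setting that resolved the OOM in this thread
)
trained = model.fit(train)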

0 reactions
trillville commented, Jul 21, 2021

Thanks again - everything is working for me 😃

