
Lightgbm - mysterious OOM problems

See original GitHub issue (#1124)

I am consistently getting errors like this at the reduce step while trying to train a LightGBM model:

org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(7, 0) finished unsuccessfully.
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1626376382032_0004_01_000002 on host: spark-ml-pipeline2-w-1.c.xxx.internal. Exit status: 137. Diagnostics: [2021-07-15 19:41:35.679]Container killed on request. Exit code is 137
[2021-07-15 19:41:35.680]Container exited with a non-zero exit code 137.

Dataset: 208,840,700 rows × 110 features, ~150 GB

Training code/params:

# imports needed to run this snippet (mmlspark 1.0.0-rc3 package)
from pyspark.sql import SparkSession
from mmlspark.lightgbm import LightGBMClassifier

spark = SparkSession \
        .builder \
        .appName(f"{args['model_type']}-model-train") \
        .getOrCreate()

train = spark.read.parquet(f"gs://blahblah").select(*["id", "dt", "features", "label"])
test = spark.read.parquet(f"gs://blahblah").select(*["id", "dt", "features", "label"])

model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxDepth=8,
    numLeaves=70,
    learningRate=0.04,
    featureFraction=0.8,
    lambdaL1=3.0,
    lambdaL2=3.0,
    posBaggingFraction=1.0,
    negBaggingFraction=0.5,
    baggingFreq=10,
    numIterations=200,
    maxBin=63,
    useBarrierExecutionMode=True,
)

trained = model.fit(train)
results = trained.transform(test)

cluster config: 3x n2-highmem-16 workers (16 vCPUs + 128 GB memory each)

spark:spark.driver.maxResultSize=1920m
spark:spark.driver.memory=3840m
spark:spark.dynamicAllocation.enabled=false
spark:spark.executor.cores=8
spark:spark.executor.instances=2
spark:spark.executor.memory=57215m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark:spark.jars.packages=com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT
spark:spark.jars.repositories=https://mmlspark.azureedge.net/maven
spark:spark.scheduler.mode=FAIR
spark:spark.shuffle.service.enabled=false
spark:spark.sql.cbo.enabled=true
spark:spark.ui.port=0
spark:spark.yarn.am.memory=640m
yarn-env:YARN_NODEMANAGER_HEAPSIZE=4000
yarn-env:YARN_RESOURCEMANAGER_HEAPSIZE=3840
yarn-env:YARN_TIMELINESERVER_HEAPSIZE=3840
yarn:yarn.nodemanager.address=0.0.0.0:8026
yarn:yarn.nodemanager.resource.cpu-vcores=16
yarn:yarn.nodemanager.resource.memory-mb=125872
yarn:yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs=86400
yarn:yarn.scheduler.maximum-allocation-mb=125872
yarn:yarn.scheduler.minimum-allocation-mb=1
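
For reference, a minimal sketch (an assumption for illustration, not taken from the original post) of setting the same executor sizing directly on the SparkSession builder instead of through Dataproc cluster properties:

from pyspark.sql import SparkSession

# Hypothetical equivalent of the "spark:" cluster properties above, with the
# Dataproc "spark:" prefix dropped; values are copied from the config dump.
spark = (
    SparkSession.builder
    .appName("lightgbm-oom-repro")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "57215m")
    .config("spark.driver.memory", "3840m")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.jars.packages",
            "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)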

**Stacktrace**

py4j.protocol.Py4JJavaError: An error occurred while calling o82.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(7, 0) finished unsuccessfully.
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1626376382032_0004_01_000002 on host: spark-ml-pipeline2-w-1.c.xxx. Exit status: 137. Diagnostics: [2021-07-15 19:41:35.679]Container killed on request. Exit code is 137
[2021-07-15 19:41:35.680]Container exited with a non-zero exit code 137.
[2021-07-15 19:41:35.680]Killed by external signal
.
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2297)
	at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1120)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1102)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.innerTrain(LightGBMBase.scala:481)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.innerTrain$(LightGBMBase.scala:440)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.innerTrain(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.$anonfun$train$1(LightGBMBase.scala:63)
	at com.microsoft.ml.spark.logging.BasicLogging.logVerb(BasicLogging.scala:63)
	at com.microsoft.ml.spark.logging.BasicLogging.logVerb$(BasicLogging.scala:60)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.logging.BasicLogging.logTrain(BasicLogging.scala:49)
	at com.microsoft.ml.spark.logging.BasicLogging.logTrain$(BasicLogging.scala:48)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.train(LightGBMBase.scala:44)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase.train$(LightGBMBase.scala:43)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:26)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:26)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

When I downsample with train = train.sample(withReplacement=False, fraction=0.25), the job runs successfully. My guess is that I could fix this by throwing more resources at the problem, but I would expect my current cluster to be more than enough given the dataset size.
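
For what it’s worth, a rough back-of-the-envelope check using the figures above (assuming, as a simplification, that the ~150 GB on-disk Parquet size is comparable to what LightGBM ends up holding in memory; the real footprint depends on encoding and compression):

# Rough capacity check from the question/config above; the in-memory-size
# assumption is a simplification, not a measurement.
dataset_gb = 150                      # ~150 GB of Parquet input
executors = 2                         # spark.executor.instances
executor_memory_gb = 57215 / 1024     # spark.executor.memory = 57215m
total_memory_gb = executors * executor_memory_gb

print(f"total executor memory ≈ {total_memory_gb:.1f} GB "
      f"for ~{dataset_gb} GB of training data")   # ≈ 111.7 GB vs 150 GB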

So far I’ve tried (a short sketch of these settings follows the list):

  • a few different mmlspark package versions
  • turning on/off useBarrierExecutionMode
  • decreasing maxBin
  • setting numTasks to a small number (3)
  • eliminating all pre-processing steps from the job (just read the parquet train/test data, then fit the model)
  • dropping all optional arguments from LightGBMClassifier specification
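
To make the list above concrete, a minimal sketch of how those attempts look on the classifier; the values shown are illustrative examples, not the exact settings from every run:

# Illustrative variant of the failing fit with the knobs mentioned above.
model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxBin=63,                      # tried lowering this
    numTasks=3,                     # tried forcing a small number of tasks
    useBarrierExecutionMode=False,  # tried both True and False
)
trained = model.fit(train)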

I am on Spark 3.0 and using com.microsoft.ml.spark:mmlspark:1.0.0-rc3-148-87ec5f74-SNAPSHOT.

Thank you!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
trillville commented, Jul 20, 2021

Thank you both for the suggestions! In my case useSingleDatasetMode=True actually led to successful model training. I think you were right that the dataset was actually right up against the limit that the cluster could handle. I don’t think the input data was imbalanced in this case, and repartitioning alone did nothing.
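
For anyone landing here later, a minimal sketch of the fix described above, assuming the mmlspark snapshot in use exposes the useSingleDatasetMode parameter:

# Same classifier as in the question, with single-dataset mode enabled so each
# executor builds one shared native LightGBM dataset rather than one per task,
# which reduces peak native memory use during training.
model = LightGBMClassifier(
    labelCol="label",
    objective="binary",
    maxBin=63,
    useSingleDatasetMode=True,  # the setting that resolved the OOM in this thread
)
trained = model.fit(train)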

0 reactions
trillville commented, Jul 21, 2021

Thanks again - everything is working for me 😃

