Errors Training on Large Dataset: LightGBM
I'm trying to fit a classifier with LightGBM on a large dataset. There are about 900,000,000 rows and 40 columns, 7 of which are integers treated as categorical.
- SynapseML Version: com.microsoft.azure:synapseml_2.12:0.9.5
- Spark Version: 3.2
- Spark Platform: AWS EMR
The current cluster is 6 workers with 16 vCPUs and 64 GB RAM each.
The call to LightGBM is as follows:
lgb_estimator = LightGBMClassifier(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    categoricalSlotNames=["cat1", "cat2", "cat3", "cat4", "cat5", "cat6", "cat7"],
    numLeaves=31,
    probabilityCol="probs",
    featuresCol="features",
    labelCol="target",
    useBarrierExecutionMode=True,
)
lgbmModel = lgb_estimator.fit(df_train)
I have had various errors running on a 50% sample, but training completes on a much smaller sample. I switched to useBarrierExecutionMode=True, which resulted in not-enough-space-on-disk errors, so I increased the disk volume on all the workers. Now I get errors that don't seem very helpful:
An error occurred while calling o109.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(9, 3) finished unsuccessfully.
java.net.ConnectException: Connection refused (Connection refused)
Or, when not using useBarrierExecutionMode=True, something like:
An error occurred while calling o284.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 8696) (ip-10-0-188-112.ec2.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)
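One knob sometimes suggested for this kind of connection-refused failure is SynapseML's useSingleDatasetMode, which consolidates the data into a single native LightGBM dataset per executor and so reduces the number of network connections the training job has to hold open. A minimal sketch, assuming that option is available in the SynapseML release being used (verify against its docs); the other values are carried over from the call above:

```python
# Sketch: the same estimator parameters as above, collected in a dict so the
# one tweak is easy to see. useSingleDatasetMode is assumed to exist in this
# SynapseML version; check the release you are actually running.
params = dict(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    categoricalSlotNames=[f"cat{i}" for i in range(1, 8)],
    numLeaves=31,
    probabilityCol="probs",
    featuresCol="features",
    labelCol="target",
    useBarrierExecutionMode=True,
    useSingleDatasetMode=True,  # one native dataset per executor -> fewer sockets
)

# On the cluster (requires synapse.ml.lightgbm on the classpath):
# from synapse.ml.lightgbm import LightGBMClassifier
# lgbmModel = LightGBMClassifier(**params).fit(df_train)
print(sorted(params))
```

Fewer inter-task connections cannot fix an undersized cluster, but it does shrink the surface area for the ConnectException failures shown above.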
My questions:
- Can a cluster of this size support training LightGBM on such a large dataset? My naïve assumption was that it could, just slowly.
- When dealing with large datasets, are there any recommendations for Spark properties that may help?
- Any suggestions on the cluster size needed to train on this data?
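On the first question, a back-of-envelope estimate is possible with pure arithmetic, assuming 8 bytes per value for a double-encoded feature matrix; the real footprint is larger because SynapseML also copies the data into LightGBM's native memory outside the JVM heap:

```python
# Rough sizing for 900M rows x 40 cols at an assumed 8 bytes per value.
# Actual usage differs with encoding, caching, and LightGBM's native-memory
# copy, so treat this as an order-of-magnitude check only.
rows, cols, bytes_per_value = 900_000_000, 40, 8
raw_gb = rows * cols * bytes_per_value / 1e9   # raw matrix size in GB
cluster_gb = 6 * 64                            # 6 workers x 64 GB RAM
print(f"raw matrix ~{raw_gb:.0f} GB vs cluster RAM {cluster_gb} GB")
```

The raw matrix alone (~288 GB) is close to the total cluster RAM (384 GB) before Spark's own overhead and the native copy are counted, which is consistent with the disk-spill and connection failures seen at this scale.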
AB#1833527
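On the Spark-properties question, these are settings commonly raised for large jobs that hand data to a native library; the property names are real Spark configuration keys, but the values are illustrative assumptions to adapt, not recommendations from the maintainers:

```
# Illustrative spark-defaults.conf entries sometimes tuned for large
# LightGBM training; the numbers are assumptions, not tested values.
spark.network.timeout            600s
spark.executor.memoryOverhead    8g
spark.sql.shuffle.partitions     96
```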
Issue Analytics
- Created a year ago
- Comments:12 (5 by maintainers)
Top GitHub Comments
I would know this much better if I could see the cluster logs. Note the build above is just latest master, it doesn’t yet include the new optimizations. I wrote to @svotaw and he wrote that he will be able to send a PR by Monday, so sometime next week we can send you a build to try out with the new streaming optimizations to see if it helps prevent the error.
It is a large set of changes to both LightGBM and SynapseML code, so I have been splitting it up to make reviewing easier. Unfortunately, this slows down checking it in.