
Errors Training on Large Dataset : LightGBM

See original GitHub issue

I'm trying to fit a classifier with LightGBM on a large dataset. There are about 900,000,000 rows and 40 columns, 7 of which are integers treated as categorical.

  • SynapseML Version: com.microsoft.azure:synapseml_2.12:0.9.5
  • Spark Version: 3.2
  • Spark Platform: AWS EMR

The current cluster is 6 workers, each with 16 vCPUs and 64 GB RAM.
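
For a rough sense of scale, a back-of-the-envelope estimate (assuming 8-byte numeric values and ignoring Spark, JVM, and LightGBM overhead):

# Rough in-memory footprint of the raw feature matrix vs. total cluster RAM.
rows = 900_000_000
cols = 40
bytes_per_value = 8                        # assume double precision

data_gb = rows * cols * bytes_per_value / 1024**3
cluster_gb = 6 * 64                        # 6 workers x 64 GB each

print(f"raw data    ~{data_gb:.0f} GB")    # ~268 GB
print(f"cluster RAM  {cluster_gb} GB")     # 384 GB

So the raw matrix alone is on the same order as total cluster memory, before any Spark or LightGBM overhead is accounted for.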

The call to LightGBMClassifier is as follows:

from synapse.ml.lightgbm import LightGBMClassifier

lgb_estimator = LightGBMClassifier(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    categoricalSlotNames=["cat1", "cat2", "cat3", "cat4", "cat5", "cat6", "cat7"],
    numLeaves=31,
    probabilityCol="probs",
    featuresCol="features",
    labelCol="target",
    useBarrierExecutionMode=True,
)

lgbmModel = lgb_estimator.fit(df_train)
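
For context, featuresCol='features' is assumed here to be a vector column assembled upstream, roughly along the lines below (df_raw and the column selection are illustrative assumptions); categoricalSlotNames must match the slot names of the assembled vector, which VectorAssembler takes from the input column names:

from pyspark.ml.feature import VectorAssembler

# Hypothetical upstream step: pack the 7 integer categorical columns plus the
# remaining numeric columns into a single vector column named "features".
categorical_cols = [f"cat{i}" for i in range(1, 8)]
numeric_cols = [c for c in df_raw.columns if c not in categorical_cols + ["target"]]  # df_raw: raw training DataFrame (assumption)
assembler = VectorAssembler(inputCols=categorical_cols + numeric_cols, outputCol="features")
df_train = assembler.transform(df_raw)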

I have had various errors running on a 50% sample, but training completes with a much smaller sample. I switched to useBarrierExecutionMode=True, which resulted in not-enough-space-on-disk errors, so I increased the volume on all the workers. I still get errors that don't seem too helpful:

An error occurred while calling o109.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(9, 3) finished unsuccessfully.
java.net.ConnectException: Connection refused (Connection refused)

Or, when not using useBarrierExecutionMode=True, something like:

An error occurred while calling o284.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 8696) (ip-10-0-188-112.ec2.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)

My questions:

  1. Can a cluster of this size support training LightGBM on such a large dataset? My naïve assumption was that it could, but that it would be slow.
  2. When dealing with large datasets, are there any recommendations on how to set Spark properties that may help? (See the configuration sketch after this list.)
  3. Any suggestions on the cluster size needed to run this data?
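
For reference, a minimal sketch of the kind of settings that are often adjusted at this scale; the specific values below are illustrative assumptions, not maintainer recommendations:

from synapse.ml.lightgbm import LightGBMClassifier

# Executor sizing on EMR is typically set via spark-submit or cluster config, e.g.:
#   --conf spark.executor.memory=48g
#   --conf spark.executor.memoryOverhead=12g
#   --conf spark.executor.cores=16

lgb_estimator = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="target",
    useSingleDatasetMode=True,  # build one native LightGBM dataset per executor instead of one per task
    numTasks=6,                 # illustrative: one task per worker to reduce network and memory pressure
)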

AB#1833527

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
imatiach-msft commented, Jun 16, 2022

I would know this much better if I could see the cluster logs. Note the build above is just latest master; it doesn't yet include the new optimizations. I wrote to @svotaw and he said he will be able to send a PR by Monday, so sometime next week we can send you a build to try out with the new streaming optimizations to see if it helps prevent the error.

0 reactions
svotaw commented, Jul 9, 2022

It is a large set of changes to both the LightGBM and SynapseML code, so I have been splitting it up to make reviewing easier. Unfortunately, this slows down the progress of checking it in.

