Errors Training on Large Dataset: LightGBM
I'm trying to fit a classifier with LightGBM on a large dataset. There are about 900,000,000 rows and 40 columns, 7 of which are integers treated as categorical.
- SynapseML Version: com.microsoft.azure:synapseml_2.12:0.9.5
- Spark Version: 3.2
- Spark Platform: AWS EMR
The current cluster is 6 workers with 16 vCPUs and 64 GB RAM each.
The call to LightGBM is as follows:
lgb_estimator = LightGBMClassifier(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    categoricalSlotNames=["cat1", "cat2", "cat3", "cat4", "cat5", "cat6", "cat7"],
    numLeaves=31,
    probabilityCol="probs",
    featuresCol="features",
    labelCol="target",
    useBarrierExecutionMode=True,
)
lgbmModel = lgb_estimator.fit(df_train)
I have had various errors running on a 50% sample, but training completes on a much smaller sample. I switched to useBarrierExecutionMode=True, which resulted in not-enough-space-on-disk errors, so I increased the disk volume on all the workers. Now I get errors that don't seem very helpful:
An error occurred while calling o109.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(9, 3) finished unsuccessfully.
java.net.ConnectException: Connection refused (Connection refused)
Or, when not using useBarrierExecutionMode=True, something like:
An error occurred while calling o284.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 8696) (ip-10-0-188-112.ec2.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)
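One knob sometimes suggested for this kind of connection-refused failure is SynapseML's useSingleDatasetMode, which consolidates the data into a single native LightGBM dataset per executor and so reduces the number of network connections the training job has to hold open. A minimal sketch, assuming that option is available in the SynapseML release being used (verify against its docs); the other values are carried over from the call above:

```python
# Sketch: the same estimator parameters as above, collected in a dict so the
# one tweak is easy to see. useSingleDatasetMode is assumed to exist in this
# SynapseML version; check the release you are actually running.
params = dict(
    objective="binary",
    learningRate=0.1,
    numIterations=222,
    categoricalSlotNames=[f"cat{i}" for i in range(1, 8)],
    numLeaves=31,
    probabilityCol="probs",
    featuresCol="features",
    labelCol="target",
    useBarrierExecutionMode=True,
    useSingleDatasetMode=True,  # one native dataset per executor -> fewer sockets
)

# On the cluster (requires synapse.ml.lightgbm on the classpath):
# from synapse.ml.lightgbm import LightGBMClassifier
# lgbmModel = LightGBMClassifier(**params).fit(df_train)
print(sorted(params))
```

Fewer inter-task connections cannot fix an undersized cluster, but it does shrink the surface area for the ConnectException failures shown above.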
My questions:
- Can a cluster of this size support training LightGBM on such a large dataset? My naïve assumption was that it could, just slowly.
- When dealing with large datasets, are there any recommendations for Spark properties that may help?
- Any suggestions on the cluster size needed to train on this data?
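On the first question, a back-of-envelope estimate is possible with pure arithmetic, assuming 8 bytes per value for a double-encoded feature matrix; the real footprint is larger because SynapseML also copies the data into LightGBM's native memory outside the JVM heap:

```python
# Rough sizing for 900M rows x 40 cols at an assumed 8 bytes per value.
# Actual usage differs with encoding, caching, and LightGBM's native-memory
# copy, so treat this as an order-of-magnitude check only.
rows, cols, bytes_per_value = 900_000_000, 40, 8
raw_gb = rows * cols * bytes_per_value / 1e9   # raw matrix size in GB
cluster_gb = 6 * 64                            # 6 workers x 64 GB RAM
print(f"raw matrix ~{raw_gb:.0f} GB vs cluster RAM {cluster_gb} GB")
```

The raw matrix alone (~288 GB) is close to the total cluster RAM (384 GB) before Spark's own overhead and the native copy are counted, which is consistent with the disk-spill and connection failures seen at this scale.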
AB#1833527
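On the Spark-properties question, these are settings commonly raised for large jobs that hand data to a native library; the property names are real Spark configuration keys, but the values are illustrative assumptions to adapt, not recommendations from the maintainers:

```
# Illustrative spark-defaults.conf entries sometimes tuned for large
# LightGBM training; the numbers are assumptions, not tested values.
spark.network.timeout            600s
spark.executor.memoryOverhead    8g
spark.sql.shuffle.partitions     96
```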
Issue Analytics
- Created a year ago
- Comments:12 (5 by maintainers)
Top GitHub Comments
I would know this much better if I could see the cluster logs. Note the build above is just latest master, it doesn’t yet include the new optimizations. I wrote to @svotaw and he wrote that he will be able to send a PR by Monday, so sometime next week we can send you a build to try out with the new streaming optimizations to see if it helps prevent the error.
It is a large set of changes to both LightGBM and SynapseML code, so I have been splitting it up to make reviewing easier. Unfortunately, this slows down checking it in.