Large prediction results unless using repartition(1) in databricks with lgbm model

I’m using the mmlspark LightGBM model for a regression problem and ran into something strange. With the standard code from the example, the results are terrible because the predictions are huge (around 10^37, while the target ranges from 0 to 200). While testing, I found that dataset.repartition(1).cache() fixes the problem, with one drawback: training takes much longer (around 1 h, compared with 20 min before). That makes sense, since all the data (about 4 million rows and 150 columns) is collected into a single partition before training.
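
The symptom described above (predictions near 10^37 against a 0 to 200 target) can be caught automatically with a small range check. The following is only a minimal sketch: the function name `out_of_range_fraction` and the `tolerance` parameter are made up for illustration and are not part of mmlspark or Spark.

```python
import math


def out_of_range_fraction(predictions, low, high, tolerance=0.5):
    """Return the share of predictions that are non-finite or fall far
    outside the known label range [low, high].

    `tolerance` (hypothetical parameter) widens the accepted interval by
    that fraction of the label range on each side.
    """
    if not predictions:
        return 0.0
    margin = tolerance * (high - low)
    lower, upper = low - margin, high + margin
    bad = sum(
        1
        for p in predictions
        if not math.isfinite(p) or p < lower or p > upper
    )
    return bad / len(predictions)


# Example with the label range from this issue (0 to 200)
sample = [10.0, 150.0, 1e37, float("inf")]
print(out_of_range_fraction(sample, 0, 200))
```

In a Spark job the `predictions` list would presumably come from a small collected sample of the `prediction` column, so the check stays cheap even on a large dataset.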

I tried setting the LightGBM param useBarrierExecutionMode to True and tried different parallelism params, but these changes don’t affect the result.

Is there a way to avoid the repartition workaround and still get normal results?

Code used for training:

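      from mmlspark.lightgbm import LightGBMRegressor
      from pyspark.ml.evaluation import RegressionEvaluator
      from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

      # Workaround: collapse the training data into a single partition
      repartitioned_data = data_train.repartition(1).cache()  # want to delete this line

      # Define model
      model = LightGBMRegressor(
          objective='regression',
          labelCol='label',
          featuresCol='features'
      )

      # Define grid params
      paramGrid = ParamGridBuilder() \
          .addGrid(model.numIterations, [100, 250]) \
          .build()

      # Define cross validation for grid params
      evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction', metricName='mae')
      crossval = CrossValidator(estimator=model,
                                estimatorParamMaps=paramGrid,
                                evaluator=evaluator,
                                numFolds=2)

      # Train model (on the single-partition data while the workaround is in place)
      pipeline = crossval.fit(repartitioned_data)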
  • Databricks Runtime Version 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
  • 3 worker nodes Standard_DS4_v2
  • driver node Standard_DS4_v2
  • mmlspark version mmlspark_2.11:1.0.0-rc3
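
Since repartition(1) is slow, one experiment that might narrow the problem down is to check whether a partition count larger than 1 still gives sane predictions. This is not a confirmed fix, only a sketch of a search harness: `largest_healthy_partition_count` and the `is_healthy` callback are hypothetical names, and the callback is assumed to train on `data_train.repartition(n)` and report whether the predictions stay in a plausible range.

```python
def largest_healthy_partition_count(candidates, is_healthy):
    """Return the largest partition count for which `is_healthy(n)` is True,
    or None if no candidate passes.

    `is_healthy` is a caller-supplied (hypothetical) callback that trains
    with n partitions and checks the resulting predictions.
    """
    # Try the largest counts first, since more partitions mean faster training.
    for n in sorted(set(candidates), reverse=True):
        if is_healthy(n):
            return n
    return None


# Stand-in callback for illustration only: pretend that up to 3 partitions
# (one per worker node in this cluster) still produce sane predictions.
def fake_is_healthy(n):
    return n <= 3


print(largest_healthy_partition_count([1, 2, 3, 6, 12, 24], fake_is_healthy))
```

Each real call to `is_healthy` means a full training run (20 minutes or more with the setup above), so the candidate list should be kept short.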

AB#1984587

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

3 reactions
shiyu1994 commented, Apr 22, 2021

@imatiach-msft The PR has been merged, please check.

0 reactions
shsab commented, Sep 7, 2022

@imatiach-msft Any update on this issue? We are facing the same issue and using the repartition(1) workaround; however, it is not feasible for large datasets.
