Huge prediction values unless using repartition(1) in Databricks with LightGBM model
I’m using the mmlspark LightGBM model for a regression problem and ran into something strange. Using the code exactly as in the example, the results are terrible because the predictions are huge (around 10^37, while the target is in the range 0 to 200).
While testing, I found that using dataset.repartition(1).cache()
fixed the problem, but with one drawback: training became much slower (around 1 hour, versus 20 minutes before). This makes sense, since all the data (about 4M rows and 150 columns) is collected into a single partition before training.
I tried setting the LightGBM param useBarrierExecutionMode
to True and tried different parallelism
params, but these changes don’t affect the result.
Is there a way to avoid the repartition workaround and still get correct results?
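One possible middle ground (my own suggestion, not something confirmed by the mmlspark docs) is to coalesce to one partition per executor instead of a single partition, so each LightGBM worker sees one contiguous chunk of data while some parallelism is kept. The helper below is a hypothetical sketch; `target_partitions` and the executor count are assumptions based on the cluster described in this issue (3 worker nodes).

```python
def target_partitions(num_executors, min_partitions=1):
    """Pick a partition count: one per executor rather than one overall.

    Hypothetical helper -- the idea is to reduce partition count enough to
    stabilize LightGBM training without funneling all data through one task.
    """
    return max(min_partitions, num_executors)

# Usage sketch (assumes a live SparkSession and the cluster from this issue):
# n = target_partitions(num_executors=3)
# repartitioned_data = data_train.coalesce(n).cache()
```

Whether coalescing to a few partitions (rather than exactly one) avoids the bad predictions is something that would need to be verified on the actual dataset.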
Code used for training:
from mmlspark.lightgbm import LightGBMRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

repartitioned_data = data_train.repartition(1).cache()  # want to delete this line

# Define model
model = LightGBMRegressor(
    objective='regression',
    labelCol='label',
    featuresCol='features'
)

# Define grid params
paramGrid = ParamGridBuilder() \
    .addGrid(model.numIterations, [100, 250]) \
    .build()

# Define cross validation for grid params
evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction',
                                metricName='mae')
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)

# Train model (on the repartitioned data, otherwise predictions blow up)
pipeline = crossval.fit(repartitioned_data)
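After fitting, a quick sanity check on the prediction range can catch the blow-up described above before any metric is computed. This is a generic sketch of my own: the function name and the slack factor are arbitrary choices, and the label range 0 to 200 comes from the issue text.

```python
def predictions_look_sane(pred_min, pred_max,
                          label_min=0.0, label_max=200.0, slack=10.0):
    """Return False if predictions fall far outside the known label range.

    A prediction of ~10^37 on a 0-200 target, as reported above, fails this
    check immediately; a slack factor tolerates reasonable extrapolation.
    """
    span = label_max - label_min
    return (label_min - slack * span) <= pred_min and \
           pred_max <= (label_max + slack * span)

# Usage sketch (assumes a fitted `pipeline` and a test DataFrame `data_test`):
# from pyspark.sql import functions as F
# stats = pipeline.transform(data_test).agg(
#     F.min('prediction'), F.max('prediction')).first()
# assert predictions_look_sane(stats[0], stats[1])
```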
- Databricks Runtime Version 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
- 3 worker nodes Standard_DS4_v2
- driver node Standard_DS4_v2
- mmlspark version mmlspark_2.11:1.0.0-rc3
Issue Analytics
- Created 3 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
@imatiach-msft The PR has been merged, please check.
@imatiach-msft Any update on this issue? We are facing the same issue and using the repartition(1) workaround, however it is not feasible for large datasets.