Huge prediction values unless using repartition(1) in Databricks with LightGBM model
I’m using the mmlspark LightGBM model for a regression problem and ran into something strange. Using the code exactly as in the example, the results are terrible because the predictions are huge (around 10^37, while the target is in the range 0 to 200).
While testing, I found that using dataset.repartition(1).cache()
fixed the problem, but with one drawback: training became much slower (around 1 hour, versus 20 minutes before). This makes sense, since all the data (about 4M rows and 150 columns) is collected into a single partition before training.
I tried setting the LightGBM param useBarrierExecutionMode
to True and tried different parallelism
params, but these changes don’t affect the result.
Is there a way to avoid the repartition workaround and still get correct results?
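One possible middle ground (my own suggestion, not something confirmed by the mmlspark docs) is to coalesce to one partition per executor instead of a single partition, so each LightGBM worker sees one contiguous chunk of data while some parallelism is kept. The helper below is a hypothetical sketch; `target_partitions` and the executor count are assumptions based on the cluster described in this issue (3 worker nodes).

```python
def target_partitions(num_executors, min_partitions=1):
    """Pick a partition count: one per executor rather than one overall.

    Hypothetical helper -- the idea is to reduce partition count enough to
    stabilize LightGBM training without funneling all data through one task.
    """
    return max(min_partitions, num_executors)

# Usage sketch (assumes a live SparkSession and the cluster from this issue):
# n = target_partitions(num_executors=3)
# repartitioned_data = data_train.coalesce(n).cache()
```

Whether coalescing to a few partitions (rather than exactly one) avoids the bad predictions is something that would need to be verified on the actual dataset.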
Code used for training:
from mmlspark.lightgbm import LightGBMRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

repartitioned_data = data_train.repartition(1).cache()  # want to delete this line

# Define model
model = LightGBMRegressor(
    objective='regression',
    labelCol='label',
    featuresCol='features'
)

# Define grid params
paramGrid = ParamGridBuilder() \
    .addGrid(model.numIterations, [100, 250]) \
    .build()

# Define cross validation for grid params
evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction',
                                metricName='mae')
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)

# Train model (on the repartitioned data, otherwise predictions blow up)
pipeline = crossval.fit(repartitioned_data)
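After fitting, a quick sanity check on the prediction range can catch the blow-up described above before any metric is computed. This is a generic sketch of my own: the function name and the slack factor are arbitrary choices, and the label range 0 to 200 comes from the issue text.

```python
def predictions_look_sane(pred_min, pred_max,
                          label_min=0.0, label_max=200.0, slack=10.0):
    """Return False if predictions fall far outside the known label range.

    A prediction of ~10^37 on a 0-200 target, as reported above, fails this
    check immediately; a slack factor tolerates reasonable extrapolation.
    """
    span = label_max - label_min
    return (label_min - slack * span) <= pred_min and \
           pred_max <= (label_max + slack * span)

# Usage sketch (assumes a fitted `pipeline` and a test DataFrame `data_test`):
# from pyspark.sql import functions as F
# stats = pipeline.transform(data_test).agg(
#     F.min('prediction'), F.max('prediction')).first()
# assert predictions_look_sane(stats[0], stats[1])
```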
- Databricks Runtime Version 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
- 3 worker nodes Standard_DS4_v2
- driver node Standard_DS4_v2
- mmlspark version mmlspark_2.11:1.0.0-rc3
Issue Analytics
- Created 3 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
@imatiach-msft The PR has been merged, please check.
@imatiach-msft Any update on this issue? We are facing the same issue and using the repartition(1) workaround, however it is not feasible for large datasets.