
[BUG] Spark autolog `sparkDatasourceInfo` tag not reset between runs


Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Databricks Runtime 9.1 LTS
  • MLflow installed from (source or binary): %pip install mlflow
  • MLflow version (run mlflow --version): mlflow, version 1.24.0
  • Python version: not specified (the MLflow version was pasted here by mistake)
  • Exact command to reproduce: mlflow.spark.autolog()

Describe the problem

When I enable mlflow.spark.autolog() and then trigger two different runs with mlflow.start_run(), reading a different data source via spark.read.load(xxx) under each run, the run's sparkDatasourceInfo tag accumulates the sources from both runs and shows them in the experiments UI. I expect each run to start with a clean sparkDatasourceInfo tag and record only the data sources read within that run.

Code to reproduce issue

import mlflow

mlflow.spark.autolog()

# First run reads path1; its sparkDatasourceInfo tag should list only path1.
with mlflow.start_run():
    df = spark.read.format("csv").load("path1")
    display(df)

# Second run reads path2, but its sparkDatasourceInfo tag also carries
# over path1 from the first run.
with mlflow.start_run():
    df = spark.read.format("csv").load("path2")
    display(df)
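The accumulation the report describes matches a familiar pattern: per-run state that is appended to but never cleared when a new run starts. The sketch below illustrates that pattern in plain Python; all names are hypothetical and do not reflect MLflow's internals.

```python
# Plain-Python illustration of the reported leak: module-level state that
# accumulates across runs because nothing clears it when a run starts.
# All names here are hypothetical; this is not MLflow's implementation.

_datasource_info = []  # survives across runs -> the source of the leak

def record_datasource(path, fmt):
    """Called whenever a datasource is read during a run."""
    _datasource_info.append(f"path={path},format={fmt}")

def current_tag():
    """Builds the sparkDatasourceInfo-style tag from recorded sources."""
    return "\n".join(_datasource_info)

def on_run_start():
    """Expected behavior: per-run state is reset when a new run begins."""
    _datasource_info.clear()

# Run 1 reads path1.
on_run_start()
record_datasource("path1", "csv")
tag_run1 = current_tag()  # "path=path1,format=csv"

# Run 2 reads path2. Without the on_run_start() reset, tag_run2 would
# also contain path1 -- which is the behavior the issue reports.
on_run_start()
record_datasource("path2", "csv")
tag_run2 = current_tag()  # "path=path2,format=csv"
```

With the on_run_start() reset omitted, tag_run2 would come out as both sources joined by a newline, mirroring what the reporter sees in the experiments UI.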

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
dbczumar commented, Mar 29, 2022

Hi @serena-ruan. Thank you for raising this! Would you be able to root cause the issue and file a fix? Happy to help provide guidance and PR review!

0 reactions
serena-ruan commented, Apr 6, 2022

Ah I see, thanks for the clarification! I’ll try to fix it this week 😄

