[BUG] Spark autolog `sparkDatasourceInfo` tag not reset between runs
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
- Yes. I can contribute a fix for this bug independently.
- Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
- No. I cannot contribute a bug fix at this time.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Databricks 9.1LTS runtime
- MLflow installed from (source or binary): %pip install mlflow
- MLflow version (run `mlflow --version`): mlflow, version 1.24.0
- Python version: mlflow, version 1.24.0
- Exact command to reproduce: mlflow.spark.autolog()
Describe the problem
When I enable `mlflow.spark.autolog()` and then start two separate runs with `with mlflow.start_run()`, each reading a different data source via `spark.read.load(xxx)`, the `sparkDatasourceInfo` tag accumulates the sources across runs, and the combined list shows up in the experiment. I expect each run to start with a clean `sparkDatasourceInfo` tag and record only the data sources read within that run.
Code to reproduce issue
import mlflow

mlflow.spark.autolog()
# `spark` and `display` are provided by the Databricks notebook environment.
with mlflow.start_run():
    df = spark.read.format("csv").load("path1")
    display(df)
with mlflow.start_run():
    df = spark.read.format("csv").load("path2")
    display(df)
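To make the symptom easy to check, here is a minimal sketch that compares the `sparkDatasourceInfo` values of two consecutive runs. The helper name `leaked_sources` is hypothetical, and the sketch assumes the tag stores one data source entry per line (for example `path=path1,format=csv`); that layout is an assumption, not something stated in this report.

```python
from typing import List


def leaked_sources(previous_tag: str, current_tag: str) -> List[str]:
    """Return entries in current_tag that already appeared in previous_tag.

    Assumption: each tag value holds one data source entry per line.
    """
    # Collect the non-empty entries recorded by the earlier run.
    previous_entries = {line for line in previous_tag.splitlines() if line}
    # Any entry of the later run that is also in the earlier run has carried over.
    return [line for line in current_tag.splitlines() if line in previous_entries]
```

The tag values themselves can be read with `mlflow.tracking.MlflowClient().get_run(run_id).data.tags.get("sparkDatasourceInfo")`. If the second run had read only `path2`, a non-empty result would indicate that the first run's source carried over.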
What component(s), interfaces, languages, and integrations does this bug affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (3 by maintainers)

Hi @serena-ruan. Thank you for raising this! Would you be able to root cause the issue and file a fix? Happy to help provide guidance and PR review!
Ah I see, thanks for the clarification! I’ll try to fix it this week 😄