[BUG] running an Mlflow Project from github URL with 'main' branch instead of 'master'

Thank you for submitting an issue. Please refer to our issue policy for additional information about bug reports. For help with debugging your code, please refer to Stack Overflow.

Please fill in this bug report template to ensure a timely and thorough response.

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey 12.2.1
  • MLflow installed from (source or binary): pip install mlflow
  • MLflow version (run mlflow --version): 1.25.1
  • Python version: 3.8.12
  • npm version, if running the dev UI: -
  • Exact command to reproduce: mlflow run https://github.com/adin786/mlflow-test.git

Describe the problem

Describe the problem clearly here. Include descriptions of the expected behavior and the actual behavior.

I have written an MLflow Project, specified by an MLproject file, in a GitHub-hosted repo. When creating this repo, GitHub defaults to calling the starting branch ‘origin/main’ rather than the old name ‘origin/master’.

This causes an issue for MLflow. When I run mlflow run https://github.com/adin786/mlflow-test.git without specifying a branch name directly, it fails with the following error message:

2022/04/18 22:49:00 INFO mlflow.projects.utils: === Fetching project from https://github.com/adin786/mlflow-test.git into /var/folders/n6/8q8ymx711f981rrq8_6rqnj80000gp/T/tmposbzzzxi ===
Traceback (most recent call last):
  File "/Users/azam/miniforge3/envs/mlflow-test/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/cli.py", line 178, in run
    projects.run(
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/projects/__init__.py", line 329, in run
    submitted_run_obj = _run(
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/projects/__init__.py", line 95, in _run
    submitted_run = backend.run(
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/projects/backend/local.py", line 46, in run
    work_dir = fetch_and_validate_project(project_uri, version, entry_point, params)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/projects/utils.py", line 130, in fetch_and_validate_project
    work_dir = _fetch_project(uri=uri, version=version)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/projects/utils.py", line 164, in _fetch_project
    _fetch_git_repo(parsed_uri, version, dst_dir)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/mlflow/projects/utils.py", line 202, in _fetch_git_repo
    repo.create_head("master", origin.refs.master)
  File "/Users/azam/miniforge3/envs/mlflow-test/lib/python3.8/site-packages/git/util.py", line 1001, in __getattr__
    return list.__getattribute__(self, attr)
AttributeError: 'IterableList' object has no attribute 'origin/master'

It looks like the module mlflow.projects.utils contains a call to repo.create_head("master", origin.refs.master) if the --version flag is not specified. So mlflow is assuming there is a ‘master’ branch and fails because my repo only contains ‘main’.

I found that the issue does not occur if I do one of the following:

  • I specify --version with a commit hash like cf576c14af64e4814aa31e7e524c1a6d9f024266
  • I specify --version with main
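
The second workaround can be scripted. The following is a minimal sketch, assuming the mlflow CLI is on the PATH. build_run_command is a hypothetical helper introduced only for illustration; --version is the flag named in the list above.

```python
import shlex


def build_run_command(project_uri, version=None):
    """Build the argument list for `mlflow run`, optionally pinning a branch or commit."""
    command = ["mlflow", "run", project_uri]
    if version is not None:
        # --version takes a branch name or a commit hash, as in the workarounds above
        command += ["--version", version]
    return command


command = build_run_command("https://github.com/adin786/mlflow-test.git", version="main")
# Print the shell-ready command; pass `command` to subprocess.run(...) to execute it
print(shlex.join(command))
```

Passing the branch explicitly avoids the code path that assumes a ‘master’ branch exists.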

Suggestion: First check if master or main exist as branches.
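
A minimal sketch of that suggestion, kept independent of GitPython so it runs anywhere. pick_default_branch and its preferred parameter are hypothetical names, not MLflow code, and the input is assumed to be the list of branch names found on the remote.

```python
def pick_default_branch(branch_names, preferred=("master", "main")):
    """Return the first preferred branch that exists, or raise if none does."""
    available = set(branch_names)
    for name in preferred:
        if name in available:
            return name
    raise ValueError(
        f"None of {list(preferred)} found among branches: {sorted(available)}"
    )


print(pick_default_branch(["main", "feature-x"]))
print(pick_default_branch(["master", "main"]))
```

This approach still fails for a repository whose default branch has some other name, a case the comments further down discuss.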

Code to reproduce issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Create a GitHub repo with the following 3 files:

MLproject file:

name: project_test

conda_env: conda.yaml

entry_points: 
  main:
    command: "python train.py"

conda.yaml file:

name: mlflow-test
channels:
  - conda-forge
dependencies:
  - pip=22.0.4
  - python=3.8.12
  - pip:
    - mlflow==1.25.1

train.py file:

import mlflow

with mlflow.start_run():
    mlflow.log_param('my_param', 99)
    mlflow.log_metric('my_metric', 55555)

Then run:

git add .
git commit -m "commit message"
git push
mlflow run <github-repo-url>

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

[See above]

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
jagane-infinstor commented, May 15, 2022

I tried the fix mentioned above, i.e. using repo.heads.default.checkout(). It worked for me with the following cases:

  1. A public GitHub repo that does not require auth, at https://github.com/jagane-infinstor/mlflow-example-docker.git
  2. An AWS CodeCommit private repo that requires auth

I think this is a better solution than my proposal to try to check out a branch called ‘main’ if the attempt to check out a branch called ‘master’ fails.

1 reaction
harupy commented, May 15, 2022

I tested the change below and it worked:

diff --git a/mlflow/projects/utils.py b/mlflow/projects/utils.py
index faf250488..f637df2b2 100644
--- a/mlflow/projects/utils.py
+++ b/mlflow/projects/utils.py
@@ -200,8 +200,8 @@ def _fetch_git_repo(uri, version, dst_dir):
             )
     else:
         origin.fetch(depth=GIT_FETCH_DEPTH)
-        repo.create_head("master", origin.refs.master)
-        repo.heads.master.checkout()
+        repo.create_head("default", origin.refs[0])
+        repo.heads.default.checkout()
     repo.submodule_update(init=True, recursive=True)

How this works

  1. origin.fetch(depth=GIT_FETCH_DEPTH) fetches the default branch. It might be “master”, “main”, “foo”, “bar”, or something else. We can just let GitPython determine that.
  2. origin.refs[0] refers to the default branch, so we create a local head named “default” from it and check that out.
  3. Done.

I’m not 100% confident this works so we need to carefully test it.
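
Given that uncertainty, one way to cross-check which branch a remote treats as its default is to read the symbolic HEAD reported by git ls-remote --symref <url> HEAD. The sketch below is a different technique from the patch above, not part of it. parse_default_branch is a hypothetical helper, and the sample output is illustrative, with a placeholder hash.

```python
def parse_default_branch(ls_remote_output):
    """Extract the branch name from `git ls-remote --symref <url> HEAD` output.

    Returns None when the output carries no symbolic ref for HEAD.
    """
    prefix = "ref: refs/heads/"
    suffix = "\tHEAD"
    for line in ls_remote_output.splitlines():
        # The symref line has the form: ref: refs/heads/<branch><TAB>HEAD
        if line.startswith(prefix) and line.endswith(suffix):
            return line[len(prefix):-len(suffix)]
    return None


# Illustrative output only; the hash is a placeholder, not a real commit
sample_output = (
    "ref: refs/heads/main\tHEAD\n"
    "0000000000000000000000000000000000000000\tHEAD\n"
)
print(parse_default_branch(sample_output))
```

Comparing this value with the branch that origin.refs[0] resolves to would show whether the patch picks the expected branch for a given repository.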

How can we test this?

We can test this by creating a few MLflow project repositories with different default branches (e.g. “master”, “main”, and “foo”) and running mlflow run against each of them to make sure the default branch is fetched regardless of its name.
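
The fixture half of that plan could look like the sketch below, assuming the git executable is installed. All helper names are hypothetical. The sketch only creates local repositories with differently named default branches; publishing them and running mlflow run against each URL is left out.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

GIT_AVAILABLE = shutil.which("git") is not None


def _git(repo_dir, *args):
    """Run a git command inside repo_dir and return its trimmed stdout."""
    result = subprocess.run(
        ["git", *args], cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def make_project_repo(parent_dir, branch):
    """Create a local repo with one commit whose only branch is `branch`."""
    repo_dir = Path(parent_dir) / f"project-{branch}"
    repo_dir.mkdir()
    _git(repo_dir, "init")
    # Point HEAD at the desired branch before the first commit is made
    _git(repo_dir, "symbolic-ref", "HEAD", f"refs/heads/{branch}")
    (repo_dir / "MLproject").write_text("name: project_test\n")
    _git(repo_dir, "add", "MLproject")
    _git(
        repo_dir,
        "-c", "user.name=test",
        "-c", "user.email=test@example.com",
        "-c", "commit.gpgsign=false",
        "commit", "-m", "initial commit",
    )
    return repo_dir


def created_branch_name(branch):
    """Build a throwaway repo and report the branch its HEAD points to."""
    with tempfile.TemporaryDirectory() as parent_dir:
        repo_dir = make_project_repo(parent_dir, branch)
        return _git(repo_dir, "symbolic-ref", "--short", "HEAD")


if GIT_AVAILABLE:
    for name in ("master", "main", "foo"):
        print(name, "->", created_branch_name(name))
```

Setting HEAD with git symbolic-ref before the first commit keeps the sketch independent of the local init.defaultBranch setting.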
