[BUG] MLflow quickstart reporting failure
Thank you for submitting an issue. Please refer to our issue policy for additional information about bug reports. For help with debugging your code, please refer to Stack Overflow.
Please fill in this bug report template to ensure a timely and thorough response.
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
- Yes. I can contribute a fix for this bug independently.
- Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
- No. I cannot contribute a bug fix at this time.
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Manjaro Linux 5.16.14-1-MANJARO
- MLflow installed from (source or binary): pip install mlflow
- MLflow version (run mlflow --version): 1.24.0
- Python version: Python 3.9.12
- npm version, if running the dev UI: N/A
- Exact command to reproduce:
mlflow run mlflow/examples/pytorch/ --no-conda --storage-dir $(pwd)/mlruns --experiment-name Test-Exp
Describe the problem
Describe the problem clearly here. Include descriptions of the expected behavior and the actual behavior.
The PyTorch MNIST example ends with a failure for no apparent reason: training completes and the sample predictions are printed, but the run is then reported as failed. I’d like to get more debug output, but I don’t see an obvious way to do this.
$ mlflow run mlflow/examples/pytorch/ --no-conda --storage-dir $(pwd)/mlruns --experiment-name Test-Exp
2022/04/08 12:19:38 INFO mlflow.projects.utils: === Created directory [removed]/mlruns/tmprbb_ti6y for downloading remote URIs passed to arguments of type 'path' ===
2022/04/08 12:19:38 INFO mlflow.projects.backend.local: === Running command 'python mnist_tensorboard_artifact.py \
--batch-size 64 \
--test-batch-size 1000 \
--epochs 10 \
--lr 0.01 \
--momentum 0.5 \
--enable-cuda True \
--seed 5 \
--log-interval 100
' in run with ID 'ccdf036cada144fe9c91d48c15aa26d8' ===
...
Uploading TensorBoard events as a run artifact...
...
Sample predictions
Sample 0 : Ground truth is "3", model prediction is "5"
Sample 1 : Ground truth is "4", model prediction is "4"
Sample 2 : Ground truth is "3", model prediction is "3"
Sample 3 : Ground truth is "1", model prediction is "1"
Sample 4 : Ground truth is "2", model prediction is "2"
2022/04/08 12:22:12 ERROR mlflow.cli: === Run (ID 'ccdf036cada144fe9c91d48c15aa26d8') failed ===
Code to reproduce issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Server:
mlflow server --backend-store-uri sqlite:///mlflow.sqlite --default-artifact-root $(pwd)/mlruns --host 0.0.0.0
Runner:
mlflow run mlflow/examples/pytorch/ --no-conda --storage-dir $(pwd)/mlruns --experiment-name Test-Exp
The code at mlflow…mnist_tensorboard_artifact.py gives me this error, and the Tracking UI also shows the run as failed. Artifacts are logged correctly once the directories point to the right locations.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
NA – request what is needed
What component(s), interfaces, languages, and integrations does this bug affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (2 by maintainers)
Hi @jeinstei, can you try running python mnist_tensorboard_artifact.py directly? This might give us more detailed error logs.
Will do. I’ll see if it comes up again once we spin up again.
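For anyone hitting the same symptom, the suggestion to run the script directly can be sketched as a small wrapper that surfaces the exit code and stderr, which the "Run ... failed" line from mlflow run does not show. This is only a minimal sketch using the Python standard library; the helper name run_and_report is hypothetical and not part of MLflow, and it makes no claim about what the actual cause of this failure was.

```python
import subprocess
import sys


def run_and_report(cmd, cwd=None):
    """Run a command, echo its output, and return (exit_code, stderr_text).

    Hypothetical helper for illustration; not an MLflow API.
    """
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    # Pass the child's stdout through unchanged.
    print(result.stdout, end="")
    if result.returncode != 0:
        # A non-zero exit code is one possible reason a run gets reported
        # as failed, so print it together with whatever was written to stderr.
        print(f"exit code: {result.returncode}", file=sys.stderr)
        print(result.stderr, end="", file=sys.stderr)
    return result.returncode, result.stderr
```

Assuming the current directory is mlflow/examples/pytorch/, a call such as run_and_report([sys.executable, "mnist_tensorboard_artifact.py", "--epochs", "1"]) would run the example outside of mlflow run; the --epochs argument is taken from the command shown in the log above, and any traceback would then appear on stderr.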