Issues with mlflow sagemaker deploy
System information
Code to reproduce issue
Describe the problem
I want to deploy an MLflow Spark app to SageMaker. Is this possible? I can successfully push an image to ECR with
mlflow sagemaker build-and-push-container
but when I then attempt to deploy this image with
mlflow sagemaker deploy …
it fails with a timeout while creating the endpoint (the full command sequence I run is sketched below, after the errors). When I look at the logs in CloudWatch, multiple errors appear, for example:
py4j.protocol.Py4JJavaError: An error occurred while calling o75.load.
or
2022/07/15 13:41:44 [error] 453#453: *21 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.32.0.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://127.0.0.1:8000/ping", host: "model.aws.local:8080"
or
gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>
or
CondaValueError: prefix already exists: /miniconda/envs/custom_env
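For reference, the sequence of commands I run is roughly the following; the app name, model URI, execution role ARN, image URL, and region are placeholders, and the exact deploy flags may differ slightly between MLflow versions:
mlflow sagemaker build-and-push-container
mlflow sagemaker deploy -a <app-name> -m <model-uri> -e <execution-role-arn> -i <ecr-image-url> --region-name <region>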
I use libraries like:
from pyspark.sql import SparkSession
import pyspark.sql.types as T
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
from pyspark.ml.feature import Bucketizer
from pyspark.ml.evaluation import RegressionEvaluator
from pandas import DataFrame
import mlflow
import mlflow.spark
I use Java 18.0.333, Spark 3.1.2, and Hadoop 3.2, plus all other necessary pip installations. Any help would be appreciated.
Other info / logs
Attached: log-events-viewer-result (1).csv
@harupy When I check whether the model serves locally, it fails with the error:
2022/07/18 11:04:27 INFO mlflow.sagemaker: executing: docker run -v C:\Users\....\mlruns\5\0a92547...\artifacts\model:/opt/ml/model/ -p 5000:8080 -e MLFLOW_DEPLOYMENT_FLAVOR_NAME=python_function -e SERVING_ENVIRONMENT=SageMaker --rm mlflow-pyfunc serve docker: Error response from daemon: driver failed programming external connectivity on endpoint epic_kapitsa (8d05038a9156c5aba360d24c3756d249171b95bfd0f4cd9ff8c2168d6b4d3f7a): Bind for 0.0.0.0:5000 failed: port is already allocated.
Would this be causing the deploy stage to fail?
Never mind, it's attempting to run now; I had to stop the MLflow UI that was running in the background. Will update you with the output.
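Pointing the local run at a free host port would also have avoided the collision with the MLflow UI; roughly, assuming the mlflow sagemaker run-local command that produced the docker run above, with a placeholder model URI:
mlflow sagemaker run-local -m <model-uri> -p 5001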
@hollytb You can also try running this command to check if model serving works locally:
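For example, serving the pyfunc flavor directly, with a placeholder model URI (the exact flags depend on your MLflow version):
mlflow models serve -m runs:/<run-id>/model -p 5001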