
Cannot load model

See original GitHub issue

🐛 Describe the bug

I am trying to deploy a locally pretrained model via SageMaker to create an endpoint and use it.

I deployed a model:

    from sagemaker.pytorch import PyTorchModel

    pytorch_model = PyTorchModel(
        model_data="model.tar.gz",
        role=role,
        entry_point="inference.py",
        framework_version="1.9.0",
        py_version="py38",
    )

    predictor = pytorch_model.deploy(
        instance_type="ml.g4dn.xlarge",
        initial_instance_count=1,
    )

and ran a prediction:

    from PIL import Image

    data = Image.open("./samples/inputs/1.jpg")
    result = predictor.predict(data)
    img = Image.open(result)
    img.show()
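(As an aside, one way to take the SDK's default serializer out of the picture while debugging is to send the raw image bytes straight to the endpoint with boto3. A minimal sketch, assuming a hypothetical endpoint name and an `image/jpeg` content type, neither of which appears in the original report:)

    import boto3

    # Hypothetical endpoint name: use the one created by pytorch_model.deploy().
    ENDPOINT_NAME = "my-pytorch-endpoint"

    runtime = boto3.client("sagemaker-runtime")

    # Read the raw JPEG bytes so the container's transform_fn receives them unchanged.
    with open("./samples/inputs/1.jpg", "rb") as f:
        payload = f.read()

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="image/jpeg",  # assumed content type; must match what transform_fn expects
        Body=payload,
    )
    result_bytes = response["Body"].read()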

As a result, I got this error:

    ModelError                                Traceback (most recent call last)
    /tmp/ipykernel_4268/3704626012.py in <cell line: 4>()
          2
          3 data = Image.open('./samples/inputs/1.jpg')
    ----> 4 result = predictor.predict(data)
          5
          6 img = Image.open(result)

    ~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
        159             data, initial_args, target_model, target_variant, inference_id
        160         )
    --> 161         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
        162         return self._handle_response(response)
        163

    ~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
        506             )
        507             # The "self" in this scope is referring to the BaseClient.
    --> 508             return self._make_api_call(operation_name, kwargs)
        509
        510         _api_call.__name__ = str(py_operation_name)

    ~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
        913             error_code = parsed_response.get("Error", {}).get("Code")
        914             error_class = self.exceptions.from_code(error_code)
    --> 915             raise error_class(parsed_response, operation_name)
        916         else:
        917             return parsed_response

    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.".

I skimmed through the logs via CloudWatch and am still struggling with this. I need help.
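(For anyone else digging through the container logs: SageMaker endpoints write to the CloudWatch log group `/aws/sagemaker/Endpoints/<endpoint-name>`. A minimal sketch of pulling recent events with boto3, with the endpoint name as an assumption:)

    import boto3

    ENDPOINT_NAME = "my-pytorch-endpoint"  # hypothetical; replace with the deployed endpoint's name
    LOG_GROUP = f"/aws/sagemaker/Endpoints/{ENDPOINT_NAME}"

    logs = boto3.client("logs")

    # Print the latest events from every log stream (one per container instance).
    for stream in logs.describe_log_streams(logGroupName=LOG_GROUP)["logStreams"]:
        events = logs.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            limit=100,
        )["events"]
        for event in events:
            print(event["message"])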

Error logs


timestamp message logStreamName
1661327528194 2022-08-24 07:52:07,987 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager… AllTraffic/i-0b6f78248b097b6c7
1661327528194 2022-08-24 07:52:08,112 [INFO ] main org.pytorch.serve.ModelServer - AllTraffic/i-0b6f78248b097b6c7
1661327528194 Torchserve version: 0.4.2 AllTraffic/i-0b6f78248b097b6c7
1661327528194 TS Home: /opt/conda/lib/python3.8/site-packages AllTraffic/i-0b6f78248b097b6c7
1661327528194 Current directory: / AllTraffic/i-0b6f78248b097b6c7
1661327528194 Temp directory: /home/model-server/tmp AllTraffic/i-0b6f78248b097b6c7
1661327528194 Number of GPUs: 1 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Number of CPUs: 1 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Max heap size: 3234 M AllTraffic/i-0b6f78248b097b6c7
1661327528194 Python executable: /opt/conda/bin/python3.8 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Config file: /etc/sagemaker-ts.properties AllTraffic/i-0b6f78248b097b6c7
1661327528194 Inference address: http://0.0.0.0:8080 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Management address: http://0.0.0.0:8080 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Metrics address: http://127.0.0.1:8082 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Model Store: /.sagemaker/ts/models AllTraffic/i-0b6f78248b097b6c7
1661327528194 Initial Models: model.mar AllTraffic/i-0b6f78248b097b6c7
1661327528194 Log dir: /logs AllTraffic/i-0b6f78248b097b6c7
1661327528194 Metrics dir: /logs AllTraffic/i-0b6f78248b097b6c7
1661327528194 Netty threads: 0 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Netty client threads: 0 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Default workers per model: 1 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Blacklist Regex: N/A AllTraffic/i-0b6f78248b097b6c7
1661327528194 Maximum Response Size: 6553500 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Maximum Request Size: 6553500 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Prefer direct buffer: false AllTraffic/i-0b6f78248b097b6c7
1661327528194 Allowed Urls: [file://.* http(s)?://.*]
1661327528194 Custom python dependency for model allowed: false AllTraffic/i-0b6f78248b097b6c7
1661327528194 Metrics report format: prometheus AllTraffic/i-0b6f78248b097b6c7
1661327528194 Enable metrics API: true AllTraffic/i-0b6f78248b097b6c7
1661327528194 Workflow Store: /.sagemaker/ts/models AllTraffic/i-0b6f78248b097b6c7
1661327528194 Model config: N/A AllTraffic/i-0b6f78248b097b6c7
1661327528194 2022-08-24 07:52:08,120 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin… AllTraffic/i-0b6f78248b097b6c7
1661327528444 2022-08-24 07:52:08,149 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar AllTraffic/i-0b6f78248b097b6c7
1661327528444 2022-08-24 07:52:08,353 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded. AllTraffic/i-0b6f78248b097b6c7
1661327528694 2022-08-24 07:52:08,370 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel. AllTraffic/i-0b6f78248b097b6c7
1661327528694 2022-08-24 07:52:08,472 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080 AllTraffic/i-0b6f78248b097b6c7
1661327528694 2022-08-24 07:52:08,473 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel. AllTraffic/i-0b6f78248b097b6c7
1661327528944 2022-08-24 07:52:08,474 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082 AllTraffic/i-0b6f78248b097b6c7
1661327528944 Model server started. AllTraffic/i-0b6f78248b097b6c7
1661327528944 2022-08-24 07:52:08,738 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet. AllTraffic/i-0b6f78248b097b6c7
1661327528944 2022-08-24 07:52:08,786 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0 #Level:Host
1661327528944 2022-08-24 07:52:08,787 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:24.598094940185547 #Level:Host
1661327528944 2022-08-24 07:52:08,788 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:27.390167236328125 #Level:Host
1661327528944 2022-08-24 07:52:08,788 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:52.7 #Level:Host
1661327528944 2022-08-24 07:52:08,788 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:14186.97265625 #Level:Host
1661327528944 2022-08-24 07:52:08,789 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:1227.640625 #Level:Host
1661327529195 2022-08-24 07:52:08,789 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:9.9 #Level:Host
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]32 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.8.10 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,011 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,021 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,064 [INFO ] W-9000-model_1-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,605 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,605 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,606 [INFO ] W-9000-model_1-stdout MODEL_LOG - File “/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py”, line 183, in <module> AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,606 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,606 [INFO ] W-9000-model_1-stdout MODEL_LOG - File “/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py”, line 155, in run_server AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,607 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,607 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket) AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,608 [INFO ] W-9000-model_1-stdout MODEL_LOG - File “/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py”, line 117, in handle_connection AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,608 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,608 [INFO ] W-9000-model_1-stdout MODEL_LOG - service, result, code = self.load_model(msg) AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,609 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,609 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,610 [INFO ] W-9000-model_1-stdout MODEL_LOG - File “/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py”, line 90, in load_model AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,610 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,610 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:09,628 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,192 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,193 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]52 AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,193 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,193 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,194 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.8.10 AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,195 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,212 [INFO ] W-9000-model_1-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,368 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,368 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,368 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,369 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [INFO ] W-9000-model_1-stdout MODEL_LOG - File “/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py”, line 183, in <module> AllTraffic/i-0b6f78248b097b6c7
1661327531696 2022-08-24 07:52:11,372 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327531696 2022-08-24 07:52:11,665 [INFO ] W-9000-model_1 ACCESS_LOG - /169.254.178.2:35288 “GET /ping HTTP/1.1” 200 15 AllTraffic/i-0b6f78248b097b6c7
1661327531696 2022-08-24 07:52:11,666 [INFO ] W-9000-model_1 TS_METRICS - Requests2XX.Count:1 #Level:Host
1661327532947 2022-08-24 07:52:11,673 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]65 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.8.10 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,893 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:12,894 [INFO ] W-9000-model_1-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,026 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,026 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,027 [INFO ] W-9000-model_1-stdout MODEL_LOG - File “/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py”, line 183, in <module> AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,027 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0b6f78248b097b6c7

Installation instructions

I am using SageMaker.

Model Packaging

    from sagemaker.pytorch import PyTorchModel

    pytorch_model = PyTorchModel(
        model_data="model.tar.gz",
        role=role,
        entry_point="inference.py",
        framework_version="1.9.0",
        py_version="py38",
    )
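(The follow-up comments below mention that restructuring model.tar.gz improved things, so for reference, here is a hedged sketch of packaging the archive in the layout the SageMaker PyTorch inference toolkit generally expects: weights at the archive root and the handler under code/. The local file names are assumptions based on the snippets in this report:)

    import tarfile

    # Assumed layout inside model.tar.gz:
    #   model.pt            <- weights loaded by model_fn
    #   code/inference.py   <- the entry_point handler
    with tarfile.open("model.tar.gz", "w:gz") as tar:
        tar.add("model.pt", arcname="model.pt")
        tar.add("inference.py", arcname="code/inference.py")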

config.properties

No response

Versions

- framework_version="1.9.0", py_version="py38"
- TorchServe version: 0.4.2
- Working on a conda_pytorch_p38 SageMaker notebook instance
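(If it helps, the exact versions in the notebook kernel can be confirmed with a couple of lines; the comments note what the values correspond to, based on this report:)

    import torch
    import sagemaker

    print(torch.__version__)      # PyTorch version in the notebook kernel (the endpoint container uses framework_version="1.9.0")
    print(sagemaker.__version__)  # SageMaker Python SDK version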

Repro instructions

The inference file that I wrote:

    import io
    import os
    from io import BytesIO

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from PIL import Image


    class ConvNormLReLU(nn.Sequential):
        def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, pad_mode="reflect", groups=1, bias=False):
            pad_layer = {
                "zero":    nn.ZeroPad2d,
                "same":    nn.ReplicationPad2d,
                "reflect": nn.ReflectionPad2d,
            }
            if pad_mode not in pad_layer:
                raise NotImplementedError

            super(ConvNormLReLU, self).__init__(
                pad_layer[pad_mode](padding),
                nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, stride=stride, padding=0, groups=groups, bias=bias),
                nn.GroupNorm(num_groups=1, num_channels=out_ch, affine=True),
                nn.LeakyReLU(0.2, inplace=True)
            )


    class InvertedResBlock(nn.Module):
        def __init__(self, in_ch, out_ch, expansion_ratio=2):
            super(InvertedResBlock, self).__init__()

            self.use_res_connect = in_ch == out_ch
            bottleneck = int(round(in_ch * expansion_ratio))
            layers = []
            if expansion_ratio != 1:
                layers.append(ConvNormLReLU(in_ch, bottleneck, kernel_size=1, padding=0))

            # dw
            layers.append(ConvNormLReLU(bottleneck, bottleneck, groups=bottleneck, bias=True))
            # pw
            layers.append(nn.Conv2d(bottleneck, out_ch, kernel_size=1, padding=0, bias=False))
            layers.append(nn.GroupNorm(num_groups=1, num_channels=out_ch, affine=True))

            self.layers = nn.Sequential(*layers)

        def forward(self, input):
            out = self.layers(input)
            if self.use_res_connect:
                out = input + out
            return out


    class Generator(nn.Module):
        def __init__(self):
            super().__init__()

            self.block_a = nn.Sequential(
                ConvNormLReLU(3,  32, kernel_size=7, padding=3),
                ConvNormLReLU(32, 64, stride=2, padding=(0, 1, 0, 1)),
                ConvNormLReLU(64, 64)
            )

            self.block_b = nn.Sequential(
                ConvNormLReLU(64,  128, stride=2, padding=(0, 1, 0, 1)),
                ConvNormLReLU(128, 128)
            )

            self.block_c = nn.Sequential(
                ConvNormLReLU(128, 128),
                InvertedResBlock(128, 256, 2),
                InvertedResBlock(256, 256, 2),
                InvertedResBlock(256, 256, 2),
                InvertedResBlock(256, 256, 2),
                ConvNormLReLU(256, 128),
            )

            self.block_d = nn.Sequential(
                ConvNormLReLU(128, 128),
                ConvNormLReLU(128, 128)
            )

            self.block_e = nn.Sequential(
                ConvNormLReLU(128, 64),
                ConvNormLReLU(64,  64),
                ConvNormLReLU(64,  32, kernel_size=7, padding=3)
            )

            self.out_layer = nn.Sequential(
                nn.Conv2d(32, 3, kernel_size=1, stride=1, padding=0, bias=False),
                nn.Tanh()
            )

        def forward(self, input, align_corners=True):
            out = self.block_a(input)
            half_size = out.size()[-2:]
            out = self.block_b(out)
            out = self.block_c(out)

            if align_corners:
                out = F.interpolate(out, half_size, mode="bilinear", align_corners=True)
            else:
                out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
            out = self.block_d(out)

            if align_corners:
                out = F.interpolate(out, input.size()[-2:], mode="bilinear", align_corners=True)
            else:
                out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
            out = self.block_e(out)

            out = self.out_layer(out)
            return out


    def model_fn(model_dir):
        """Load the model and return it.

        Providing this function is optional. There is a default_model_fn available,
        which will load the model compiled using SageMaker Neo. You can override
        the default here. The model_fn only needs to be defined if your model needs
        extra steps to load, and can otherwise be left undefined.

        Keyword arguments:
        model_dir -- the directory path where the model artifacts are present
        """
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # The compiled model is saved as "model.pt"
        model = Generator()
        model_path = os.path.join(model_dir, 'model.pt')
        with open(model_path, 'rb') as f:  # model_path already points at model.pt
            model.load_state_dict(torch.load(f))

        model.to(device).eval()

        return model


    def transform_fn(model, request_body, request_content_type='image/*', response_content_type='image/*'):
        """Run prediction and return the output.

        The function
        1. Pre-processes the input request
        2. Runs prediction
        3. Post-processes the prediction output.
        """
        image_format = "png"  # @param ["jpeg", "png"]

        # preprocess
        img_in = Image.open(io.BytesIO(request_body)).convert("RGB")

        # predict
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        im_out = model(img_in)
        buffer_out = BytesIO()
        im_out.save(buffer_out, format=image_format)
        out = buffer_out.getvalue()

        return out, response_content_type
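(Not part of the original report, but a quick way to surface handler errors like this before deploying is to call model_fn and transform_fn directly against the unpacked artifacts. A minimal sketch, assuming the file above is saved as inference.py and the extracted model.tar.gz contents live in ./model:)

    # local_smoke_test.py (hypothetical)
    import inference

    # ./model is assumed to hold the same files as the extracted model.tar.gz
    model = inference.model_fn("./model")

    with open("./samples/inputs/1.jpg", "rb") as f:
        body = f.read()

    out, content_type = inference.transform_fn(model, body, "image/*", "image/*")
    print(content_type, len(out))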

Possible Solution

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 14

Top GitHub Comments

1 reaction
LiJell commented, Aug 30, 2022

> @LiJell For 1.9 it is not clear from the logs why it's failing to load the model; I wonder if there is any further pointer in the log traces to show the exact point where it fails. I am guessing it can be some path issue. Are you following this doc, and does your model artifact live in an S3 bucket?

> It seems that with 1.11 it is failing on importing nvgpu ("packages/ts/metrics/system_metrics.py", line 61, in gpu_utilization: import nvgpu). I think we should have updated the Docker containers for the "nvgpu" issue; the workaround is to use a custom container (here is an example) or install nvgpu in your script before importing it. For the Docker nvgpu issue, cc @lxning.
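(For reference, a minimal sketch of the "install nvgpu in your script before importing it" workaround quoted above; whether installing packages at container startup is acceptable in your setup is an assumption:)

    # At the top of inference.py: make sure nvgpu exists before TorchServe's metrics code imports it.
    import subprocess
    import sys

    try:
        import nvgpu  # noqa: F401
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "nvgpu"])
        import nvgpu  # noqa: F401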

Hi @HamidShojanazeri, there was an improvement after restructuring the model.tar.gz file, so maybe there was a mistake in it. However, there are still errors. I am trying a couple of things to resolve the errors and warnings, and I will share the logs again really soon.

By the way, the nvgpu error still comes up depending on the framework version. When I use 1.9.0 it looks fine with nvgpu.

Thank you!!

0 reactions
LiJell commented, Sep 14, 2022

> @LiJell TorchServe doesn't own the SageMaker Docker container. Please file a ticket with AWS SageMaker if you are using TorchServe via SageMaker.

Okay!! Thank you for your help!! I will ask this question in the right place. Thank you again!!

Read more comments on GitHub >
