TorchServe fails to start multiple worker threads on multiple GPUs with a large model.

See original GitHub issue

On a c5.12xlarge instance, I was able to run 16 instances of the FairSeq English-to-German translation model, all simultaneously running translations. This model’s weights take up about 2.5GB on disk (though its resident footprint in memory seems smaller).

Attempting a similar feat on a p3.8xlarge turned out to be impossible. I could get a single instance running, but as soon as I attempted to start even 4 workers, they crashed repeatedly with out-of-memory errors:

2020-02-26 02:46:34,454 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-02-26 02:46:34,454 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.6
2020-02-26 02:46:34,454 [DEBUG] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fairseq_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-02-26 02:46:34,454 [INFO ] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-02-26 02:46:34,455 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process die.
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 163, in <module>
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 141, in run_server
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 105, in handle_connection
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 83, in load_model
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_loader.py", line 107, in load
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     entry_point(None, service.context)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/tmp/models/fa2ad7fda70376da33595966d0cf3c38702ea6d1/fairseq_handler.py", line 120, in handle
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     raise e
2020-02-26 02:46:38,735 [INFO ] epollEventLoopGroup-4-15 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_STARTED
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/tmp/models/fa2ad7fda70376da33595966d0cf3c38702ea6d1/fairseq_handler.py", line 109, in handle
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     _service.initialize(context)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/tmp/models/fa2ad7fda70376da33595966d0cf3c38702ea6d1/fairseq_handler.py", line 73, in initialize
2020-02-26 02:46:38,735 [DEBUG] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     state_dict = torch.load(model_pt_path, map_location=self.device)
2020-02-26 02:46:38,735 [WARN ] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: fairseq_model, error: Worker died.
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 529, in load
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
2020-02-26 02:46:38,735 [DEBUG] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fairseq_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 702, in _legacy_load
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     result = unpickler.load()
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 665, in persistent_load
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     deserialized_objects[root_key] = restore_location(obj, location)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 21 seconds.
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 740, in restore_location
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return default_restore_location(storage, str(map_location))
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 156, in default_restore_location
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     result = fn(storage, location)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 136, in _cuda_deserialize
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return storage_type(obj.size())
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)

On digging through the logs, it appears that TorchServe is attempting to start all workers on the same GPU. The following is the output of grep GPU ts_log.log:

2020-02-26 02:45:13,375 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 3.03 GiB already allocated; 19.12 MiB free; 3.03 GiB reserved in total by PyTorch)
2020-02-26 02:45:13,375 [INFO ] W-9000-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 3.16 GiB already allocated; 19.12 MiB free; 3.16 GiB reserved in total by PyTorch)
2020-02-26 02:45:13,383 [INFO ] W-9003-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 2.60 GiB already allocated; 19.12 MiB free; 2.60 GiB reserved in total by PyTorch)
2020-02-26 02:45:26,382 [INFO ] W-9003-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 15.75 GiB total capacity; 3.05 GiB already allocated; 19.12 MiB free; 3.05 GiB reserved in total by PyTorch)
2020-02-26 02:45:26,519 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 15.75 GiB total capacity; 2.47 GiB already allocated; 19.12 MiB free; 2.47 GiB reserved in total by PyTorch)
2020-02-26 02:45:38,899 [INFO ] W-9003-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 2.89 GiB already allocated; 7.12 MiB free; 2.89 GiB reserved in total by PyTorch)
2020-02-26 02:45:38,904 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 2.65 GiB already allocated; 7.12 MiB free; 2.65 GiB reserved in total by PyTorch)
2020-02-26 02:45:52,201 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 15.75 GiB total capacity; 3.05 GiB already allocated; 19.12 MiB free; 3.05 GiB reserved in total by PyTorch)
2020-02-26 02:45:59,593 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)
2020-02-26 02:46:08,962 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)
2020-02-26 02:46:21,358 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)

Note that multiple workers (W-9000, W-9001, W-9003) are shown, but only one GPU turns up (GPU 0). The p3.8xlarge has 4 GPUs.

I attempted to fix this with Management API arguments such as number_gpu=4, but nothing worked; I got the same result every time.
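
For illustration, a scale-workers request along these lines might look like the following sketch, assuming the default Management API on localhost:8081 and the fairseq_model name from the logs; min_worker and synchronous are standard scale-workers parameters, and number_gpu is the argument mentioned above.

import requests

# Sketch of a scale-workers call against the TorchServe Management API.
resp = requests.put(
    "http://localhost:8081/models/fairseq_model",
    params={"min_worker": 4, "number_gpu": 4, "synchronous": "true"},
)
print(resp.status_code, resp.text)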

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

1 reaction
harshbafna commented, Mar 4, 2020

@chauhang: It is context.system_properties, as indicated in the previous comment. We have already updated the custom handler example (MNIST digit classifier) in the “stage_release” branch:

https://github.com/pytorch/serve/blob/stage_release/examples/image_classifier/mnist/mnist_handler.py

All the default handlers have been updated to use the same approach as well.
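
For reference, the pattern referred to above reads the per-worker GPU id from context.system_properties and builds the torch device from it, so each worker loads its copy of the model onto its own GPU instead of cuda:0. A minimal sketch, with an illustrative class name and a hypothetical checkpoint file name:

import os

import torch


class FairseqHandler:
    """Minimal sketch of the GPU-selection part of a custom handler's initialize()."""

    def initialize(self, context):
        # TorchServe assigns each worker a GPU id and exposes it via
        # context.system_properties.
        properties = context.system_properties
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available()
            else "cpu"
        )
        # Map the checkpoint onto the assigned device when loading;
        # "model.pt" is a hypothetical file name inside the model directory.
        model_pt_path = os.path.join(properties.get("model_dir"), "model.pt")
        state_dict = torch.load(model_pt_path, map_location=self.device)
        # ... build the model from state_dict and move it to self.device ...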

0 reactions
harshbafna commented, Apr 27, 2020

@okgrammer, you can supply all the other dependency files with the --extra-files parameter, passed as a comma-separated list, when creating the model archive.

You can refer to the image_segmenter example, which supplies multiple .py files via the --extra-files parameter when creating the model archive.
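
For illustration, a model-archiver invocation with --extra-files might look like the following sketch; fairseq_handler.py matches the handler in the logs above, while the checkpoint and extra file names are hypothetical placeholders.

import subprocess

# Sketch: build a model archive that bundles extra dependency files
# alongside the handler via --extra-files (comma-separated list).
subprocess.run(
    [
        "torch-model-archiver",
        "--model-name", "fairseq_model",
        "--version", "1.0",
        "--serialized-file", "model.pt",             # hypothetical checkpoint name
        "--handler", "fairseq_handler.py",
        "--extra-files", "dictionary.txt,utils.py",  # hypothetical extra files
    ],
    check=True,
)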

