HER, out of memory using 3 or more CPUs and one GPU
Running the following HER command on my machine (Ubuntu 16.04, TensorFlow 1.5.0, one Titan X GPU, Python 3.5.2, latest version of baselines as of today, etc.) seems to work:
(py3-tensorflow) daniel@computer-name:~/baselines$ python -m baselines.her.experiment.train --num_cpu 2
2018-03-11 10:42:00.828727: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:42:00.833988: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:42:01.035000: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:42:01.035688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.72GiB
2018-03-11 10:42:01.035702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:42:01.036552: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:42:01.036967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.60GiB
2018-03-11 10:42:01.036979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
Logging to /tmp/openai-2018-03-11-10-42-01-211699
Logging to /tmp/openai-2018-03-11-10-42-01-238422
After this, the statistics and logs are reported; they look sensible and indicate improving performance.
While that run was in progress, I noticed that nvidia-smi shows two python processes, but one uses far more GPU memory than the other:
(py3-tensorflow) daniel@computer-name:~/baselines$ nvidia-smi
Sun Mar 11 10:43:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:01:00.0 On | N/A |
| 29% 52C P2 74W / 250W | 12035MiB / 12194MiB | 34% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15162 G /usr/lib/xorg/Xorg 625MiB |
| 0 15613 G compiz 371MiB |
| 0 16006 G /usr/lib/firefox/firefox 2MiB |
| 0 16308 C ...l/seita-venvs/py3-tensorflow/bin/python 547MiB |
| 0 16309 C ...l/seita-venvs/py3-tensorflow/bin/python 10449MiB |
| 0 18716 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+
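As far as I understand it, this is expected TensorFlow 1.x behaviour: by default each tf.Session() reserves nearly all of the remaining free GPU memory for its process, which is why the second worker above grabbed ~10.4 GiB. A minimal sketch (not baselines code, just an illustration) of how allow_growth switches this to on-demand allocation:

import tensorflow as tf

# Default: a plain tf.Session() reserves almost all free GPU memory up front.
# sess = tf.Session()

# With allow_growth, memory is allocated only as tensors are actually placed.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
print(sess.run(tf.constant(1.0)))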
However, the same command with 3 CPUs hits a CUDA out-of-memory error:
(py3-tensorflow) daniel@computer-name:~/baselines$ python -m baselines.her.experiment.train --num_cpu 3
2018-03-11 10:43:55.864451: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:43:55.872111: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:43:55.872111: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:43:56.149303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:43:56.149754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.68GiB
2018-03-11 10:43:56.149813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:43:56.153272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:43:56.153800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.44GiB
2018-03-11 10:43:56.153829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:43:56.153863: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:43:56.154212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.44GiB
2018-03-11 10:43:56.154239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:43:56.333249: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 224.44M (235339776 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Logging to /tmp/openai-2018-03-11-10-43-56-333366
Logging to /tmp/openai-2018-03-11-10-43-56-335464
Logging to /tmp/openai-2018-03-11-10-43-56-358276
The out-of-memory error then causes the program to abort.
I naively assumed that I could fix this by adjusting the ddpg.py file in HER:
def _create_network(self, reuse=False):
    logger.info("Creating a DDPG agent with action space %d x %s..." % (self.dimu, self.max_u))
    # self.sess = tf.get_default_session()
    # Add these instead of the default session
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.2
    self.sess = tf.Session(config=config)
    if self.sess is None:
        self.sess = tf.InteractiveSession()
Unfortunately, this does not seem to work due to uninitialized variables. (I can post the full error message if it helps.)
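For reference, here is a minimal sketch of the kind of change I was attempting, with the explicit initializer run that my patch above omits; where exactly this would need to hook into ddpg.py is a guess on my part:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2  # cap this process at ~20% of the GPU
config.gpu_options.allow_growth = True                    # and allocate only on demand

sess = tf.InteractiveSession(config=config)  # becomes the default session

# ... build the actor/critic graph here ...

# Explicitly initialize variables in this session; skipping this step is one
# way to end up with the uninitialized-variables error described above.
sess.run(tf.global_variables_initializer())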
The closest existing issue seems to be https://github.com/openai/baselines/issues/70, where @olegklimov suggests that “it’s [PPO] supposed to use the same GPU from several MPI workers. More that each MPI should use its own GPU on multi-GPU machine or multi-machine MPI.” However, I only have one GPU on this machine, and I’m not sure whether there are subtle differences between the PPO and HER implementations.
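If several MPI workers are indeed meant to share a single GPU, I imagine something like the following per-rank memory split could work; mpi4py and the fraction formula here are my own assumptions, not what baselines.her actually does:

from mpi4py import MPI
import tensorflow as tf

num_workers = MPI.COMM_WORLD.Get_size()

config = tf.ConfigProto()
# Leave some headroom for other processes (Xorg, compiz, ...) that also use the GPU.
config.gpu_options.per_process_gpu_memory_fraction = 0.9 / max(num_workers, 1)
sess = tf.Session(config=config)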
Any advice would be appreciated. Thanks!
Top GitHub Comments
Thanks for the information. It might be useful to add to the HER README the machine specs that OpenAI uses to run these commands.
I’ve updated the HER README. We used D15v2 instances on Azure for all experiments.
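Follow-up note: D15v2 instances are CPU-only, so one way to mirror that setup on a single-GPU machine is to hide the GPU from TensorFlow entirely by clearing CUDA_VISIBLE_DEVICES before training starts. A minimal sketch, assuming the variable is set before TensorFlow is imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide the GPU; must happen before importing tensorflow

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())  # only CPU devices should be listed now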