HER, out of memory using 3 or more CPUs and one GPU
Running the following HER command on my machine (Ubuntu 16.04, TensorFlow 1.5.0, one Titan X GPU, Python 3.5.2, latest version of baselines as of today, etc.) seems to work:
(py3-tensorflow) daniel@computer-name:~/baselines$ python -m baselines.her.experiment.train --num_cpu 2
2018-03-11 10:42:00.828727: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:42:00.833988: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:42:01.035000: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:42:01.035688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.72GiB
2018-03-11 10:42:01.035702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:42:01.036552: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:42:01.036967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.60GiB
2018-03-11 10:42:01.036979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
Logging to /tmp/openai-2018-03-11-10-42-01-211699
Logging to /tmp/openai-2018-03-11-10-42-01-238422
After this, the statistics and logs are reported; they look sensible and indicate improving performance.
While that run was in progress, I noticed that nvidia-smi shows two python processes, but one uses far more GPU memory than the other:
(py3-tensorflow) daniel@computer-name:~/baselines$ nvidia-smi
Sun Mar 11 10:43:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:01:00.0 On | N/A |
| 29% 52C P2 74W / 250W | 12035MiB / 12194MiB | 34% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15162 G /usr/lib/xorg/Xorg 625MiB |
| 0 15613 G compiz 371MiB |
| 0 16006 G /usr/lib/firefox/firefox 2MiB |
| 0 16308 C ...l/seita-venvs/py3-tensorflow/bin/python 547MiB |
| 0 16309 C ...l/seita-venvs/py3-tensorflow/bin/python 10449MiB |
| 0 18716 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+
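As far as I understand it, this is expected TensorFlow 1.x behaviour: by default each tf.Session() reserves nearly all of the remaining free GPU memory for its process, which is why the second worker above grabbed ~10.4 GiB. A minimal sketch (not baselines code, just an illustration) of how allow_growth switches this to on-demand allocation:

import tensorflow as tf

# Default: a plain tf.Session() reserves almost all free GPU memory up front.
# sess = tf.Session()

# With allow_growth, memory is allocated only as tensors are actually placed.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
print(sess.run(tf.constant(1.0)))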
However, the same command with 3 CPUs hits a CUDA out-of-memory error:
(py3-tensorflow) daniel@computer-name:~/baselines$ python -m baselines.her.experiment.train --num_cpu 3
2018-03-11 10:43:55.864451: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:43:55.872111: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:43:55.872111: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-11 10:43:56.149303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:43:56.149754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.68GiB
2018-03-11 10:43:56.149813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:43:56.153272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:43:56.153800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.44GiB
2018-03-11 10:43:56.153829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:43:56.153863: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-11 10:43:56.154212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 10.44GiB
2018-03-11 10:43:56.154239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-03-11 10:43:56.333249: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 224.44M (235339776 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Logging to /tmp/openai-2018-03-11-10-43-56-333366
Logging to /tmp/openai-2018-03-11-10-43-56-335464
Logging to /tmp/openai-2018-03-11-10-43-56-358276
The out-of-memory error then causes the program to abort.
I naively assumed that I could fix this by adjusting the ddpg.py file in HER:
def _create_network(self, reuse=False):
    logger.info("Creating a DDPG agent with action space %d x %s..." % (self.dimu, self.max_u))
    # self.sess = tf.get_default_session()
    # Add these instead of the default session
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.2
    self.sess = tf.Session(config=config)
    if self.sess is None:
        self.sess = tf.InteractiveSession()
Unfortunately, this does not seem to work due to uninitialized variables. (I can post the full error message if it helps.)
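For reference, here is a minimal sketch of the kind of change I was attempting, with the explicit initializer run that my patch above omits; where exactly this would need to hook into ddpg.py is a guess on my part:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2  # cap this process at ~20% of the GPU
config.gpu_options.allow_growth = True                    # and allocate only on demand

sess = tf.InteractiveSession(config=config)  # becomes the default session

# ... build the actor/critic graph here ...

# Explicitly initialize variables in this session; skipping this step is one
# way to end up with the uninitialized-variables error described above.
sess.run(tf.global_variables_initializer())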
The closest existing issue seems to be https://github.com/openai/baselines/issues/70, where @olegklimov suggests that “it’s [PPO] supposed to use the same GPU from several MPI workers. More that each MPI should use its own GPU on multi-GPU machine or multi-machine MPI.” However, I only have one GPU on this machine, and I’m not sure whether there are subtle differences between the PPO and HER implementations.
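If several MPI workers are indeed meant to share a single GPU, I imagine something like the following per-rank memory split could work; mpi4py and the fraction formula here are my own assumptions, not what baselines.her actually does:

from mpi4py import MPI
import tensorflow as tf

num_workers = MPI.COMM_WORLD.Get_size()

config = tf.ConfigProto()
# Leave some headroom for other processes (Xorg, compiz, ...) that also use the GPU.
config.gpu_options.per_process_gpu_memory_fraction = 0.9 / max(num_workers, 1)
sess = tf.Session(config=config)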
Any advice would be appreciated. Thanks!
Top GitHub Comments
Thanks for the information. It might be useful to add to the HER README the machine specs that OpenAI uses to run these commands.
I’ve updated the HER README. We used D15v2 instances on Azure for all experiments.
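Follow-up note: D15v2 instances are CPU-only, so one way to mirror that setup on a single-GPU machine is to hide the GPU from TensorFlow entirely by clearing CUDA_VISIBLE_DEVICES before training starts. A minimal sketch, assuming the variable is set before TensorFlow is imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide the GPU; must happen before importing tensorflow

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())  # only CPU devices should be listed now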