
Decreasing autograd memory usage


I don’t mean “memory leak” in the sense of unreachable memory after the Python process exits; I mean memory that is allocated during the backward pass when it should be freed. I mentioned this problem in #199, but I think it deserves its own issue.

For a simple function

import autograd.numpy as np
from autograd import grad

def F(x,z):
    for i in range(100):
        z = np.dot(x,z)
    return np.sum(z)
dF = grad(F)

and a procedure to measure memory usage

from memory_profiler import memory_usage
def make_data():
    np.random.seed(0)
    D = 1000
    x = np.random.randn(D,D)
    x = np.dot(x,x.T)
    z = np.random.randn(D,D)
    return x,z

def m():
    from time import sleep
    x,z = make_data()
    gx = dF(x,z)
    sleep(0.1)
    return gx

mem_usage = np.array(memory_usage(m,interval=0.01))
mem_usage -= mem_usage[0]

and a manual gradient of the same function

def grad_dot_A(g,A,B):
    ga = np.dot(g,B.T)
    ga = np.reshape(ga,np.shape(A))
    return ga

def grad_dot_B(g,A,B):
    gb = np.dot(A.T,g)
    gb = np.reshape(gb, np.shape(B))
    return gb

def repeat_to_match_shape(g, x):
    # Minimal local helper: the gradient of np.sum broadcasts the scalar
    # upstream gradient to the shape of its input.
    return g * np.ones(np.shape(x))

def dF(x, z):
    z_stack = []
    for i in list(range(100)):
        z_stack.append(z)
        z = np.dot(x, z)
    retval = np.sum(z)

    # Begin Backward Pass
    g_retval = 1
    g_x = 0

    # Reverse of: retval = np.sum(z)
    g_z = repeat_to_match_shape(g_retval, z)
    for i in reversed(list(range(100))):

        # Reverse of: z = np.dot(x, z)
        z = z_stack.pop()
        tmp_g0 = grad_dot_A(g_z, x, z)
        tmp_g1 = grad_dot_B(g_z, x, z)
        g_z = 0
        g_x += tmp_g0
        g_z += tmp_g1
    return g_x

I get the following memory usage profile:

[Figure: memory usage profile while evaluating dF(x, z)]
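The original plot is not reproduced here. As a minimal sketch for regenerating a similar profile (assuming matplotlib is installed; the axis labels are my own):

import matplotlib.pyplot as plt

# Visualize the profile collected by memory_usage above.
t = 0.01 * np.arange(len(mem_usage))   # samples taken roughly every 0.01 s
plt.plot(t, mem_usage)
plt.xlabel("time (s)")
plt.ylabel("memory above baseline (MiB)")
plt.title("Memory usage while evaluating dF(x, z)")
plt.show()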

If I replace autograd’s dot gradients with the ones used in the manual code (a sketch of that replacement follows), I get the same memory profile; nothing improves.
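A sketch of that replacement, assuming autograd’s current extension API (defvjp from autograd.extend, which postdates the original discussion); grad_dot_A and grad_dot_B are the helpers defined above:

from autograd.extend import defvjp

# Re-register np.dot's VJPs with the manual gradients above.
defvjp(np.dot,
       lambda ans, A, B: lambda g: grad_dot_A(g, A, B),  # VJP w.r.t. the first argument
       lambda ans, A, B: lambda g: grad_dot_B(g, A, B))  # VJP w.r.t. the second argument

# dF = grad(F) picks up the replacement VJPs the next time it is called.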

If I replace the dot product with an element-wise multiply (sketched below), I get a different memory profile, but still not what I would expect:

[Figure: memory usage profile with element-wise multiply]
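For reference, the element-wise variant could look like the following sketch (F_mul, dF_mul and m_mul are illustrative names, not taken from the original code), profiled with the same harness:

def F_mul(x, z):
    # Same loop as F, but with element-wise multiplication instead of np.dot.
    for i in range(100):
        z = x * z
    return np.sum(z)

dF_mul = grad(F_mul)

def m_mul():
    from time import sleep
    x, z = make_data()
    g = dF_mul(x, z)
    sleep(0.1)
    return g

mem_usage_mul = np.array(memory_usage(m_mul, interval=0.01))
mem_usage_mul -= mem_usage_mul[0]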

I would love to help figure this out, but I’m not sure where to start. The first thing, of course, is to document the problem.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 21 (20 by maintainers)

Top GitHub Comments

1 reaction
j-towns commented, May 9, 2017

On the issue of float32 (/float16) support, I believe that arrays are generally already handled correctly by most of the vjps; it’s only when primitives are applied to scalar values that things get messy.

This is a reflection of numpy’s type system, which respects the dtype of arrays and seems to largely ignore the dtype of scalar values (even if they are wrapped as an ndarray with shape ()).

I think this needs to be thoroughly tested, but assuming the above is correct, we could say in the docs that float32 is supported, but only for arrays.

In [1]: from autograd import grad

In [2]: import autograd.numpy as np

In [3]: def f(x):
   ...:     return x**2
   ...:

In [4]: grad(f)(np.array([3], dtype=np.float32))
Out[4]: array([ 6.], dtype=float32)

In [5]: grad(f)(np.array(3, dtype=np.float32))
AssertionError:
Grad of power returned unexpected vector space
Vector space is ArrayVSpace_{'shape': (), 'size': 1, 'dtype': dtype('float64'), 'scalartype': <class 'float'>}
Expected        ArrayVSpace_{'shape': (), 'size': 1, 'dtype': dtype('float32'), 'scalartype': <class 'float'>}
1 reaction
alexbw commented, May 4, 2017

Here’s a gist on a more “real-world” benchmark, taking the gradients of an RNN.

https://gist.github.com/alexbw/7b2a0682f65dd1bcb7120ca2d47a2823

Here’s the memory usage:

[Figure: memory usage of the RNN gradient benchmark]
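The gist itself is not reproduced here. As a rough, hypothetical stand-in for the kind of benchmark it describes (the architecture, names, and sizes below are assumptions, not taken from the gist), a small tanh RNN whose gradient is taken with autograd could be profiled with memory_usage in the same way as above:

import autograd.numpy as np
from autograd import grad

def rnn_loss(params, inputs):
    # Plain tanh RNN; params is (recurrent weights W, input weights U).
    W, U = params
    h = np.zeros(W.shape[0])
    for x in inputs:
        h = np.tanh(np.dot(W, h) + np.dot(U, x))
    return np.sum(h)

grad_rnn = grad(rnn_loss)  # gradient with respect to the params tuple

H, D, T = 128, 64, 200
params = (0.01 * np.random.randn(H, H), 0.01 * np.random.randn(H, D))
inputs = [np.random.randn(D) for _ in range(T)]
g_W, g_U = grad_rnn(params, inputs)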
