Evaluating a derivative elementwise
Maybe I'm using the library wrong, but I wonder whether what I'm observing is a bug or intended behavior. I'm using the latest GitHub master version.

Say I have a scalar function of two variables x and y, but I want to retain the option to pass in numpy arrays and get the values/derivatives elementwise. I tried the following:
```python
import numpy as np
from autograd import grad

f = lambda x, y: (x + y)**2
fgrad = grad(f, 0)

x = np.linspace(0, 1, 11)
fgrad(x, 0.0)
# array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])
```
Looks correct: it computes `2*x`, as it should. But strange things happen if I want to evaluate the derivative at several points lying along the y axis:

```python
fgrad(0.0, x)
# 11.000000000000002
```

I would expect to get back an array of the x-derivatives at the points (0, 0), (0, 0.1), (0, 0.2), and so on. Instead I get a single value, and I don't even know where this value 11 comes from; it seems to be related to the length of the array x?
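To spell out the expectation with plain numpy: the analytic x-derivative of f is 2*(x + y), so at the points (0, y_i) it is just 2*y_i.

```python
import numpy as np

x = np.linspace(0, 1, 11)
# Analytic x-derivative of (x + y)**2 is 2*(x + y), evaluated at the points
# (0, y_i) with the y_i taken from the array x above.
2 * (0.0 + x)
# array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])
```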
Yes, the new semantics of `grad` can be a bit confusing. We've always had precisely the same semantics in `elementwise_grad`, so the confusion itself isn't new. I'll try to explain the behavior here.

Originally, `grad` was only for scalar-output functions. We recently generalized it to accept array-output functions. The new definition of `grad` is actually very simple: `grad(f)` applies the transpose of the Jacobian of `f` to a vector of ones. Under some circumstances we can interpret this as a form of broadcasting (but see the terminology note below). With a one-argument function `f` we can consider four cases:

1. `f(x)` is a scalar. In this case `grad(f)(x)` gives the usual gradient.
2. `f(x)` is an array with the same shape as `x`, and the Jacobian of `f` is diagonal. In this case, left-multiplying the Jacobian by a vector of ones pulls out the diagonal as a vector, so each `grad(f)(x)[i]` represents the derivative of `f(x)[i]` with respect to `x[i]`. This is analogous to numpy's usual automatic mapping, by which scalar functions are applied elementwise to arrays.
3. `f(x)` is an array with each dimension either equal to 1 or equal to the corresponding dimension of `x` (e.g. `x.shape == (2, 3, 4)` and `f(x).shape == (2, 1, 4)`), and the elements of the Jacobian corresponding to different input and output indices are all zero, with the exception of indices for the singleton output dimensions. In this case we end up with automatic mapping over the non-singleton dimensions of `f(x)` and regular gradients over the singleton dimensions of `f(x)`. Note that case 3 (which actually subsumes 1 and 2) only represents a subset of the cases for which `x` and `f(x)` are numpy-broadcastable; specifically, `x` can't have singleton dimensions that don't correspond to singletons in `f(x)`.
4. Otherwise, `grad(f)(x)` doesn't have much interpretation besides "the vector-Jacobian product with a vector of ones" or, equivalently, "the gradient of `lambda x: sum(f(x))`". We should consider having this case raise an error rather than returning a head-scratching result. (Cases 2 and 4 are illustrated with the question's example just below.)
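To make this concrete, here is a small sketch that applies cases 2 and 4 to the example from the question (nothing beyond the `grad` calls already shown there; the `2 * x.sum()` check is just arithmetic):

```python
import numpy as np
from autograd import grad

f = lambda x, y: (x + y)**2
fgrad = grad(f, 0)
x = np.linspace(0, 1, 11)

# Case 2: f(x, 0.0) has the same shape as x and its Jacobian w.r.t. x is
# diagonal, so grad pulls out the diagonal, i.e. the elementwise derivatives.
fgrad(x, 0.0)    # array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])

# Case 4: the differentiated argument (0.0) is a scalar while f(0.0, x) is an
# array, so the result is the gradient of lambda a: np.sum(f(a, x)) at a = 0,
# which is sum_i 2*(0 + x[i]) = 2 * x.sum(). That is where the 11 comes from.
fgrad(0.0, x)    # 11.000000000000002
2 * x.sum()      # 11.0
```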
@mattjj and @j-towns, we often describe the new behavior of `grad` as "broadcasting". Broadcasting (whether numpy or radio) refers to a one-to-many fanout, so I'd prefer to describe the new behavior of `grad` as "automatic mapping". I wonder if changing our terminology might help avoid confusion. Thoughts?
A rule of thumb is that the result of `grad(f, argnum)(*args)` will always have the same shape as `args[argnum]`, i.e. it will always have the same shape as the argument that you've differentiated with respect to.

When you do

```python
fgrad(0.0, x)
```

you're differentiating w.r.t. `0.0`, so the returned value will be a float. When you do

```python
fgrad(x, 0.0)
```

you're differentiating w.r.t. `x`, which has shape `(11,)`, so the returned array has that shape too.
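Following this rule of thumb, one way to get the per-point x-derivatives along the y axis that the question was after is to differentiate with respect to an array of the desired shape rather than a scalar, so that case 2 above applies. A minimal sketch of that idea (the `np.zeros_like` trick is just one possibility, not an official recipe):

```python
import numpy as np
from autograd import grad

f = lambda x, y: (x + y)**2
fgrad = grad(f, 0)
x = np.linspace(0, 1, 11)

# Pass an array of zeros instead of the scalar 0.0: the result then has the
# shape of args[0], namely (11,), and each entry is the x-derivative of f at
# the point (0, y_i).
fgrad(np.zeros_like(x), x)
# array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])
```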