Evaluating a derivative elementwise
Maybe I'm using the library wrong, but I wonder whether what I'm observing is a bug or intended behavior. I'm using the latest GitHub master version.

Say I have a scalar function of two variables x and y, but I want to retain the option to pass in numpy arrays and get the values/derivatives elementwise. I tried the following:
```python
import numpy as np
from autograd import grad

f = lambda x, y: (x + y)**2
fgrad = grad(f, 0)

x = np.linspace(0, 1, 11)
fgrad(x, 0.0)
# array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])
```
Looks correct: it computes `2*x`, as it should. But strange things happen if I want to evaluate the derivative at several points lying along the y axis:

```python
fgrad(0.0, x)
# 11.000000000000002
```

I would expect to get back an array of the x-derivatives at the points (0, 0), (0, 0.1), (0, 0.2), and so on. Instead I get a single value, and I don't even know where this value 11 comes from; it seems to be related to the length of the array x?
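To spell out the expectation with plain numpy: the analytic x-derivative of f is 2*(x + y), so at the points (0, y_i) it is just 2*y_i.

```python
import numpy as np

x = np.linspace(0, 1, 11)
# Analytic x-derivative of (x + y)**2 is 2*(x + y), evaluated at the points
# (0, y_i) with the y_i taken from the array x above.
2 * (0.0 + x)
# array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])
```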
Yes, the new semantics of `grad` can be a bit confusing. We've always had precisely the same semantics in `elementwise_grad`, so the confusion itself isn't new. I'll try to explain the behavior here.

Originally, `grad` was only for scalar-output functions. We recently generalized it to accept array-output functions. The new definition of `grad` is actually very simple: `grad(f)` applies the transpose of the Jacobian of `f` to a vector of ones. Under some circumstances we can interpret this as a form of broadcasting (but see the terminology note below). With a one-argument function `f` we can consider four cases:

1. `f(x)` is a scalar. In this case `grad(f)(x)` gives the usual gradient.
2. `f(x)` is an array with the same shape as `x`, and the Jacobian of `f` is diagonal. In this case, left-multiplying the Jacobian by a vector of ones pulls out the diagonal as a vector, so each `grad(f)(x)[i]` represents the derivative of `f(x)[i]` with respect to `x[i]`. This is analogous to numpy's usual automatic mapping, by which scalar functions are applied elementwise to arrays.
3. `f(x)` is an array with each dimension either equal to 1 or equal to the corresponding dimension of `x` (e.g. `x.shape == (2, 3, 4)` and `f(x).shape == (2, 1, 4)`), and the elements of the Jacobian corresponding to different input and output indices are all zero, with the exception of indices for the singleton output dimensions. In this case we end up with automatic mapping over the non-singleton dimensions of `f(x)` and regular gradients over the singleton dimensions of `f(x)`. Note that case 3 (which actually subsumes 1 and 2) only represents a subset of the cases for which `x` and `f(x)` are numpy-broadcastable; specifically, `x` can't have singleton dimensions that don't correspond to singletons in `f(x)`.
4. Otherwise, `grad(f)(x)` doesn't have much interpretation besides "the vector-Jacobian product with a vector of ones" or, equivalently, "the gradient of `lambda x: sum(f(x))`". We should consider having this case raise an error rather than returning a head-scratching result. (Cases 2 and 4 are illustrated with the question's example just below.)
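To make this concrete, here is a small sketch that applies cases 2 and 4 to the example from the question (nothing beyond the `grad` calls already shown there; the `2 * x.sum()` check is just arithmetic):

```python
import numpy as np
from autograd import grad

f = lambda x, y: (x + y)**2
fgrad = grad(f, 0)
x = np.linspace(0, 1, 11)

# Case 2: f(x, 0.0) has the same shape as x and its Jacobian w.r.t. x is
# diagonal, so grad pulls out the diagonal, i.e. the elementwise derivatives.
fgrad(x, 0.0)    # array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])

# Case 4: the differentiated argument (0.0) is a scalar while f(0.0, x) is an
# array, so the result is the gradient of lambda a: np.sum(f(a, x)) at a = 0,
# which is sum_i 2*(0 + x[i]) = 2 * x.sum(). That is where the 11 comes from.
fgrad(0.0, x)    # 11.000000000000002
2 * x.sum()      # 11.0
```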
@mattjj and @j-towns, we often describe the new behavior of `grad` as "broadcasting". Broadcasting (whether numpy or radio) refers to a one-to-many fanout, so I'd prefer to describe the new behavior of `grad` as "automatic mapping". I wonder if changing our terminology might help avoid confusion. Thoughts?
A rule of thumb is that the result of `grad(f, argnum)(*args)` will always have the same shape as `args[argnum]`, i.e. it will always have the same shape as the argument that you've differentiated with respect to.

When you do

```python
fgrad(0.0, x)
```

you're differentiating w.r.t. `0.0`, so the returned value will be a float. When you do

```python
fgrad(x, 0.0)
```

you're differentiating w.r.t. `x`, which has shape `(11,)`, so the returned array has that shape too.
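Following this rule of thumb, one way to get the per-point x-derivatives along the y axis that the question was after is to differentiate with respect to an array of the desired shape rather than a scalar, so that case 2 above applies. A minimal sketch of that idea (the `np.zeros_like` trick is just one possibility, not an official recipe):

```python
import numpy as np
from autograd import grad

f = lambda x, y: (x + y)**2
fgrad = grad(f, 0)
x = np.linspace(0, 1, 11)

# Pass an array of zeros instead of the scalar 0.0: the result then has the
# shape of args[0], namely (11,), and each entry is the x-derivative of f at
# the point (0, y_i).
fgrad(np.zeros_like(x), x)
# array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])
```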