Broyden defeats the purpose of DEQs?
Heya,
Thanks for your continued work in building better DEQs.
The main selling point of DEQs is that the solver can take as many steps as required to converge without increasing memory. This isn't true for your implementation of Broyden, which starts off with:
```python
Us = torch.zeros(bsz, total_hsize, seq_len, max_iters).to(dev)
VTs = torch.zeros(bsz, max_iters, total_hsize, seq_len).to(dev)
```
and therefore has a memory cost linear in `max_iters`, even though the ops aren't tracked. Anderson also keeps the previous `m` states in memory, where `m` is usually larger than the number of solver iterations needed anyway. Don't those solvers contradict the claim of constant memory cost?
On a related note, I've found it quite hard to modify these solvers even after going over the theory. Are there any notes or resources you could point to that would help people understand your implementation? Thanks!
Hello @polo5,
Thanks for your interest in our repo and DEQ!
To begin with, we want to caution that "constant memory cost" means constant w.r.t. the number of layers. That is, we only have one layer (e.g., one Transformer layer), and the memory consumption does not grow to that of 2, 3, etc. layers. That said, you are absolutely right that both Broyden and Anderson need to store some past fixed-point estimates. In fact, we analyzed this "hidden cost" in the Jacobian regularization paper (see Sec. 3.4).
However, we do want to point out that this is only a very minimal memory cost, because we only need to store the activation tensor. For example, in the DEQ-Transformer case (which is a pretty large-scale model), the hidden units could have shape `[bsz x seq_len x n_hid] = [15 x 150 x 700]`. This is 15 × 150 × 700 = 1,575,000 floating-point numbers in total, and thus 1,575,000 × 4 = 6,300,000 bytes (each float is 4 bytes), i.e., only 6.3MB (or 0.0063GB) of memory per solver iteration. Therefore, even if we run Broyden for 10-20 iterations, it adds only very little memory cost in itself.

In contrast, conventional neural networks are costly because the layers are complex (each layer could cost hundreds or thousands of MBs). For example, in a Transformer layer, not only do we have to store the output (which is all a DEQ needs), but also everything that happened within the layer, e.g., the activation after self-attention, the one after LayerNorm, etc.
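To make the arithmetic concrete, here is a back-of-the-envelope sketch using the shapes quoted above (the `max_iters` value and the assumption that `total_hsize` is roughly `n_hid` are just for illustration):

```python
# Rough memory estimate for the per-iteration state a DEQ solver stores,
# using the DEQ-Transformer shapes quoted above.
bsz, seq_len, n_hid = 15, 150, 700
bytes_per_float = 4  # float32

per_iter_bytes = bsz * seq_len * n_hid * bytes_per_float
print(f"per solver iteration: {per_iter_bytes / 1e6:.1f} MB")  # ~6.3 MB

# The Us/VTs buffers in the Broyden code each hold one such tensor per iteration
# (assuming total_hsize is about n_hid), so the total grows linearly in
# max_iters, but stays small in absolute terms.
max_iters = 20  # assumed solver budget
print(f"Us + VTs for {max_iters} iterations: {2 * max_iters * per_iter_bytes / 1e6:.1f} MB")
```

So even a generous solver budget costs a few hundred MB under these shapes, which is the kind of "hidden cost" mentioned above.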
On your second question, my Broyden's method implementation is solely based on its Wikipedia page: https://en.wikipedia.org/wiki/Broyden%27s_method (look at the "good Broyden's method").
However, in order to make it GPU-efficient, there are two things that may make the resemblance a bit hard to see, and I'm happy to explain:

1. By the "good Broyden's method", each update to the inverse Jacobian is rank-one: J_n^{-1} = J_{n-1}^{-1} + u_n v_n^T, where u_n = (\delta x_n - J_{n-1}^{-1} \delta f_n) / (\delta x_n^T J_{n-1}^{-1} \delta f_n) (a `dx1` vector) and v_n^T = \delta x_n^T J_{n-1}^{-1} (a `1xd` vector). Moreover, I initialized J_0^{-1} to be -I (i.e., the negative identity matrix). Therefore, for any n, I can essentially write J_n^{-1} as J_n^{-1} = -I + u_1 v_1^T + u_2 v_2^T + ... This means that instead of actually keeping a large matrix J^{-1} around in memory, I can simply store the u vectors and the v vectors. This amounts to keeping a big U matrix and a big V matrix whose columns are u_1, u_2, ... and v_1, v_2, ...; in other words, we can write J_n^{-1} = -I + UV^T. At each Broyden iteration, I append the new u and v to the matrix columns, which is here. So after L steps of Broyden iterations, U is of shape `dxL` and V^T has shape `Lxd`.

2. Since U has dimension `dxL` and V^T has dimension `Lxd`, where L is the # of past Broyden steps, and g has dimension `dx1`, it is much more efficient to compute UV^T g as U(V^T g), because V^T g is simply a matrix-vector product. This is especially important when the dimension d is large. This is therefore this step. The `matvec` operation is the key to making this efficient. Similarly, the update rule itself contains terms like J_{n-1}^{-1} \delta f_n, which can be computed in the same fashion: compute V^T \delta f_n first, and then U (V^T \delta f_n).
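For readers trying to map this onto code, here is a minimal, un-batched sketch of the bookkeeping described above (the function names are mine, the state is flattened to a single `d`-dimensional vector, and the actual repo code is batched and more elaborate):

```python
import torch

def inv_jacobian_matvec(U, VT, g):
    """Apply J^{-1} = -I + U V^T to a vector g without forming the d x d matrix.

    U:  [d, L]  (columns u_1, ..., u_L)
    VT: [L, d]  (rows v_1^T, ..., v_L^T)
    g:  [d]
    """
    # U (V^T g): V^T g is an L-vector, so the cost is O(dL) rather than O(d^2).
    return -g + U @ (VT @ g)

def broyden_rank_one_update(U, VT, dx, df):
    """Append the rank-one 'good Broyden' correction u v^T to the running U and V^T.

    dx = x_n - x_{n-1},  df = g(x_n) - g(x_{n-1}).
    """
    Jinv_df = inv_jacobian_matvec(U, VT, df)   # J_{n-1}^{-1} df, again as U (V^T df)
    vT = -dx + (dx @ U) @ VT                   # v^T = dx^T J_{n-1}^{-1}
    u = (dx - Jinv_df) / (dx @ Jinv_df)        # scalar denominator dx^T J_{n-1}^{-1} df
    U = torch.cat([U, u[:, None]], dim=1)      # one new column per Broyden step
    VT = torch.cat([VT, vT[None, :]], dim=0)   # one new row per Broyden step
    return U, VT
```

Starting from empty matrices (`U = x.new_zeros(d, 0)`, `VT = x.new_zeros(0, d)`), a Broyden step is then `x = x - inv_jacobian_matvec(U, VT, gx)`, and `U`/`VT` grow by one column/row per iteration, which is exactly the `max_iters`-sized storage discussed in the question.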
I hope this answers your question!
Great points @jerrybai1995 @Gsunshine!

I agree that root solving and minimization aren't the same thing here, but the line can be quite blurry in some problems. In the vanilla DEQ setting we find the root `x*` of `g(x) = f(x) - x`, but of course this is equivalent to `x* = argmin ||g(x)||`. In fact I can train my model fine using basic gradient descent on this objective, but the convergence error is usually much larger than with Broyden in my implementation (@jerrybai1995 I should try L-BFGS though, thx). Turning a root-solving problem into an optimization problem can often make things easier. In my case I want to do root solving for `g(x) = f(x) - x` given some regularization/argmin conditions on the root, `x = argmin \phi(x)` (which only makes sense because there exist several roots `x*` in DEQs). This can easily be written as an optimization problem (using Lagrange multipliers). But in practice, since Broyden is so good at converging fast, I'm trying to cast my objective as an efficient root-solving problem. @Gsunshine, in this case one cannot solve for the root of `g(x) = f(x) - x + \nabla \phi(x)` (which could otherwise be nicely rewritten as a FP problem as you suggest), because its roots aren't roots of `f(x) - x`. Instead one would need to solve for roots of `g(x) = |f(x) - x| + |\nabla \phi(x)|` (assuming `\phi(x)` has a single extremum). Annoyingly this isn't a FP problem anymore.

The second issue here is that solvers like Broyden seem to struggle with absolute-value terms. The function `g(x) = |f(x) - x|` has exactly the same roots as `g(x) = f(x) - x`, but Broyden struggles to find roots of the former (which I guess is what you'd expect from simpler solvers like bisection, where the sign is informative?). This can be annoying in some DEQ variants where one may want to find simultaneous roots of two functions `f(x)` and `h(x)`. As in (1), you cannot write `g(x) = (f(x) - x) + (h(x) - x)`, but instead you'd need something like `g(x) = |f(x) - x| + |h(x) - x|`, which Broyden would struggle with.

Thanks a lot for the discussion! I think DEQs are very promising and don't get the attention they deserve 😉
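For anyone following along, here is a minimal toy sketch of the two formulations being contrasted above (the map `f`, the dimensions, and the optimizer settings are made up for illustration; plain fixed-point iteration stands in for a proper root solver such as Broyden):

```python
import torch

def f(x):
    # Toy contractive map standing in for the DEQ layer f(x).
    return 0.5 * torch.tanh(x) + 0.1

x0 = torch.zeros(8)

# View 1: root solving for g(x) = f(x) - x, here via plain fixed-point iteration.
x = x0.clone()
for _ in range(50):
    x = f(x)
print("fixed-point iteration residual:", (f(x) - x).norm().item())

# View 2: minimization, x* = argmin ||g(x)||, here via gradient descent on ||g(x)||^2.
x = x0.clone().requires_grad_(True)
opt = torch.optim.SGD([x], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss = (f(x) - x).pow(2).sum()
    loss.backward()
    opt.step()
print("gradient-descent residual:", (f(x) - x).norm().item())
```

On a well-behaved toy map both views find the same `x*`; the point raised above is that once absolute-value terms like `|f(x) - x| + |\nabla \phi(x)|` enter, only the optimization view remains natural, while quasi-Newton root solvers lose the sign information they rely on.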