Broyden defeats the purpose of DEQs?
Heya,
Thanks for your continued work in building better DEQs.
The main selling point of DEQs is that the solver can take as many steps as required to converge without increasing memory. This isn't true for your implementation of Broyden, which starts off with:
```python
Us = torch.zeros(bsz, total_hsize, seq_len, max_iters).to(dev)
VTs = torch.zeros(bsz, max_iters, total_hsize, seq_len).to(dev)
```
and therefore has a memory cost linear in `max_iters`, even though the ops aren't tracked. Anderson also keeps the previous `m` states in memory, where `m` is usually larger than the number of solver iterations needed anyway. Don't those solvers contradict the claim of constant memory cost?
On a related note, I've found it quite hard to modify these solvers even after going over the theory. Are there any notes or resources you could point to that would help people understand your implementation? Thanks!
Hello @polo5,
Thanks for your interest in our repo and DEQ!
To begin with, we want to caution that "constant memory cost" means constant w.r.t. the number of layers. That is, we only have one layer (e.g., one Transformer layer), and the memory consumption does not grow to that of 2, 3, etc. layers. That said, you are absolutely right that both Broyden and Anderson need to store some past fixed-point estimates. In fact, we analyzed this "hidden cost" in the Jacobian regularization paper (see Sec. 3.4).
However, we do want to point out that this is only a very minimal memory cost, because we only need to store the activation tensor. For example, in the DEQ-Transformer case (which is a pretty large-scale model), the hidden units could have shape `[bsz x seq_len x n_hid] = [15 x 150 x 700]`. This is 15 × 150 × 700 = 1,575,000 floating-point numbers in total, and thus 1,575,000 × 4 = 6,300,000 bytes (each float is 4 bytes), i.e., only 6.3MB (or 0.0063GB) of memory per solver iteration. Therefore, even if we run Broyden for 10-20 iterations, it adds only very little memory cost in itself.

In contrast, conventional neural networks are costly because the layers are complex (each layer could cost hundreds or thousands of MBs). For example, in a Transformer layer, not only do we have to store the output (which is all a DEQ needs), but also everything that happened within the layer, e.g., the activation after self-attention, the one after LayerNorm, etc.
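To make the arithmetic concrete, here is a back-of-the-envelope sketch using the shapes quoted above (the `max_iters` value and the assumption that `total_hsize` is roughly `n_hid` are just for illustration):

```python
# Rough memory estimate for the per-iteration state a DEQ solver stores,
# using the DEQ-Transformer shapes quoted above.
bsz, seq_len, n_hid = 15, 150, 700
bytes_per_float = 4  # float32

per_iter_bytes = bsz * seq_len * n_hid * bytes_per_float
print(f"per solver iteration: {per_iter_bytes / 1e6:.1f} MB")  # ~6.3 MB

# The Us/VTs buffers in the Broyden code each hold one such tensor per iteration
# (assuming total_hsize is about n_hid), so the total grows linearly in
# max_iters, but stays small in absolute terms.
max_iters = 20  # assumed solver budget
print(f"Us + VTs for {max_iters} iterations: {2 * max_iters * per_iter_bytes / 1e6:.1f} MB")
```

So even a generous solver budget costs a few hundred MB under these shapes, which is the kind of "hidden cost" mentioned above.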
On your second question, my Broyden's method implementation is solely based on its Wikipedia page: https://en.wikipedia.org/wiki/Broyden%27s_method (look at the "good Broyden's method").
However, in order to make it GPU-efficient, there are two things that may make the resemblance a bit hard to see, and I'm happy to explain:

1. By the "good Broyden's method", each update to the inverse Jacobian is rank-one: J_n^{-1} = J_{n-1}^{-1} + u_n v_n^T, where u_n = (\delta x_n - J_{n-1}^{-1} \delta f_n) / (\delta x_n^T J_{n-1}^{-1} \delta f_n) (a `dx1` vector) and v_n^T = \delta x_n^T J_{n-1}^{-1} (a `1xd` vector). Moreover, I initialized J_0^{-1} to be -I (i.e., the negative identity matrix). Therefore, for any n, I can essentially write J_n^{-1} as J_n^{-1} = -I + u_1 v_1^T + u_2 v_2^T + ... This means that instead of actually keeping a large matrix J^{-1} around in memory, I can simply store the u vectors and the v vectors. This amounts to keeping a big U matrix and a big V matrix whose columns are u_1, u_2, ... and v_1, v_2, ...; in other words, we can write J_n^{-1} = -I + UV^T. At each Broyden iteration, I append the new u and v to the matrix columns, which is here. So after L steps of Broyden iterations, U is of shape `dxL` and V^T has shape `Lxd`.

2. Since U has dimension `dxL` and V^T has dimension `Lxd`, where L is the # of past Broyden steps, and g has dimension `dx1`, it is much more efficient to compute UV^T g as U(V^T g), because V^T g is simply a matrix-vector product. This is especially important when the dimension d is large. This is therefore this step. The `matvec` operation is the key to making this efficient. Similarly, the update rule itself contains terms like J_{n-1}^{-1} \delta f_n, which can be computed in the same fashion: compute V^T \delta f_n first, and then U (V^T \delta f_n).
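For readers trying to map this onto code, here is a minimal, un-batched sketch of the bookkeeping described above (the function names are mine, the state is flattened to a single `d`-dimensional vector, and the actual repo code is batched and more elaborate):

```python
import torch

def inv_jacobian_matvec(U, VT, g):
    """Apply J^{-1} = -I + U V^T to a vector g without forming the d x d matrix.

    U:  [d, L]  (columns u_1, ..., u_L)
    VT: [L, d]  (rows v_1^T, ..., v_L^T)
    g:  [d]
    """
    # U (V^T g): V^T g is an L-vector, so the cost is O(dL) rather than O(d^2).
    return -g + U @ (VT @ g)

def broyden_rank_one_update(U, VT, dx, df):
    """Append the rank-one 'good Broyden' correction u v^T to the running U and V^T.

    dx = x_n - x_{n-1},  df = g(x_n) - g(x_{n-1}).
    """
    Jinv_df = inv_jacobian_matvec(U, VT, df)   # J_{n-1}^{-1} df, again as U (V^T df)
    vT = -dx + (dx @ U) @ VT                   # v^T = dx^T J_{n-1}^{-1}
    u = (dx - Jinv_df) / (dx @ Jinv_df)        # scalar denominator dx^T J_{n-1}^{-1} df
    U = torch.cat([U, u[:, None]], dim=1)      # one new column per Broyden step
    VT = torch.cat([VT, vT[None, :]], dim=0)   # one new row per Broyden step
    return U, VT
```

Starting from empty matrices (`U = x.new_zeros(d, 0)`, `VT = x.new_zeros(0, d)`), a Broyden step is then `x = x - inv_jacobian_matvec(U, VT, gx)`, and `U`/`VT` grow by one column/row per iteration, which is exactly the `max_iters`-sized storage discussed in the question.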
I hope this answers your question!
Great points @jerrybai1995 @Gsunshine!

I agree that root solving and minimization aren't the same thing here, but the line can be quite blurry in some problems. In the vanilla DEQ setting we find the root `x*` of `g(x) = f(x) - x`, but of course this is equivalent to `x* = argmin ||g(x)||`. In fact I can train my model fine using basic gradient descent on this objective, but the convergence error is usually much larger than with Broyden in my implementation (@jerrybai1995 I should try L-BFGS though, thx). Turning a root-solving problem into an optimization problem can often make things easier. In my case I want to do root solving for `g(x) = f(x) - x` given some regularization/argmin conditions on the root, `x = argmin \phi(x)` (which only makes sense because there exist several roots `x*` in DEQs). This can easily be written as an optimization problem (using Lagrange multipliers). But in practice, since Broyden is so good at converging fast, I'm trying to cast my objective as an efficient root-solving problem. @Gsunshine, in this case one cannot solve for the root of `g(x) = f(x) - x + \nabla \phi(x)` (which could otherwise be nicely rewritten as a FP problem as you suggest), because its roots aren't roots of `f(x) - x`. Instead one would need to solve for roots of `g(x) = |f(x) - x| + |\nabla \phi(x)|` (assuming `\phi(x)` has a single extremum). Annoyingly this isn't a FP problem anymore.

The second issue here is that solvers like Broyden seem to struggle with absolute-value terms. The function `g(x) = |f(x) - x|` has exactly the same roots as `g(x) = f(x) - x`, but Broyden struggles to find roots of the former (which I guess is what you'd expect from simpler solvers like bisection, where the sign is informative?). This can be annoying in some DEQ variants where one may want to find simultaneous roots of two functions `f(x)` and `h(x)`. As in (1), you cannot write `g(x) = (f(x) - x) + (h(x) - x)`, but instead you'd need something like `g(x) = |f(x) - x| + |h(x) - x|`, which Broyden would struggle with.

Thanks a lot for the discussion! I think DEQs are very promising and don't get the attention they deserve 😉
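For anyone following along, here is a minimal toy sketch of the two formulations being contrasted above (the map `f`, the dimensions, and the optimizer settings are made up for illustration; plain fixed-point iteration stands in for a proper root solver such as Broyden):

```python
import torch

def f(x):
    # Toy contractive map standing in for the DEQ layer f(x).
    return 0.5 * torch.tanh(x) + 0.1

x0 = torch.zeros(8)

# View 1: root solving for g(x) = f(x) - x, here via plain fixed-point iteration.
x = x0.clone()
for _ in range(50):
    x = f(x)
print("fixed-point iteration residual:", (f(x) - x).norm().item())

# View 2: minimization, x* = argmin ||g(x)||, here via gradient descent on ||g(x)||^2.
x = x0.clone().requires_grad_(True)
opt = torch.optim.SGD([x], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss = (f(x) - x).pow(2).sum()
    loss.backward()
    opt.step()
print("gradient-descent residual:", (f(x) - x).norm().item())
```

On a well-behaved toy map both views find the same `x*`; the point raised above is that once absolute-value terms like `|f(x) - x| + |\nabla \phi(x)|` enter, only the optimization view remains natural, while quasi-Newton root solvers lose the sign information they rely on.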