Dense layer doesn't flatten higher dimensional tensors
The documentation of the Dense layer claims to flatten the input if a tensor with rank > 2 is provided. However, what actually happens is that the Dense layer operates on the last dimension only, applying the same kernel at every position along the remaining axes.
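For concreteness, here is a minimal sketch (my own illustration, assuming tf.keras; not part of the original report) of what that means for the output shape:

```python
import tensorflow as tf

# Dense(4) on a rank-3 input: the kernel has shape (5, 4) and is applied
# along the last axis only; the middle axis is preserved, not flattened.
inp = tf.keras.Input(shape=(3, 5))
out = tf.keras.layers.Dense(4)(inp)
print(out.shape)  # (None, 3, 4) -- not (None, 4), as flattening would give
```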
You can verify this by comparing two models, one with a Flatten() layer and one without:
https://gist.github.com/FirefoxMetzger/44e9e056e45c1a3cc8000ab8d6f2cebe
The first model has only 10 + bias = 11 trainable parameters (the weights are reused along the first input dimension). The second model has 10*10 + bias = 101 trainable parameters. The output shapes are also completely different. I would have expected the result to be the same with or without the Flatten() layer…
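The gist itself is not reproduced here, but a sketch along these lines (assuming a (10, 10) input and a single output unit, to match the counts above) shows the difference:

```python
import tensorflow as tf

# Model A: Dense applied directly to the rank-3 input.
# Kernel shape (10, 1), shared along the first input axis:
# 10 weights + 1 bias = 11 parameters; output shape (None, 10, 1).
inputs_a = tf.keras.Input(shape=(10, 10))
model_a = tf.keras.Model(inputs_a, tf.keras.layers.Dense(1)(inputs_a))

# Model B: explicit Flatten first.
# Kernel shape (100, 1): 100 weights + 1 bias = 101 parameters;
# output shape (None, 1).
inputs_b = tf.keras.Input(shape=(10, 10))
flat = tf.keras.layers.Flatten()(inputs_b)
model_b = tf.keras.Model(inputs_b, tf.keras.layers.Dense(1)(flat))

print(model_a.count_params(), model_a.output_shape)  # 11  (None, 10, 1)
print(model_b.count_params(), model_b.output_shape)  # 101 (None, 1)
```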
It might very well be that I am misunderstanding something. If so, kindly point out my mistake =)

I just want to chime in that I was also confused by this documentation, as was this StackOverflow user: https://stackoverflow.com/questions/44611006/timedistributeddense-vs-dense-in-keras-same-number-of-parameters
If nothing else, I would really appreciate it if the note explicitly stated that “flatten” here means something different from the `Flatten` layer. Ideally, the documentation would give an example of an input of some shape and how that is “flattened” to produce the output shape.

I am also confused by this. However, applying a dense layer `D[k,l]` (of shape `(K, L)`) to each of the temporal components of an input `X[?,m,k]` (of shape `(?, M, K)`) is mathematically identical to the matrix multiplication `X * D`. This is just a happy coincidence. For the `TimeDistributed` wrapper to work with an arbitrary layer, Keras needs a “for loop” implementation of this multiplication rather than the fully vectorized one.

If the input were flattened to shape `(?, M*K)`, the layer would need a kernel of shape `(M*K, L)` and far more parameters. So what Dense actually does is not “a dot product with a flattened version of the input”; a dot product with the flattened version would correspond, conceptually, to `M` different copies of the dense layer of shape `(K, L)`, so the temporal components would not share weights. Perhaps that is what was meant by conceptual equivalency.
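To make the equivalence concrete, here is a small sketch (my own, assuming tf.keras and NumPy; not from the original thread) showing that Dense on a rank-3 input, `TimeDistributed(Dense)`, and a plain matrix product along the last axis all agree:

```python
import numpy as np
import tensorflow as tf

M, K, L = 3, 5, 4
dense = tf.keras.layers.Dense(L, use_bias=False)

x = np.random.rand(2, M, K).astype("float32")  # a batch of shape (?, M, K)
y_dense = dense(x)                             # Dense applied to the rank-3 input

# The same (K, L) kernel is reused for every temporal step, so the result
# equals a plain matrix product along the last axis: X @ D.
D = dense.kernel.numpy()                       # shape (K, L)
y_matmul = x @ D

# TimeDistributed loops the very same layer over the M axis -- same result.
y_td = tf.keras.layers.TimeDistributed(dense)(x)

print(np.allclose(y_dense, y_matmul), np.allclose(y_dense, y_td))  # True True
```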