Dense layer doesn't flatten higher dimensional tensors
The documentation of the Dense layer claims to flatten the input if a tensor with rank > 2 is provided. However, what actually happens is that the Dense layer operates on the last dimension only, applying the same kernel at every position along the remaining axes.
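For concreteness, here is a minimal sketch (my own illustration, assuming tf.keras; not part of the original report) of what that means for the output shape:

```python
import tensorflow as tf

# Dense(4) on a rank-3 input: the kernel has shape (5, 4) and is applied
# along the last axis only; the middle axis is preserved, not flattened.
inp = tf.keras.Input(shape=(3, 5))
out = tf.keras.layers.Dense(4)(inp)
print(out.shape)  # (None, 3, 4) -- not (None, 4), as flattening would give
```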
You can verify this by comparing two models, one with a Flatten() layer and one without:
https://gist.github.com/FirefoxMetzger/44e9e056e45c1a3cc8000ab8d6f2cebe
The first model has only 10 + bias = 11 trainable parameters (the weights are reused along the first input dimension). The second model has 10*10 + bias = 101 trainable parameters. The output shapes are also completely different. I would have expected the result to be the same with or without the Flatten() layer…
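The gist itself is not reproduced here, but a sketch along these lines (assuming a (10, 10) input and a single output unit, to match the counts above) shows the difference:

```python
import tensorflow as tf

# Model A: Dense applied directly to the rank-3 input.
# Kernel shape (10, 1), shared along the first input axis:
# 10 weights + 1 bias = 11 parameters; output shape (None, 10, 1).
inputs_a = tf.keras.Input(shape=(10, 10))
model_a = tf.keras.Model(inputs_a, tf.keras.layers.Dense(1)(inputs_a))

# Model B: explicit Flatten first.
# Kernel shape (100, 1): 100 weights + 1 bias = 101 parameters;
# output shape (None, 1).
inputs_b = tf.keras.Input(shape=(10, 10))
flat = tf.keras.layers.Flatten()(inputs_b)
model_b = tf.keras.Model(inputs_b, tf.keras.layers.Dense(1)(flat))

print(model_a.count_params(), model_a.output_shape)  # 11  (None, 10, 1)
print(model_b.count_params(), model_b.output_shape)  # 101 (None, 1)
```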
It might very well be that I am misunderstanding something. If so, kindly point out my mistake =)

I just want to chime in that I was also confused by this documentation, as was this StackOverflow user: https://stackoverflow.com/questions/44611006/timedistributeddense-vs-dense-in-keras-same-number-of-parameters
If nothing else, I would really appreciate it if the note explicitly stated that “flatten” here means something different from the `Flatten` layer. Ideally, the documentation would give an example of an input of some shape and how that is “flattened” to produce the output shape.

I am also confused by this. However, applying a dense layer `D[k,l]` (of shape `(K, L)`) to each of the temporal components of an input `X[?,m,k]` (of shape `(?, M, K)`) is mathematically identical to the matrix multiplication `X * D`. This is just a happy coincidence. For the `TimeDistributed` wrapper to work with an arbitrary layer, Keras needs a “for loop” implementation of this multiplication rather than the fully vectorized one.

If the input were flattened to shape `(?, M*K)`, the layer would need a kernel of shape `(M*K, L)` and far more parameters. So what Dense actually does is not “a dot product with a flattened version of the input”; a dot product with the flattened version would correspond, conceptually, to `M` different copies of the dense layer of shape `(K, L)`, so the temporal components would not share weights. Perhaps that is what was meant by conceptual equivalency.
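To make the equivalence concrete, here is a small sketch (my own, assuming tf.keras and NumPy; not from the original thread) showing that Dense on a rank-3 input, `TimeDistributed(Dense)`, and a plain matrix product along the last axis all agree:

```python
import numpy as np
import tensorflow as tf

M, K, L = 3, 5, 4
dense = tf.keras.layers.Dense(L, use_bias=False)

x = np.random.rand(2, M, K).astype("float32")  # a batch of shape (?, M, K)
y_dense = dense(x)                             # Dense applied to the rank-3 input

# The same (K, L) kernel is reused for every temporal step, so the result
# equals a plain matrix product along the last axis: X @ D.
D = dense.kernel.numpy()                       # shape (K, L)
y_matmul = x @ D

# TimeDistributed loops the very same layer over the M axis -- same result.
y_td = tf.keras.layers.TimeDistributed(dense)(x)

print(np.allclose(y_dense, y_matmul), np.allclose(y_dense, y_td))  # True True
```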