
Passing explicit training=True to DNNRankingNetwork instance when invoking its call() function produces different output during training

See original GitHub issue

Hello Team,

Library versions:

  • TensorFlow 2.5.0
  • TensorFlow Ranking 0.4.2

TL;DR:

I trained two models (let's call them A and B) with the same fixed seed, the same training dataset input, and the same hyperparameters. On each re-training run, each model consistently and deterministically produces the same results during training. The only difference is that when training model B, I explicitly pass the argument training=True to the DNNRankingNetwork instance (i.e., when invoking its parent's call() function) while building the tf.keras.Model instance for training.

Because I pass in training=True when constructing model B, its training output is consistently and deterministically different from the output of model A.

Details

First, I noticed in https://github.com/tensorflow/ranking/blob/v0.4.2/tensorflow_ranking/python/keras/model.py#L31-L78 that when calling the DNNRankingNetwork instance (which is actually an instance of tf.keras.layers.Layer through the inheritance chain), you can pass an additional parameter training=True, e.g.:

network(inputs=keras_inputs, mask=mask, training=True)

According to the docstring, this boolean argument controls whether the layer runs in training or inference mode. By default it is None.
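To make the semantics of that flag concrete, here is a minimal pure-Python sketch (no TensorFlow required) of the pattern described above: a layer whose call() takes training=None and branches between a training-mode path and an inference path. ToyLayer and its behavior are illustrative stand-ins, not the actual DNNRankingNetwork implementation.

```python
class ToyLayer:
    """Toy stand-in for a Keras-style layer with a training flag."""

    def call(self, inputs, training=None):
        if training:
            # Pretend "training-mode" behavior (e.g. dropout being active).
            return [x * 2 for x in inputs]
        # training is False OR None: inference path in this simplified sketch.
        return list(inputs)


layer = ToyLayer()
print(layer.call([1, 2], training=True))  # -> [2, 4] (training-mode path)
print(layer.call([1, 2]))                 # -> [1, 2] (training=None is falsy here)
```

Note that in this sketch training=None silently behaves like training=False, which is exactly the ambiguity the rest of this report traces through the real code.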

I tried to trace how it is used. When we pass an input of type TensorSpec to DNNRankingNetwork, we actually invoke its .call(..) under the hood. That in turn calls tensorflow_ranking/python/keras EncodeListwiseFeatures.call(…), which calls tf.keras.layers.DenseFeatures.call(…).

By default, in DenseFeatures the training arg is None, which triggers the fallback:

training = backend.learning_phase()

https://github.com/keras-team/keras/blob/2f1cc1032ea51e5762ea4ed24e33deb33bc37075/keras/backend.py#L311-L337
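The fallback above can be sketched in plain Python: when the explicit training argument is None, the value is resolved from a global learning-phase flag instead, mirroring what keras.backend.learning_phase() does conceptually. The global variable and function names here are simplified stand-ins, not the real Keras internals.

```python
# Hypothetical global learning-phase flag: 0 = inference, 1 = training.
_LEARNING_PHASE = 0


def resolve_training(training):
    """Resolve an explicit training arg, falling back to the learning phase."""
    if training is None:
        # No explicit value given: defer to the global learning phase,
        # as DenseFeatures does via backend.learning_phase().
        return bool(_LEARNING_PHASE)
    return training


print(resolve_training(True))   # -> True  (explicit value wins)
print(resolve_training(None))   # -> False (falls back to _LEARNING_PHASE == 0)
```

This is why a training=None reaching DenseFeatures is not inert: it gets replaced by whatever the learning phase happens to be at that point.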

Eventually, in DenseFeatures, the training param gets passed to FeatureColumn.get_dense_tensor(..).

The only place I see that sets the training param to something other than None is the score() definition in DNNRankingNetwork, which has training=True by default.

Having said that, when the DNNRankingNetwork instance is called (i.e., network(inputs=keras_inputs, mask=mask)), training is None.

Therefore, DNNRankingNetwork's parent RankingNetwork.call(..) invokes UnivariateRankingNetwork.compute_logits(...) with training=None. compute_logits(..) then invokes listwise_scoring(...) with training=None as well, which in turn passes training=None to DNNRankingNetwork.score(...) (the scorer variable inside listwise_scoring(..)), thus overriding score's default training=True.

In other words, it seems that at runtime training=None gets propagated all the way down to score(..) of the DNNRankingNetwork instance. Could you please confirm / sanity-check?
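The suspected propagation chain can be reduced to a small runnable sketch: score() defaults training=True, but every intermediate caller threads its own training value through explicitly, so a None arriving at the top of the chain silently overrides that default. Function names mirror the report but are simplified stand-ins for the real tensorflow_ranking code.

```python
def score(features, training=True):
    """Stand-in for DNNRankingNetwork.score(..): defaults to training=True."""
    return "training" if training else "inference"


def listwise_scoring(scorer, features, training=None):
    # Passes training through verbatim -- so score's default of True
    # never applies when a caller hands us training=None.
    return scorer(features, training=training)


def compute_logits(features, training=None):
    """Stand-in for UnivariateRankingNetwork.compute_logits(...)."""
    return listwise_scoring(score, features, training=training)


print(compute_logits([1.0]))                 # -> "inference": None overrode the default
print(compute_logits([1.0], training=True))  # -> "training"
```

If the real code follows this shape, score(..)'s training=True default is effectively dead at runtime unless the caller sets the flag at the top of the chain, which matches the observed difference between models A and B.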

Questions

It seems that passing training=True to the DNNRankingNetwork instance (when invoking its call() function) affects the training output.

  1. Is there any chance that DNNRankingNetwork’s score(..) at runtime is called with training=None in the current implementation of the function in dnn.py?
  2. Why does model.py not call the DNNRankingNetwork instance with an explicit training=True when creating the tf.keras.Model instance?
  3. As a better practice, should the DNNRankingNetwork layer instance be called with an explicit training=True in general when creating a model instance for training? I am asking because eventually the call path in keras.layers.Layer will be invoked, and it has a non-trivial decision tree for handling the training variable to determine whether we are in training mode; see _set_training_mode(…) and _functional_construction_call(…).
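As context for question 3, the kind of precedence that decision tree applies can be sketched as follows. This is a hedged simplification, not the actual Keras logic: the assumption is that an explicit call argument wins over a value propagated from an enclosing layer call, which in turn wins over the global learning phase.

```python
def resolve_training_mode(explicit_arg, parent_value, learning_phase):
    """Hypothetical precedence for the training flag (simplified).

    explicit_arg:   training= passed directly to this layer's call
    parent_value:   training value propagated from an enclosing layer call
    learning_phase: global fallback (e.g. 0/False inference, 1/True training)
    """
    if explicit_arg is not None:
        return explicit_arg          # caller said so explicitly
    if parent_value is not None:
        return parent_value          # inherited from the outer call
    return learning_phase            # last-resort global fallback


print(resolve_training_mode(True, None, 0))   # -> True
print(resolve_training_mode(None, False, 1))  # -> False
print(resolve_training_mode(None, None, 1))   # -> 1
```

Under this model, passing training=True at the top-level network call is the only way to guarantee the flag reaches the inner layers regardless of how the fallbacks resolve.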

Issue Analytics

  • State:open
  • Created 2 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

1 reaction
azagniotov commented, Feb 7, 2022

I trained model A without passing an explicit training=True to the DNNRankingNetwork layer instance when calling it (i.e., network(inputs=keras_inputs, mask=mask)). The following epoch val_loss output was observed during training:

Epoch 00001: val_loss improved from inf to -0.83492
Epoch 00002: val_loss improved from -0.83492 to -0.84086
Epoch 00003: val_loss improved from -0.84086 to -0.84360
Epoch 00004: val_loss did not improve from -0.84360
Epoch 00005: val_loss improved from -0.84360 to -0.84488
Epoch 00006: val_loss improved from -0.84488 to -0.84637
...
...
Epoch 00010: val_loss did not improve from -0.84837
...
...
Epoch 00020: val_loss did not improve from -0.84837
...
...
Epoch 00050: val_loss did not improve from -0.84837

Now I train model B by calling the DNNRankingNetwork layer instance with an explicit training=True. The following epoch output was observed during training:

Epoch 00001: val_loss improved from inf to -0.83945
Epoch 00002: val_loss improved from -0.83945 to -0.83998
Epoch 00003: val_loss improved from -0.83998 to -0.84312
Epoch 00004: val_loss improved from -0.84312 to -0.84719
Epoch 00005: val_loss improved from -0.84719 to -0.84959
...
...
Epoch 00010: val_loss did not improve from -0.84959
...
...
Epoch 00020: val_loss did not improve from -0.85404
...
...
Epoch 00050: val_loss did not improve from -0.85860

The results differ JUST because I passed that argument in as True. (Reminder: both models were trained with the same fixed seed, the same training dataset input, and the same hyperparameters, i.e., the training runs are deterministic, as per the TL;DR.)

I expected that explicitly setting training=True would NOT produce different output, since we should already be in training mode, hopefully determined correctly by the framework.

0 reactions
azagniotov commented, Feb 17, 2022

Hi @ramakumar1729 , thank you. Did you have a chance to take a look?
