Understanding batch_T
Hi,

The sampler documentation describes it as: batch_T (int) – number of time-steps per sample batch. I don't understand the effect of `batch_T` in samplers, and I see another `batch_T` in R2D1 too. What is the difference between them, how are they related, and how should these two values be set? The same question applies to the `batch_B` values of R2D1 and of its sampler.

I want to understand the effect of this parameter, `batch_T`, especially in recurrent algorithms such as R2D1 and PPO_LSTM. Does it affect the memory/history information that the LSTM can learn or memorize? Based on the code, the agent trains the LSTM on trajectories of length `batch_T`, so it could limit the time horizon over which the network can memorize information. Should it therefore be set to the average trajectory length of each environment?

Thank you
Top GitHub Comments
Hi, good questions!
To clarify some earlier questions: in the policy gradient algorithms, like PPO, there are only the sampler's `batch_T` and `batch_B`, and whatever the sampler returns in one iteration forms the minibatch for the algorithm. In replay-based algorithms like DQN, the sampler's `batch_T` and `batch_B` keep the same meaning, the amount of data collected per iteration, but these algorithms also have their own `batch_size` (or, in the case of R2D1, `batch_T` and `batch_B`) to determine how much data is replayed from the buffer for each training minibatch.
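To make the distinction concrete, here is a minimal configuration sketch using rlpyt-style imports and keyword names. The specific classes, game, and numbers below are illustrative assumptions rather than recommended settings, and defaults can differ between versions.

```python
# Sketch: the sampler's batch_T/batch_B control how much data is *collected*
# per iteration, while R2D1's batch_T/batch_B control how much data is
# *replayed* per training minibatch.  Import paths follow rlpyt's layout,
# but treat this as illustrative rather than canonical.
from rlpyt.samplers.serial.sampler import SerialSampler
from rlpyt.envs.atari.atari_env import AtariEnv
from rlpyt.algos.dqn.r2d1 import R2D1
from rlpyt.agents.dqn.atari.atari_r2d1_agent import AtariR2d1Agent
from rlpyt.runners.minibatch_rl import MinibatchRl

sampler = SerialSampler(
    EnvCls=AtariEnv,
    env_kwargs=dict(game="pong"),
    batch_T=40,   # time-steps collected per environment per sampler iteration
    batch_B=16,   # number of parallel environment instances
)
algo = R2D1(
    batch_T=80,    # length of each replayed sequence = LSTM BPTT horizon
    batch_B=64,    # number of sequences per training minibatch
    warmup_T=40,   # extra leading steps used only to warm up the RNN state
)
agent = AtariR2d1Agent()
runner = MinibatchRl(algo=algo, agent=agent, sampler=sampler, n_steps=50e6)
runner.train()
```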
Regarding `done=True` for multiple time-steps: yes, that is because when an environment episode ends during sampling, the environment might not reset until the beginning of the following sampling batch, so that the start of an episode aligns with the interval for storing the RNN state. In the meantime, all the (dummy) data from the inactive environment still gets written to the replay buffer. Populating `done=True` for all those steps makes it obvious where the new episode actually begins in the buffer, which is the first new step where `done=False`. And if you look at the `valid_from_done()` function, which generates the mask for the RNN, it masks out all data after the first `done=True`, so it is fine to have more `done=True` steps after that. Kind of a long explanation, but does that make sense?
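Here is a minimal sketch of the masking idea behind `valid_from_done()`; it is an illustrative re-implementation, so the exact rlpyt source may differ slightly.

```python
# Everything after the first done=True in each sequence is marked invalid,
# so trailing dummy steps that also carry done=True are simply ignored by
# the training loss.
import torch

def valid_from_done(done):
    """done: tensor of shape [T, B]; returns a float mask of shape [T, B]
    that is 1 up to and including the first done=True step, 0 afterwards."""
    done = done.type(torch.float)
    valid = torch.ones_like(done)
    # Cumulative count of past dones, clamped so repeated done=True stays at 1.
    valid[1:] = 1 - torch.clamp(torch.cumsum(done[:-1], dim=0), max=1)
    return valid

done = torch.tensor([[0.], [0.], [1.], [1.], [1.]])  # episode ends at t=2
print(valid_from_done(done).squeeze(-1))             # tensor([1., 1., 1., 0., 0.])
```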
@bmazoure The discrepancy in the length of the observations returned is because they also include the target observations, which extend out to n steps past the agent observations, for the n-step returns: https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/replays/sequence/n_step.py#L88
Then inside the R2D1 algorithm, the one copy of the whole observation set is moved to the GPU once, and sliced views into that data are created for the agent inputs and target inputs. The R2D1 default n_step_return is 5, so that should add up. Sorry, that's a tricky one!
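For reference, a rough sketch of that slicing (warmup steps omitted); the shapes and names are assumptions for illustration, not the exact rlpyt code.

```python
# The replayed sequence carries n_step extra observations at the end; the
# agent and target inputs are offset views into the same tensor, so only one
# copy of the observations needs to live on the GPU.
import torch

batch_T, batch_B, n_step = 80, 64, 5
obs_shape = (4, 84, 84)
all_observation = torch.zeros(batch_T + n_step, batch_B, *obs_shape)

agent_obs = all_observation[:batch_T]     # steps t = 0 .. batch_T-1
target_obs = all_observation[n_step:]     # steps t = n_step .. batch_T+n_step-1
print(agent_obs.shape, target_obs.shape)  # both: [80, 64, 4, 84, 84]
```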
Hi! That is correct, the environment state carries forward to the next sampling batch. The environment only resets when an episode finishes, even if the sampler's `batch_T` is much shorter than that. So the sampler's `batch_T` should have little to no effect on training, whereas the algorithm's `batch_T` can have a large effect, because it is the length of the LSTM backprop-through-time during training. Hope that helps!
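As a rough illustration of why the algorithm's `batch_T` is the one that matters, here is a generic PyTorch sketch (not rlpyt code; shapes and names are assumptions) showing that backprop-through-time only spans the `batch_T` steps unrolled in a training minibatch.

```python
# Truncated BPTT: gradients flow only through the T = batch_T unrolled steps,
# which bounds how far back in time the LSTM can learn to attribute credit.
import torch
import torch.nn as nn

T, B, obs_dim, hidden = 80, 64, 16, 128   # T plays the role of the algorithm's batch_T
lstm = nn.LSTM(obs_dim, hidden)
head = nn.Linear(hidden, 1)

obs = torch.randn(T, B, obs_dim)          # one replayed sequence minibatch [T, B, ...]
target = torch.randn(T, B, 1)
init_state = None                         # or a stored / warmed-up RNN state

out, _ = lstm(obs, init_state)            # unrolled over exactly T steps
loss = ((head(out) - target) ** 2).mean()
loss.backward()                           # gradient spans only these T steps
```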