[FEA] Data loader: support padding sparse sequential features on the left side
Is your feature request related to a problem? Please describe.
The PyT and TF data loaders support padding list (sparse) features on the right, which means that shorter list sequences are completed with 0s on the right.
For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either in preprocessing or on the model side. The NVT Slice op supports truncating to the last N elements (by providing negative limits).
But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold with the Slice op and then tuning the best max sequence length according to model accuracy and training speed). To do such truncation on the model side, the padding needs to be applied by the data loader on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left side. Thus, features could be sliced in the model like feature[:, -keep_last_n:] without losing the sequence features of users with fewer than N interactions.
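To make the motivation concrete, here is a minimal sketch with made-up data (two users, rows already padded to length 5). It uses plain Python lists instead of tensors, and the helper keep_last is a hypothetical stand-in for the slice feature[:, -keep_last_n:].

```python
# Hypothetical batch: user 1 has 5 interactions, user 2 has only 2.
# Both layouts are padded with 0s to a length of 5.
right_padded = [
    [1, 2, 3, 4, 5],
    [6, 7, 0, 0, 0],
]
left_padded = [
    [1, 2, 3, 4, 5],
    [0, 0, 0, 6, 7],
]


def keep_last(batch, keep_last_n):
    """Pure-Python equivalent of the tensor slice batch[:, -keep_last_n:]."""
    return [row[-keep_last_n:] for row in batch]


keep_last_n = 3
right_result = keep_last(right_padded, keep_last_n)
left_result = keep_last(left_padded, keep_last_n)

print(right_result)  # [[3, 4, 5], [0, 0, 0]] -> user 2 loses every interaction
print(left_result)   # [[3, 4, 5], [0, 6, 7]] -> user 2 keeps both interactions
```

With right padding the slice keeps only padding for the short sequence, while with left padding the most recent interactions survive.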
Describe the solution you’d like
Create an argument for the data loader, sparse_padding_side, which by default is right but can be set to left.
Top GitHub Comments
Hey Adam. The user-facing class is the DataLoader. For example, in PyTorch it is the TorchAsyncItr class and in TF it is KerasSequenceLoader. We should create an argument sparse_padding_side, which should accept ‘right’ (default) or ‘left’.
Hey @lesnikow. Some time ago I created a preliminary version of the code that converts the internal NVTabular representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported it and integrated it into the NVTabular data loader.
To give an example, let’s say a column in a parquet file has sequence/list values, with 3 rows.
The internal representation of NVTabular (values, offsets) stores the flattened values of all rows together with the offsets, which tell how many values we have for each row.
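As a rough illustration of the (values, offsets) layout, the sketch below flattens three made-up rows. The concrete numbers, the helper name to_values_offsets, and the convention that offsets ends with the total number of values are assumptions of this sketch, not taken from the NVTabular code.

```python
# Hypothetical list column with 3 rows of different lengths.
rows = [[10, 20], [30], [40, 50, 60]]


def to_values_offsets(rows):
    """Flatten list rows into (values, offsets).

    offsets[i] is the position in `values` where row i starts; the last
    entry is the total number of values (an assumption of this sketch).
    """
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


values, offsets = to_values_offsets(rows)
# The difference between consecutive offsets is the length of each row.
row_lengths = [end - start for start, end in zip(offsets, offsets[1:])]

print(values)       # [10, 20, 30, 40, 50, 60]
print(offsets)      # [0, 2, 3, 6]
print(row_lengths)  # [2, 1, 3]
```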
Then the equivalent sparse matrix can be built from the values and their (row, column) indices.
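The step from offsets to sparse-matrix indices could look roughly like the following. It reuses the made-up values from a three-row example; sparse_indices is a hypothetical helper written for this sketch and is not the NVTabular _get_indices() method.

```python
values = [10, 20, 30, 40, 50, 60]
offsets = [0, 2, 3, 6]  # assumed to include the trailing end offset


def sparse_indices(offsets):
    """Return one (row_id, col_id) pair per value, in the order of `values`."""
    indices = []
    for row_id, (start, end) in enumerate(zip(offsets, offsets[1:])):
        # Within a row, column ids simply count up from 0.
        for col_id in range(end - start):
            indices.append((row_id, col_id))
    return indices


indices = sparse_indices(offsets)
print(indices)  # [(0, 0), (0, 1), (1, 0), (2, 0), (2, 1), (2, 2)]
```

Each value in values pairs with the index at the same position, so 30 sits at (1, 0) and 60 at (2, 2).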
Finally, the sparse tensor is converted to a dense tensor in this line, which is padded on the right. In this example I assume seq_limit=5 for this feature.
To pad the items on the left, I believe we just need to subtract the 2nd column of the indices for the sparse matrix from the seq_limit.
From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value. So maybe including this code after this line (if padding_direction==True) can do the trick 😉
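The left-padding idea can be sketched as follows, with made-up values and seq_limit=5. The function to_dense and its padding_side parameter are hypothetical and only mimic the proposed sparse_padding_side argument. This sketch shifts each column index by seq_limit minus the row length, which is one way to keep the order within each row, and it assumes no row is longer than seq_limit.

```python
values = [10, 20, 30, 40, 50, 60]
offsets = [0, 2, 3, 6]  # assumed to include the trailing end offset
seq_limit = 5


def to_dense(values, offsets, seq_limit, padding_side="right"):
    """Scatter (values, offsets) into dense rows of length seq_limit, padded with 0s."""
    if padding_side not in ("right", "left"):
        raise ValueError("padding_side must be 'right' or 'left'")
    dense = [[0] * seq_limit for _ in range(len(offsets) - 1)]
    for row_id, (start, end) in enumerate(zip(offsets, offsets[1:])):
        row_length = end - start
        if row_length > seq_limit:
            raise ValueError("this sketch assumes rows no longer than seq_limit")
        # For left padding, move the whole row to the end of the dense row.
        shift = seq_limit - row_length if padding_side == "left" else 0
        for col_id in range(row_length):
            dense[row_id][col_id + shift] = values[start + col_id]
    return dense


right_dense = to_dense(values, offsets, seq_limit, "right")
left_dense = to_dense(values, offsets, seq_limit, "left")

print(right_dense)  # [[10, 20, 0, 0, 0], [30, 0, 0, 0, 0], [40, 50, 60, 0, 0]]
print(left_dense)   # [[0, 0, 0, 10, 20], [0, 0, 0, 0, 30], [0, 0, 40, 50, 60]]
```

Note that mapping each column index to seq_limit - 1 - col_id alone would also reverse the order inside a row (the first row would become [0, 0, 0, 20, 10]), which is why the shift here depends on the row length.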
If we currently don’t have tests for those data loader methods that convert the offset representation to sparse and dense features, it would be good to create such tests, using as a test case something similar to what I have described here.