[FEA] Data loader: support to padding sparse sequential features on the left side

Is your feature request related to a problem? Please describe.

The PyTorch and TensorFlow data loaders pad list (sparse) features on the right, which means that shorter sequences are filled with 0s on the right. For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either in preprocessing or on the model side. The NVT Slice op supports truncating to the last N elements (by providing negative limits). But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold in the Slice op, then tuning the best max sequence length according to model accuracy and training speed). To do such truncation on the model side, the data loader needs to apply the padding on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left. Features could then be sliced in the model like feature[:, -keep_last_n:] without losing the sequence features of users with fewer than N interactions.
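To make the motivation concrete, here is a minimal sketch of what the slice does under each padding side. Plain nested lists stand in for dense tensors, and keep_last is a hypothetical helper that mimics feature[:, -keep_last_n:]; neither is part of NVTabular.

```python
# Plain nested lists stand in for dense tensors; 0 is the padding value.
right_padded = [
    [10, 20, 0, 0, 0],
    [30, 0, 0, 0, 0],
    [40, 50, 60, 0, 0],
]
left_padded = [
    [0, 0, 0, 10, 20],
    [0, 0, 0, 0, 30],
    [0, 0, 40, 50, 60],
]


def keep_last(batch, keep_last_n):
    """List equivalent of the tensor slice feature[:, -keep_last_n:]."""
    return [row[-keep_last_n:] for row in batch]


# With right padding the slice only sees padding for these short rows.
sliced_right = keep_last(right_padded, 2)
# With left padding the slice keeps the most recent items of every row.
sliced_left = keep_last(left_padded, 2)
```

With right padding every row here collapses to padding only, while with left padding the last two interactions of each user survive.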

Describe the solution you’d like

Add a sparse_padding_side argument to the data loader, which defaults to right but can be set to left.
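As an illustration only, the requested behaviour could look like the sketch below. pad_rows is a hypothetical helper, not the NVTabular API; only the argument name sparse_padding_side and its two values come from the request. Truncating long rows to their last seq_limit items is an assumption made here for completeness.

```python
def pad_rows(rows, seq_limit, sparse_padding_side="right"):
    """Pad each list with zeros up to seq_limit on the requested side.

    Hypothetical helper: rows longer than seq_limit are cut to their
    last seq_limit items (an assumption, not documented behaviour).
    """
    if sparse_padding_side not in ("right", "left"):
        raise ValueError("sparse_padding_side must be 'right' or 'left'")
    if seq_limit < 1:
        raise ValueError("seq_limit must be a positive integer")

    padded = []
    for row in rows:
        row = list(row)[-seq_limit:]
        fill = [0] * (seq_limit - len(row))
        padded.append(row + fill if sparse_padding_side == "right" else fill + row)
    return padded


rows = [[10, 20], [30], [40, 50, 60]]
default_padding = pad_rows(rows, seq_limit=5)
left_padding = pad_rows(rows, seq_limit=5, sparse_padding_side="left")
```

Keeping right as the default means existing callers would see no change in behaviour.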

Top GitHub Comments

gabrielspmoreira commented, Sep 17, 2021

Hey Adam. The user-facing class is the DataLoader. For example, in PyTorch it is the TorchAsyncItr class and in TF it is KerasSequenceLoader. We should create an argument sparse_padding_side, which should accept ‘right’ (default) or ‘left’.

gabrielspmoreira commented, Sep 16, 2021

Hey @lesnikow. Some time ago I wrote a preliminary version of the code that converts NVTabular's internal representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported it and integrated it into the NVTabular dataloader.

To give an example, let’s say a column in a parquet file holds sequence/list values, with 3 rows like this:

[10,20]
[30]
[40,50,60]

NVTabular's internal representation (values, offsets) would be something like the following, where the offsets indicate how many values each row has:

values = [10,20,30,40,50,60]
offsets = [2,1,3]
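As a small sketch of how this representation maps back to rows (assuming, as in the example, that the offsets hold per-row counts; the name row_lengths is introduced here only for clarity):

```python
from itertools import accumulate

values = [10, 20, 30, 40, 50, 60]
row_lengths = [2, 1, 3]  # the "offsets" of the example: number of values per row

# Cumulative boundaries: row i spans values[starts[i]:starts[i + 1]].
starts = [0] + list(accumulate(row_lengths))
rows = [values[starts[i]:starts[i + 1]] for i in range(len(row_lengths))]
```

Here starts evaluates to [0, 2, 3, 6] and rows recovers the three original lists.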

Then the equivalent sparse matrix can be built with values and indices like this:

values = [10,20,30,40,50,60]
indices = [[0,0],
           [0,1],
           [1,0],
           [2,0],
           [2,1],
           [2,2]]
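The indices can be derived from the per-row counts alone. The sketch below uses a hypothetical build_indices helper to show the idea; it is not NVTabular's own _get_indices() implementation.

```python
def build_indices(row_lengths):
    """Return one [row, position] pair per value, in row-major order."""
    return [
        [row, pos]
        for row, length in enumerate(row_lengths)
        for pos in range(length)
    ]


indices = build_indices([2, 1, 3])
```

For the example counts [2, 1, 3] this reproduces the six index pairs listed above.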

Finally, in this line the sparse tensor is converted to a dense tensor, which is padded on the right. In this example I assume seq_limit=5 for this feature:

[10, 20,  0,  0,  0]
[30,  0,  0,  0,  0]
[40, 50, 60,  0,  0]
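The sparse-to-dense step amounts to scattering each value into a zero-filled matrix at its [row, column] index. A minimal pure-Python sketch follows; to_dense is a hypothetical stand-in for the framework call, and lists replace tensors.

```python
def to_dense(values, indices, n_rows, seq_limit):
    """Scatter values into an n_rows x seq_limit matrix of zeros."""
    dense = [[0] * seq_limit for _ in range(n_rows)]
    for value, (row, col) in zip(values, indices):
        dense[row][col] = value
    return dense


values = [10, 20, 30, 40, 50, 60]
indices = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2]]
dense = to_dense(values, indices, n_rows=3, seq_limit=5)
```

Because every column index starts at 0 and grows to the right, the untouched zeros end up on the right, which is exactly the right-padded result above.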

To pad the items on the left, I believe we just need to subtract the 2nd column of the sparse matrix indices from seq_limit - 1, so that it becomes:

indices = [[0,4],
           [0,3],
           [1,4],
           [2,4],
           [2,3],
           [2,2]]

From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value. So maybe including this code after this line (if padding_direction==True) can do the trick 😉

indices[:,1] = seq_limit - 1 - indices[:,1]
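Working the example through by hand suggests one thing to check: the subtraction does move the padding to the left, but it also mirrors the order of the items inside each row. The sketch below compares it with an alternative that shifts each column by seq_limit minus the row length, which keeps the original order. The shift variant and the scatter helper are assumptions for illustration, not code from NVTabular.

```python
seq_limit = 5
values = [10, 20, 30, 40, 50, 60]
row_lengths = [2, 1, 3]
indices = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2]]


def scatter(values, indices, n_rows, seq_limit):
    """Place each value at its [row, col] position in a zero matrix."""
    dense = [[0] * seq_limit for _ in range(n_rows)]
    for value, (row, col) in zip(values, indices):
        dense[row][col] = value
    return dense


# Proposed transformation: col -> seq_limit - 1 - col (mirrors each row).
mirrored = [[row, seq_limit - 1 - col] for row, col in indices]

# Alternative (assumption): shift each row right by its amount of padding.
shifted = [[row, col + seq_limit - row_lengths[row]] for row, col in indices]

dense_mirrored = scatter(values, mirrored, 3, seq_limit)
dense_shifted = scatter(values, shifted, 3, seq_limit)
```

On this input the mirrored rows come out as [0, 0, 0, 20, 10] and [0, 0, 60, 50, 40], while the shifted rows are [0, 0, 0, 10, 20] and [0, 0, 40, 50, 60]. If the model relies on chronological order, the shifted form may be the safer choice.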

If we currently don’t have tests for the data loader methods that convert the offset representation to sparse and dense features, it would be good to create such tests, using something similar to what I have described here as the test case.
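A test along those lines might look like the following sketch. offsets_to_dense is a hypothetical pure-Python reference, not the data loader itself; a real test would call the loader and compare its output against the same expected matrices. It assumes no row is longer than seq_limit.

```python
def offsets_to_dense(values, row_lengths, seq_limit, side="right"):
    """Reference conversion from (values, per-row counts) to padded rows."""
    dense, start = [], 0
    for length in row_lengths:
        row = values[start:start + length]
        start += length
        fill = [0] * (seq_limit - length)  # assumes length <= seq_limit
        dense.append(row + fill if side == "right" else fill + row)
    return dense


VALUES = [10, 20, 30, 40, 50, 60]
ROW_LENGTHS = [2, 1, 3]


def test_right_padding():
    expected = [[10, 20, 0, 0, 0], [30, 0, 0, 0, 0], [40, 50, 60, 0, 0]]
    assert offsets_to_dense(VALUES, ROW_LENGTHS, 5, side="right") == expected


def test_left_padding():
    expected = [[0, 0, 0, 10, 20], [0, 0, 0, 0, 30], [0, 0, 40, 50, 60]]
    assert offsets_to_dense(VALUES, ROW_LENGTHS, 5, side="left") == expected


test_right_padding()
test_left_padding()
```

The two functions follow the pytest naming convention, so they could be collected as-is once the reference helper is swapped for the real loader.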