
[FEA] Data loader: support to padding sparse sequential features on the left side

See original GitHub issue

Is your feature request related to a problem? Please describe. The PyT and TF dataloaders support padding list (sparse) features on the right, which means that shorter list sequences are completed with 0s on the right. For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either in preprocessing or on the model side. The NVT Slice op supports truncating to the last N elements (by providing negative limits). But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold with the Slice op, then tuning the best max sequence length according to model accuracy and training speed). To do such truncation on the model side, the padding needs to be applied by the dataloader on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left. Features could then be sliced in the model like feature[:, -keep_last_n:] without losing the sequence features of users with fewer than N interactions.
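To make the motivation concrete, here is a hedged toy sketch (plain NumPy, illustrative values — not the dataloader's actual API) of why that slice only behaves correctly with left padding:

```python
import numpy as np

# Hypothetical toy batch: two users, max sequence length 5, keep_last_n = 2.
# One user has 4 interactions, the other only 1.
keep_last_n = 2

# Right-padded: zeros go at the end of each row.
right_padded = np.array([[10, 20, 30, 40, 0],
                         [50,  0,  0,  0, 0]])

# Left-padded: zeros go at the start of each row.
left_padded = np.array([[0, 10, 20, 30, 40],
                        [0,  0,  0,  0, 50]])

# Slicing the last N columns of right-padded data drops real items and
# returns only padding for short sequences...
print(right_padded[:, -keep_last_n:])  # [[40  0], [ 0  0]] -- user 2's item 50 is lost

# ...while on left-padded data it keeps the most recent interactions.
print(left_padded[:, -keep_last_n:])   # [[30 40], [ 0 50]]
```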

Describe the solution you’d like Create an argument for the dataloader, sparse_padding_side, which by default is right but can be set to left.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
gabrielspmoreira commented, Sep 17, 2021

Hey Adam. The user-facing class is the DataLoader: in PyTorch it is the TorchAsyncItr class, and in TF it is KerasSequenceLoader. We should create an argument sparse_padding_side, which should accept ‘right’ (default) or ‘left’.

1 reaction
gabrielspmoreira commented, Sep 16, 2021

Hey @lesnikow. Some time ago I created a preliminary version of the code that converts the internal NVTabular representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported and integrated it into the NVTabular dataloader.

To give an example, let’s say a column in a parquet file has sequence/list values, with 3 rows like this:

[10,20]
[30]
[40,50,60]

The internal representation of NVTabular (values, offsets) would be something like the following, where the offsets inform how many values we have for each row:

values = [10,20,30,40,50,60]
offsets = [2,1,3]

Then the equivalent sparse matrix can be built with values and indices like this:

values = [10,20,30,40,50,60]
indices = [[0,0],
           [0,1],
           [1,0],
           [2,0],
           [2,1],
           [2,2]]
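As a hedged sketch in plain NumPy (not the actual NVTabular code), the (row, column) index pairs can be derived from the offsets like this:

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50, 60])
offsets = np.array([2, 1, 3])  # number of values in each row

# Row id of each value: repeat each row index by that row's length.
row_ids = np.repeat(np.arange(len(offsets)), offsets)

# Column id of each value: its position within its own row.
row_starts = np.concatenate(([0], np.cumsum(offsets)[:-1]))
col_ids = np.arange(len(values)) - np.repeat(row_starts, offsets)

indices = np.stack([row_ids, col_ids], axis=1)
print(indices.tolist())
# [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2]]
```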

Finally the sparse tensor is converted to a dense tensor in this line, which is padded on the right. In this example I assume seq_limit=5 for this feature:

[10, 20,  0,  0, 0]
[30,  0,  0,  0, 0]
[40, 50, 60,  0, 0]
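A minimal sketch (plain NumPy, illustrative names) of how that right-padded dense tensor follows from the values and indices:

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50, 60])
indices = np.array([[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2]])
seq_limit = 5

# Start from zeros and scatter each value into its (row, col) slot;
# columns beyond each row's length stay 0, i.e. the padding is on the right.
dense = np.zeros((3, seq_limit), dtype=values.dtype)
dense[indices[:, 0], indices[:, 1]] = values
print(dense.tolist())
# [[10, 20, 0, 0, 0], [30, 0, 0, 0, 0], [40, 50, 60, 0, 0]]
```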

To pad the items on the left, I believe we just need to shift the column indices of the sparse matrix right by (seq_limit - row_length) for each row (a simple mirror like seq_limit - 1 - index would also reverse the order of items within each row), so that the indices become

indices = [[0,3],
           [0,4],
           [1,4],
           [2,2],
           [2,3],
           [2,4]]

From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value. So maybe including code like this after this line (guarded by the new sparse_padding_side == 'left' option) can do the trick 😉

indices[:,1] = indices[:,1] + seq_limit - row_lengths

where row_lengths holds, for each value, the length of the list it belongs to.
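As a hedged, runnable sketch of the left-padding shift (plain NumPy, names illustrative): each value's column is moved right by seq_limit minus the length of its row, which left-pads without reversing the within-row order.

```python
import numpy as np

# Toy data from the example above (names are illustrative, not NVTabular's API).
values = np.array([10, 20, 30, 40, 50, 60])
offsets = np.array([2, 1, 3])           # list length of each row
indices = np.array([[0, 0], [0, 1],     # (row, col) pairs, columns starting at 0
                    [1, 0],
                    [2, 0], [2, 1], [2, 2]])
seq_limit = 5

# For each value, the length of the row it belongs to.
row_len_per_value = np.repeat(offsets, offsets)   # [2, 2, 1, 3, 3, 3]

# Shift columns right by (seq_limit - row_length): padding ends up on the left.
left_indices = indices.copy()
left_indices[:, 1] += seq_limit - row_len_per_value

dense = np.zeros((len(offsets), seq_limit), dtype=values.dtype)
dense[left_indices[:, 0], left_indices[:, 1]] = values
print(dense.tolist())
# [[0, 0, 0, 10, 20], [0, 0, 0, 0, 30], [0, 0, 40, 50, 60]]
```

With this layout, dense[:, -keep_last_n:] keeps the most recent interactions for every user, including those with fewer than N interactions.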

If we currently don’t have tests for those dataloader methods that convert the offsets representation to sparse and dense features, it would be good to create such tests, using something similar to what I have described here as a test case.
