[RMP] T4R fixes: MultiGPU data parallel training for next-item prediction and fixed serving

Problem:

We have customers who would like to use multi-GPU Transformers4Rec but are blocked by issues with our existing support for session-based models.

Goal:

Unblock customer use cases so they can try out T4R to give us feedback

Constraints:

We don’t yet have Torchscript support (which is out of scope this issue)

Starting Point:

Enable DataParallel / DistributedDataParallel training using HF Trainer for next-item prediction
- Next item prediction - https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/473 - DataParallel works if the model is wrapped manually by the user (i.e. model = torch.nn.DataParallel(model) for training, but that wrapping should happen automatically by the HF Trainer here
- https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/483
- https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/496#pullrequestreview-1131707972
- NVIDIA-Merlin/Transformers4Rec#492
Fix the serving sections of the existing T4R notebooks
- https://github.com/NVIDIA-Merlin/NVTabular/pull/1628
- https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/468
[Task] Add multi-GPU example for Transformer4Rec PyTorch (https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/508)
https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/526

Note: The multi-GPU training of the specific use cases of session binary classification / regression is addressed by RMP #708

Issue Analytics

State:
Created a year ago
Comments:6 (4 by maintainers)

Top GitHub Comments

2reactions

gabrielspmoreiracommented, Aug 15, 2022

@viswa-nvidia @EvenOldridge Me, @rnyak, @sararb and @nzarif met today about the issues related to DataParallel. We tested DataParallel for Next Item Prediction for one of the examples and it is not working, differently from what Sara found some weeks ago in another example. So we have both Next Item Prediction and Binary Classification not working with DataParallel currently. We have associated the issues for both in this RMP ticket description. Should we remove the scope of DataParallel from this RMP and create another RMP ticket focused in DataParallel support (targeted for release 22.09)?

0reactions

karlhigleycommented, Aug 16, 2022

I don’t think we should split the issue, let’s just target this for 22.09