Training the model for the VarMisuse task
Hey! I tried to run training of the VarMisuse model to explore how it works on data from unseen projects. I have a few questions about it:
- The dataset format seems to have changed compared to the published version of the data. I found the following issue in another repository. Unfortunately, I had already reorganized the data before finding that issue: I converted the JSON files into jsonlines and changed the structure from `project/{train|test|valid}/files` to `{train|test|valid}/files`. It would be nice to either duplicate the reorganizing script in this repo or add a link to the issue in the README.
- After reorganizing the data, I tried to run training with the default settings (minibatch size = 300) on an instance with 94 GB RAM and 48 CPUs. The instance doesn't have a GPU because I wanted to measure memory usage first, so that I could allocate a properly sized GPU instance afterward. Unfortunately, training fails with an OOM error: it quickly uses all 94 GB and asks for more. I also tried to create a smaller version of the dataset by picking only one project each for train/validation/test, and it didn't really help: with a minibatch size of 100 and a single project in the train split I still got OOM. Is this expected behavior?
- Which instance do you recommend for training the model? In particular, how much RAM do I need, and how long does training take on, say, a V100?
- Do you have a pre-trained model that you can share? Maybe I can avoid training altogether and just run the already-trained model on different data.
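For reference, the reorganization described in the first bullet can be sketched roughly as follows. This is a minimal sketch under assumptions: it assumes each source `.json` file holds a list of samples, and the output naming scheme (`project__file.jsonl`) is hypothetical; the actual converter from the linked issue may differ.

```python
import json
from pathlib import Path


def reorganize(src_root: str, dst_root: str) -> None:
    """Flatten project/{train|valid|test}/*.json into
    {train|valid|test}/<project>__<file>.jsonl, one sample per line."""
    for project_dir in Path(src_root).iterdir():
        if not project_dir.is_dir():
            continue
        for split in ("train", "valid", "test"):
            split_dir = project_dir / split
            if not split_dir.is_dir():
                continue
            out_dir = Path(dst_root) / split
            out_dir.mkdir(parents=True, exist_ok=True)
            for json_file in split_dir.glob("*.json"):
                # Assumption: each .json file contains a JSON list of samples.
                samples = json.loads(json_file.read_text())
                out_path = out_dir / f"{project_dir.name}__{json_file.stem}.jsonl"
                with out_path.open("w") as f:
                    for sample in samples:
                        f.write(json.dumps(sample) + "\n")
```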
Thanks a lot in advance, and thanks for the great projects and papers!
Issue Analytics
- Created 3 years ago
- Comments: 6 (4 by maintainers)
@mallamanis thanks a lot for the lightning-fast reply!
The convert script is very similar to mine. I will try the subtoken model and report the results.
I just ran this on the CPU and I can replicate the issue. I assume the problem is that PyTorch fuses some operations on the GPU but not on the CPU for the character CNN. If you change the model to use subtokens (change `"char"` to `"subtoken"` here), the problem goes away. The performance of the subtoken and char models is fairly similar, so this might be good enough for now. I'll try to investigate why the char CNN performs so badly on CPU, hopefully next week.