Training the model for varmisuse task


Hey! I tried to run training of the varmisuse model in order to explore how it works on data from unseen projects. I have a few questions regarding it:

  1. The dataset format seems to have changed compared to the published version of the data. I found an issue about this in another repository, but only after I had already reorganized the data myself: I converted the json files into jsonlines and changed the structure from project/{train|test|valid}/files to {train|test|valid}/files. It would be nice to either copy the reorganizing script into this repo or add a link to that issue in the README.
  2. After reorganizing the data, I tried to run training with the default settings (minibatch size = 300) on an instance with 94 GB RAM and 48 CPUs. The instance has no GPU because I wanted to measure memory usage first, so that I could allocate a suitable GPU instance afterward. Unfortunately, training fails with an OOM error: it quickly uses all 94 GB and asks for more. I also tried a smaller version of the dataset with only 1 project each in train/validation/test, and it didn’t really help: with a minibatch size of 100 and a single project in the train part I still got OOM. Is this expected behavior?
  3. Which instance do you recommend for training the model? In particular, how much RAM do I need, and how long does training take on, say, a V100?
  4. Do you have a pre-trained model that you can share? Maybe I can avoid training altogether and just run the already trained model on different data.

Thanks a lot in advance, and thanks for the great projects and papers!
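
For readers facing the same format mismatch, the reorganization described in item 1 could be sketched roughly as follows. This is not the script from either repository. It assumes that each input `.json` file holds a single JSON array of samples and that the target layout is one `.jsonl` file per input file under `{train|valid|test}/`. The function name `reorganize`, the split names, and the `project__file` naming scheme are all hypothetical, and compressed inputs are not handled.

```python
import json
from pathlib import Path

# Assumed split directory names; adjust to match the actual dataset.
SPLITS = ("train", "valid", "test")


def reorganize(src_root: Path, dst_root: Path) -> int:
    """Flatten project/{split}/*.json into {split}/*.jsonl.

    Returns the number of files written. dst_root should be a
    directory separate from src_root.
    """
    written = 0
    project_dirs = sorted(p for p in src_root.iterdir() if p.is_dir())
    for project_dir in project_dirs:
        for split in SPLITS:
            split_dir = project_dir / split
            if not split_dir.is_dir():
                continue
            out_dir = dst_root / split
            out_dir.mkdir(parents=True, exist_ok=True)
            for json_file in sorted(split_dir.glob("*.json")):
                # Assumption: the file is one JSON array of samples.
                with json_file.open(encoding="utf-8") as f:
                    samples = json.load(f)
                # Prefix with the project name so that files from
                # different projects do not overwrite each other.
                out_name = f"{project_dir.name}__{json_file.stem}.jsonl"
                with (out_dir / out_name).open("w", encoding="utf-8") as f:
                    for sample in samples:
                        # One JSON document per line (jsonlines).
                        f.write(json.dumps(sample) + "\n")
                written += 1
    return written
```

A call such as `reorganize(Path("data_by_project"), Path("data_by_split"))` (both paths are placeholders) loads one input file at a time, so peak memory during conversion is bounded by the largest single file rather than by the whole dataset.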

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:6 (4 by maintainers)

Top GitHub Comments

egor-bogomolov commented, Jun 26, 2020 (1 reaction)

@mallamanis thanks a lot for the lightning-fast reply!

The conversion script is very similar to mine. I will try the subtoken model and report the results.

mallamanis commented, Jun 26, 2020 (1 reaction)

I just ran this on the CPU and I can replicate the issue… I assume the problem is that PyTorch fuses some operations for the character CNN on the GPU but not on the CPU. If you change the model to use subtokens (change "char" to "subtoken" here), then the problem goes away.

The performance of the subtoken and char models is fairly similar, so this might be good enough for now. I’ll try to investigate why the charCNN performs so badly on CPU, hopefully next week…


