Setup of a large dataset

Describe the bug

I can’t manage to “Finish setup” of a large dataset. The first training never ends. I think it has something to do with the fact that the file is almost 2GB, with >1M papers.

What happened? And what did you expect to happen?

When I click FINISH on the FINISH SETUP screen, it stalls and stays that way indefinitely.

To Reproduce

Steps to reproduce the problem:

1. Upload a large file
2. Run the first training

Version information

OS: Debian
ASReview version 0.17.1

Additional context

Working with such a large file has been problematic. To upload it, I had to copy the file directly to the server and then run a few lines of Python to add the project and its dataset manually:

# use the utils functions
import asreview.webapp.utils.project as proj
# assign the necessary variables
info = ['proj_id', ...]
data = 'loc_to_data'
# add the project to asreview
proj.init_project(*info)
# manually add the dataset
proj.add_dataset_to_project(info[0], data)

Before using the current setup, the VM was too weak and would give me a “low memory” error, but now with a stronger VM it simply stops training and doesn’t even warn me that it had an error.

By running “htop”, I see that the CPU activity stops after about 30 seconds.
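
One way to check whether a step like this stops because of memory is to record its peak memory use. Below is a minimal sketch using Python's standard `tracemalloc` module. `build_features` is a hypothetical stand-in for the real workload, not an ASReview function, and `tracemalloc` only sees allocations made through Python's allocators, so treat the number as a lower bound.

```python
import tracemalloc


def run_with_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) as traced by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_features(n_rows):
    # Hypothetical stand-in for an expensive step: one value per record.
    return [float(i) for i in range(n_rows)]


if __name__ == "__main__":
    rows, peak = run_with_peak_memory(build_features, 100_000)
    print(f"{len(rows)} rows, peak ~{peak / 1024**2:.1f} MiB")
```

If the operating system kills the process outright, no Python-level measurement will be reported, so system logs are worth checking as well.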

Issue Analytics

Created 2 years ago
Comments: 9 (6 by maintainers)

Top GitHub Comments

ottomattas commented, Sep 20, 2021

Thanks for the confirmation and feedback. As for memory consumption: I was happily running the workload on a laptop with 16GB of available memory. The process did not exhaust my memory, and I was able to keep running my other applications as usual. So if you have a bigger dataset, you can also process it on lower-spec hardware with a little more patience. No need to spend a lot of money on just the setup.

I’m glad it worked out. Always happy to collaborate and see a problem get solved! I hope you can get some amazing and relevant results. Happy hunting! 😃

tlaso commented, Sep 20, 2021

You are right about the size of the sample. We will, for sure, learn that lesson the hard way.

I had to increase the RAM to 50GB for the initial training to finish. Memory was still the bottleneck: usage peaked at 48GB. The CSV file that I shared the second time, with the 1.1M papers, worked just fine.

The problem lay on my side. Thank you for your help.
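
For readers sizing a machine for a similar dataset, a back-of-the-envelope estimate can help. The sketch below computes the size of a dense matrix with one row per paper. The feature counts are hypothetical, and whether the features are stored densely at all depends on the configured feature extractor, so this illustrates the arithmetic only and does not describe what ASReview allocates.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense n_rows x n_cols matrix (8 bytes = float64)."""
    return n_rows * n_cols * bytes_per_value / 1024**3


if __name__ == "__main__":
    n_papers = 1_100_000  # dataset size mentioned in the thread
    for n_features in (1_000, 5_000):  # hypothetical feature counts
        size = dense_matrix_gib(n_papers, n_features)
        print(f"{n_features} features: ~{size:.1f} GiB")
```

Memory grows linearly with both the number of papers and the number of features, so halving either one halves the estimate.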