Setup of a large dataset
Describe the bug
I can’t manage to “Finish setup” of a large dataset. The first training never ends. I think it has something to do with the fact that the file is almost 2GB, with >1M papers.
What happened? And what did you expect to happen?
When I click FINISH in the FINISH SETUP screen, it stalls and stays in that state forever (screenshot omitted).
To Reproduce
Steps to reproduce the problem:
- Upload a large file
- Run the first training
Version information
- OS: Debian
- ASReview version 0.17.1
Additional context
The process of using such a large file has been problematic. To upload it, I had to send the file directly to the server and then run a small Python snippet to add the project with the dataset manually:
# use the project utility functions from the asreview web app
import asreview.webapp.utils.project as proj
# assign the necessary variables: the project info and the path to the dataset
info = ['proj_id', ...]
data = 'loc_to_data'
# add the project to asreview
proj.init_project(*info)
# manually add the dataset to the new project
proj.add_dataset_to_project(info[0], data)
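For datasets this size, a quick sanity check outside the web app can rule out parsing problems and show the in-memory footprint before starting the setup. This is a minimal sketch of my own (not from the original report), using pandas, which ASReview already depends on; "records.csv" stands in for the actual dataset path:
# hypothetical pre-flight check: confirm the CSV parses and see how much
# memory it occupies once loaded ("records.csv" is a placeholder path)
import pandas as pd

df = pd.read_csv("records.csv")
print(f"{len(df)} records")
print(f"~{df.memory_usage(deep=True).sum() / 1e9:.2f} GB held in memory by pandas")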
With my previous setup the VM was too weak and would give me a "low memory" error; now, with a stronger VM, the training simply stops without even warning me that an error occurred.
By running “htop”, I see that the CPU activity stops after about 30 seconds.
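To make this kind of silent stall easier to diagnose, one option is to poll the ASReview server process and log its CPU and memory use; a stall caused by memory pressure then shows up as flat CPU alongside a large resident set. A rough sketch of my own, assuming the third-party psutil package is installed and that the server's command line contains "asreview":
# hedged diagnostic sketch: watch the asreview server's CPU and memory use
import time
import psutil

def find_asreview_process():
    # assumes the server was started with "asreview" somewhere on its command line
    for p in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        if "asreview" in cmdline:
            return p
    return None

proc = find_asreview_process()
while proc is not None and proc.is_running():
    rss_mb = proc.memory_info().rss / 1e6
    cpu = proc.cpu_percent(interval=1)  # percent over a 1-second window
    print(f"cpu={cpu:.0f}%  rss={rss_mb:.0f} MB")
    time.sleep(4)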
Top GitHub Comments
Thanks for the confirmation and feedback. As for the memory consumption: I was happily running the workload on a laptop with 16GB of available memory. The process did not exhaust my memory, and I was able to keep running my other applications at the same time as usual. So even with a bigger dataset you can run the setup on a lower-spec machine, given a little more patience; there is no need to spend a lot of money just on the setup.
I’m glad it worked out. Always happy to collaborate and see a problem get solved! I hope you can get some amazing and relevant results. Happy hunting! 😃
You are right about the size of the sample. We will, for sure, learn that lesson the hard way.
I had to increase the RAM to 50GB for the initial training to finish. Memory was still the bottleneck: usage peaked at about 48GB. The CSV file that I shared the second time, with the 1.1M papers, worked just fine.
The problem was on my side. Thank you for your help.