Setup of a large dataset

Describe the bug

I can’t manage to “Finish setup” of a large dataset. The first training never ends. I think it has something to do with the fact that the file is almost 2GB, with >1M papers.

What happened? And what did you expect to happen?

When I click FINISH on the FINISH SETUP screen, the setup stalls and stays in that state indefinitely. (The two screenshots from the original issue are not reproduced here.)

To Reproduce

Steps to reproduce the problem:

  1. Upload a large file
  2. Run the first training

Version information

  • OS: Debian
  • ASReview version 0.17.1

Additional context

Working with such a large file has been problematic. To upload it, I had to copy the file directly to the server and then run a short Python snippet to add the project and its dataset manually:

# use the project utility functions from the ASReview web app
import asreview.webapp.utils.project as proj
# assign the necessary variables
info = ['proj_id', ...]
data = 'loc_to_data'
# add the project to ASReview
proj.init_project(*info)
# manually add the dataset to the project
proj.add_dataset_to_project(info[0], data)

Before using the current setup, the VM was too weak and would give me a “low memory” error, but now with a stronger VM it simply stops training and doesn’t even warn me that it had an error.

By running “htop”, I see that the CPU activity stops after about 30 seconds.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
ottomattas commented, Sep 20, 2021

Thanks for the confirmation and feedback. As for the memory consumption: I was happily running the workload on a laptop with 16GB of available memory. The process did not exhaust my memory, and I was able to keep running my other applications as usual. So if you have a bigger dataset, you can also run it on lower-spec hardware if you have a little more patience. No need to spend a lot of money on just the setup.

I’m glad it worked out. Always happy to collaborate and see a problem get solved! I hope you can get some amazing and relevant results. Happy hunting! 😃

0 reactions
tlaso commented, Sep 20, 2021

You are right about the size of the sample. We will, for sure, learn that lesson the hard way.

I had to increase the RAM to 50GB before the initial training would finish. Memory was still the bottleneck: usage peaked at 48GB. The CSV file that I shared the second time, with the 1.1M papers, worked just fine.

The problem lay on my side. Thank you for your help.
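
The figures reported in this comment allow a rough back-of-envelope estimate of peak memory per record. The following is a minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and, as a strong simplification, that peak memory scales linearly with the number of records. The thread does not establish that relationship, and the other commenter reports finishing on a 16GB laptop, so the result is at best an order-of-magnitude guide. Both function names are hypothetical.

```python
def peak_kb_per_record(peak_gb, n_records):
    """Average peak memory per record in kilobytes (1 GB = 10**9 bytes)."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return peak_gb * 1e9 / n_records / 1e3


def estimate_peak_gb(n_records, kb_per_record):
    """Extrapolate peak memory in GB, ASSUMING linear scaling with record count."""
    return n_records * kb_per_record * 1e3 / 1e9


# Figures reported in the thread: 48GB peak usage for 1.1M papers.
kb = peak_kb_per_record(48, 1_100_000)
print(f"{kb:.1f} KB per record")  # prints "43.6 KB per record"
```

Under the same linearity assumption, `estimate_peak_gb` turns a record count and a per-record figure back into a total, which may help when sizing a VM before the first training run.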
