Increased sample size makes train very slow

After increasing sample size (as in https://github.com/dedupeio/dedupe/commit/da29f24823c4066b81f558fb8794452a04a8b15a), training has become extremely slow; in particular it gets stuck at:

comparison_count = self.comparisons(self.total_cover, compound_length)

I stopped execution after more than 20 minutes without a result, whereas the entire run previously took about 30 seconds with the exact same data and the previous sample sizes (200 for RecordLink, 900 for Dedupe). The messy data has about 2000 rows and the canonical data about 6000 rows; I am linking on three fields (address, name, and area code, all declared as Strings).

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

Davidmp11 commented, Jul 19, 2019 (2 reactions)

I had the same problem. I solved it by replacing the NULL values in my database with empty fields.
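A minimal sketch of that workaround, assuming the records are already loaded into a dict of dicts keyed by record ID and that database NULLs arrive as Python `None`. The helper name `blank_out_nulls` and the sample rows are hypothetical; only the field names (name, address, area code) come from the issue. It uses plain Python and no dedupe API, and whether empty strings or `None` is the right representation of missing data may depend on the dedupe version in use.

```python
def blank_out_nulls(records):
    """Return a copy of `records` with every None field value replaced by ''.

    `records` is assumed to map record IDs to {field_name: value} dicts.
    The input is left unmodified.
    """
    return {
        record_id: {
            field: ("" if value is None else value)
            for field, value in row.items()
        }
        for record_id, row in records.items()
    }


# Hypothetical sample rows; None stands in for a database NULL.
messy = {
    1: {"name": "Acme Corp", "address": None, "area_code": "312"},
    2: {"name": None, "address": "12 Main St", "area_code": "773"},
}

cleaned = blank_out_nulls(messy)
```

The cleaned dict can then be passed to training in place of the original data.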

fgregg commented, Dec 28, 2017 (0 reactions)

Could you check this again?
