
Active Learning sampling quality

See original GitHub issue

After using the dedupe library for a while in the context of video content reconciliation, we encountered some situations where the Active Learning sampling is very poor. This makes it difficult to build a good training set for the classifier and, as a consequence, the reconciliation results are poor as well.

For instance, we ran some tests and tried to reconcile two well-known public data providers (IMDb and TMDB), which contain reciprocal references that can be used as ground truth, along with good metadata. We could also build the dataset knowing that all entries could be reconciled in a many-to-one fashion (set 1 is contained in set 2, so 100% recall is theoretically possible).
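The "100% recall is theoretically possible" claim can be sanity-checked before any training: since set 1 is contained in set 2 via reciprocal references, every set-1 record must resolve to some set-2 record. A minimal sketch, assuming hypothetical record shapes with a made-up `tmdb_id` cross-reference field (the actual dataset fields are not shown in the issue):

```python
# Hypothetical sample records; "tmdb_id" is an assumed cross-reference field.
imdb = [
    {"id": "tt001", "tmdb_id": 9001},
    {"id": "tt002", "tmdb_id": 9002},
    {"id": "tt003", "tmdb_id": 9001},  # many-to-one: shares a TMDB target
]
tmdb_ids = {9001, 9002}

# 100% recall is theoretically possible iff every set-1 record
# resolves to some record in set 2.
unresolved = [r["id"] for r in imdb if r["tmdb_id"] not in tmdb_ids]
assert not unresolved  # ground truth covers the whole of set 1
```

A non-empty `unresolved` list would mean the recall ceiling is below 100% regardless of what the classifier learns.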

We tried to reconcile episodes and used a few fields in the process (episode title, series title, season number, episode number, series year). The Active Learning sampling was quite balanced between positive and negative examples, so it was quite effortless to collect 10 samples of positive and negative pairs. The final results were quite satisfying as well: recall 78%, precision 98%. Moreover, by scrolling through the results, we noticed that the model learned to ignore the episode title field, which was not consistent between the datasets.

Afterwards we performed a second test by removing the episode title field but keeping everything else as in the previous test (same dataset, same configuration). This time the Active Learning sampling was quite poor: almost all pairs were wrong (it took more than 200 pairs to obtain 8 positives). The final reconciliation in this case was also poor: recall 15% and precision 91%.
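For reference, the two test setups differ only in one variable definition. A sketch of how they might look as dedupe 2.x-style variable definitions (the dict form accepted by `dedupe.RecordLink` at the time of this issue); the exact field names and comparator types are assumptions, not taken from the issue:

```python
# First test: all five fields (names and types are assumed for illustration).
fields_test_1 = [
    {"field": "episode_title",  "type": "String"},
    {"field": "series_title",   "type": "String"},
    {"field": "season_number",  "type": "Exact"},
    {"field": "episode_number", "type": "Exact"},
    {"field": "series_year",    "type": "Exact"},
]

# Second test: identical, minus the (inconsistent) episode title field.
fields_test_2 = [f for f in fields_test_1 if f["field"] != "episode_title"]

# In an actual run these would be passed to dedupe.RecordLink(fields_test_2)
# before prepare_training() and console_label().
```

Since the model in the first test learned to ignore the episode title anyway, the second configuration was expected to behave similarly, which makes the sampling collapse surprising.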

I would like to ask, then, whether it is possible to mitigate this kind of issue:

  1. Is it important to balance the active learning pairs? In the second test we fed 200 negative vs. 8 positive pairs. Could this be the cause of the low recall?
  2. How explainable is the model? How would you suggest investigating bad reconciliation results in general?
  3. Do you have any idea what the possible causes of the bad sampling in this specific test case might be?
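The imbalance in question 1 is easy to quantify from dedupe's training file, which stores labelled pairs under `"match"` and `"distinct"` keys. A small sketch with toy pairs standing in for the real labelled data:

```python
# Toy training data in dedupe's training-file shape
# ({"match": [...], "distinct": [...]}); the pair contents are placeholders.
training = {
    "match":    [[{"series_title": "X"}, {"series_title": "X"}]] * 8,
    "distinct": [[{"series_title": "X"}, {"series_title": "Y"}]] * 200,
}

n_pos = len(training["match"])
n_neg = len(training["distinct"])
ratio = n_pos / (n_pos + n_neg)
print(f"{n_pos} positive vs {n_neg} negative ({ratio:.1%} positive)")
# → 8 positive vs 200 negative (3.8% positive)
```

At under 4% positives, the labelled set is heavily skewed toward `distinct` examples, which is the imbalance the question asks about.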

Thank you for your great work,

Antonio

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:6 (4 by maintainers)

Top GitHub Comments

1 reaction
fgregg commented, Mar 18, 2022

can you try the better sampling branch and let me know if that is giving you better results?

0 reactions
fgregg commented, Apr 12, 2022

closing for now due to lack of feedback.


