Update an existing model, rather than learning a new one from scratch each time?

See original GitHub issue

Thanks for this project, it’s been very useful to me! However, I have a small question/issue:

Say I have a trained model and periodically get new training data that I’d like to use to update my model.

From what I can tell, it’s impossible to load an existing settings file (which I believe contains the previously learned predicates?), add some new marked pairs, then train the model. Instead, it seems I have to:

  1. Re-load & resample my data.
  2. Load up my existing training file.
  3. Call ‘markPairs’ with my new training data.
  4. Re-write my training file.
  5. Call ‘train’.
  6. Re-write my settings file.

It would be nice if I could skip step 1, since it seems to take the longest and, in theory, simply loading my existing settings file should get me to that point.

What I’m doing now (pseudocode):

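data_d = readData()                        # step 1: re-load the data
deduper = dedupe.Dedupe(fieldDefinitions)
deduper.sample(data_d)                     # step 1: re-sample (the slow part)
deduper.readTraining(trainingFile)         # step 2: load existing labeled pairs
deduper.markPairs(newTrainingPairs)        # step 3: add the new labeled pairs
deduper.writeTraining(trainingFile)        # step 4: re-write the training file
deduper.train()                            # step 5: re-train the model
deduper.writeSettings(settingsFile)        # step 6: re-write the settings file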
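
Steps 2 to 4 above (load the old labels, add the new ones, write them back) are cheap compared with sampling. As a rough illustration only, here is a standard-library sketch of that merge. It assumes the training file is JSON with "match" and "distinct" keys, each holding a list of record pairs; the helper names `merge_labeled_pairs` and `update_training_file` are hypothetical and not part of dedupe, whose own `readTraining`, `markPairs` and `writeTraining` calls would normally do this work.

```python
import json
from pathlib import Path


def merge_labeled_pairs(existing, new):
    """Return training data with the new pairs appended, skipping exact duplicates."""
    merged = {
        "match": [list(pair) for pair in existing.get("match", [])],
        "distinct": [list(pair) for pair in existing.get("distinct", [])],
    }
    for label in ("match", "distinct"):
        for pair in new.get(label, []):
            pair = list(pair)  # JSON stores pairs as lists, so normalise tuples
            if pair not in merged[label]:
                merged[label].append(pair)
    return merged


def update_training_file(path, new_pairs):
    """Read the training file at `path` (if any), merge in `new_pairs`, write it back."""
    path = Path(path)
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_labeled_pairs(existing, new_pairs)
    path.write_text(json.dumps(merged, indent=2))
    return merged
```

The merge only touches labeled pairs. It does not remove the need for a data sample at training time, which is the step the question is really about.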

What I’d like to be able to do:

deduper = dedupe.Dedupe(fieldDefinitions)
deduper.readSettings(settingsFile)
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)

Am I missing something?

Edit: thinking about it a bit more, I guess the model would need to have samples loaded up to re-train on the new data anyhow…so there’s no way to skip the data load/sample step?

Thanks, Dustin

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 9
  • Comments: 7

Top GitHub Comments

4 reactions
adriennefranke commented, Oct 15, 2018

StaticDedupe and the other Static* classes do not have the attribute ‘train’. So if you want to do any additional training, active labeling, etc., you have to reload and resample the data into the Deduper/RecordLinker/Gazetteer, then write out the settings and labeled examples. You can reload those later using the Static* class. See #679. This is a little annoying though. Does anyone know a way to just add labeled examples to the model without having to reload/resample all the data?
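
The round trip described in this comment might look roughly like the sketch below. It assumes the dedupe 1.x camelCase API that the question uses (`sample`, `readTraining`, `markPairs`, `train`, `writeTraining`, `writeSettings`, `StaticDedupe`) and that the read/write calls take open file objects; signatures differ between dedupe releases, so check the version you have installed. The function name `retrain_and_freeze` and its parameters are hypothetical.

```python
def retrain_and_freeze(data_d, fields, training_path, settings_path, new_pairs):
    """Retrain with the trainable class, then reload the result as a static model."""
    import dedupe  # imported lazily; assumes a dedupe 1.x release is installed

    deduper = dedupe.Dedupe(fields)
    deduper.sample(data_d)  # the slow step that cannot currently be skipped

    # Load the previously labeled pairs, then add the new ones.
    with open(training_path) as f:
        deduper.readTraining(f)
    deduper.markPairs(new_pairs)  # expected shape: {'match': [...], 'distinct': [...]}
    deduper.train()

    # Persist both the labels and the learned settings.
    with open(training_path, "w") as f:
        deduper.writeTraining(f)
    with open(settings_path, "wb") as f:
        deduper.writeSettings(f)

    # The static class only reads settings; it cannot be trained further.
    with open(settings_path, "rb") as f:
        return dedupe.StaticDedupe(f)
```

The returned static model can be used for matching, but any later training has to go back through the full function, including the sampling step.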

0 reactions
ieriii commented, May 27, 2020

Hi, I’ve just created a pull request to the pandas-dedupe library, a wrapper around dedupe. The changes allow users to update an existing model rather than training it from scratch each time. See pandas-dedupe/pull/26

Note: the pull request has now been merged to master. Updating an existing model is now the default in pandas_dedupe.
