Update an existing model, rather than learning a new one from scratch each time?

Thanks for this project, it’s been very useful to me! However, I have a small question/issue:

Say I have a trained model and periodically get new training data that I’d like to use to update my model.

From what I can tell, it’s impossible to load an existing settings file (which I believe contains the previously learned predicates?), add some new marked pairs, then train the model. Instead, it seems I have to:

1. Re-load & resample my data.
2. Load up my existing training file.
3. Call ‘markPairs’ with my new training data.
4. Re-write my training file.
5. Call ‘train’.
6. Re-write my settings file.

It would be nice if I could skip step 1, since it seems to take the longest and, in theory, simply loading my existing settings file should get me to that point.
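Steps 2 to 4 of the list above amount to merging newly labeled pairs into the saved training file. The following is a minimal sketch of that merge using only the standard library, not dedupe's own implementation. It assumes the training file is plain JSON with a `match` list and a `distinct` list of record pairs; the function names `merge_labeled_pairs` and `update_training_file` are hypothetical and do not come from the dedupe API.

```python
import json
from pathlib import Path


def _pair_key(pair):
    # Serialize a record pair deterministically so duplicates can be detected.
    return json.dumps(pair, sort_keys=True)


def merge_labeled_pairs(existing, new):
    """Return the pairs in `existing` plus any pairs in `new` not already present.

    Assumes both arguments are dicts with optional 'match' and 'distinct' lists.
    """
    merged = {}
    for label in ("match", "distinct"):
        seen = set()
        pairs = []
        for pair in list(existing.get(label, [])) + list(new.get(label, [])):
            key = _pair_key(pair)
            if key not in seen:
                seen.add(key)
                pairs.append(pair)
        merged[label] = pairs
    return merged


def update_training_file(path, new_pairs):
    """Load the training file if it exists, merge in `new_pairs`, write it back."""
    path = Path(path)
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_labeled_pairs(existing, new_pairs)
    path.write_text(json.dumps(merged, indent=2))
    return merged
```

This only keeps the labeled examples up to date. The model itself still has to be re-trained on sampled data before new settings can be written.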

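What I’m doing now (pseudocode):

# Step 1: re-load the data and draw a fresh training sample
data_d = readData()
deduper = dedupe.Dedupe(fieldDefinitions)
deduper.sample(data_d)
# Steps 2-4: load the saved labels, add the new ones, write them back
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
# Steps 5-6: re-train and save the settings
deduper.train()
deduper.writeSettings(settingsFile)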

What I’d like to be able to do:

deduper = dedupe.Dedupe(fieldDefinitions)
deduper.readSettings(settingsFile)
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)

Am I missing something?

Edit: thinking about it a bit more, I guess the model would need to have samples loaded up to re-train on the new data anyhow…so there’s no way to skip the data load/sample step?

Thanks, Dustin

Issue Analytics

State:
Created 5 years ago
Reactions: 9
Comments: 7

Top GitHub Comments

4 reactions

adriennefranke commented, Oct 15, 2018

The StaticDedupe and the other Static* classes do not have the attribute ‘train’. So if you want to do any additional training, active labeling, etc., you have to reload and resample the data into the Deduper/RecordLinker/Gazetteer, write the settings and labeled examples out, and then reload them later with the Static* class. See #679. This is a little annoying, though. Does anyone know a way to add labeled examples to the model without having to reload and resample all the data?
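Since the Static* classes can only load saved settings, one way to limit how often the full reload/resample/train cycle runs is to check whether the labeled examples have changed since the settings were last written. The sketch below does this by comparing file modification times. It is a hypothetical helper built on the standard library, not part of dedupe, and it assumes the settings and training data live in two separate files on disk.

```python
import os


def needs_retraining(settings_path, training_path):
    """Return True when the settings file is missing or older than the training file."""
    if not os.path.exists(settings_path):
        # No saved model yet, so a full training run is required.
        return True
    if not os.path.exists(training_path):
        # Settings exist and there are no labeled examples to learn from.
        return False
    # Re-train only if labels were written after the settings were saved.
    return os.path.getmtime(training_path) > os.path.getmtime(settings_path)
```

When this returns False, the saved settings can be loaded with the Static* class as described above. When it returns True, the data still has to be reloaded and resampled before training.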

0 reactions

ieriii commented, May 27, 2020

Hi, I’ve just created a pull request to the pandas-dedupe library, a wrapper around dedupe. The changes allow users to update an existing model rather than training it from scratch each time. See pandas-dedupe/pull/26

Note: the pull request has now been merged to master. Updating the model is now the default in pandas_dedupe.
