Update an existing model, rather than learning a new one from scratch each time?
Thanks for this project, it’s been very useful to me! However, I have a small question/issue:
Say I have a trained model and periodically get new training data that I’d like to use to update my model.
From what I can tell, it’s impossible to load an existing settings file (which I believe contains the previously learned predicates?), add some new marked pairs, then train the model. Instead, it seems I have to:
1. Re-load & resample my data.
2. Load up my existing training file.
3. Call ‘markPairs’ with my new training data.
4. Re-write my training file.
5. Call ‘train’.
6. Re-write my settings file.
It would be nice if I could skip step 1, since it seems to take the longest and, in theory, simply loading my existing settings file should get me to that point.
What I’m doing now (pseudocode):
data_d = readData()
deduper = dedupe.Dedupe(fieldDefinitions)
deduper.sample(data_d)
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)
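The label bookkeeping in the middle of this workflow (load old pairs, add new ones, write them back) can be sketched on its own, without dedupe. This is a minimal sketch, assuming labeled pairs are held as a dict with 'match' and 'distinct' lists of record pairs, which is the shape passed to markPairs; the helper name `merge_labeled_pairs` and the sample records are hypothetical.

```python
import json


def merge_labeled_pairs(existing, new):
    """Combine two {'match': [...], 'distinct': [...]} dicts, skipping repeats.

    Hypothetical helper, not part of dedupe. It does not resolve conflicts:
    a pair labeled 'match' in one input and 'distinct' in the other is kept
    under both labels.
    """
    merged = {"match": [], "distinct": []}
    seen = set()
    for source in (existing, new):
        for label in ("match", "distinct"):
            for pair in source.get(label, []):
                # Records are dicts, so serialize them to get hashable keys.
                # Sorting makes (a, b) and (b, a) count as the same pair.
                key = (label, tuple(sorted(
                    json.dumps(record, sort_keys=True) for record in pair
                )))
                if key not in seen:
                    seen.add(key)
                    merged[label].append(tuple(pair))
    return merged


# Made-up example records.
existing = {"match": [({"name": "Ann"}, {"name": "Anne"})], "distinct": []}
new = {
    "match": [({"name": "Anne"}, {"name": "Ann"})],  # same pair, reversed
    "distinct": [({"name": "Ann"}, {"name": "Bob"})],
}
merged = merge_labeled_pairs(existing, new)
```

The merged dict could then be handed to markPairs in place of the raw new pairs. It does not remove the need to sample the data first.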
What I’d like to be able to do:
deduper = dedupe.Dedupe(fieldDefinitions)
deduper.readSettings(settingsFile)
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)
Am I missing something?
Edit: thinking about it a bit more, I guess the model would need to have samples loaded up to re-train on the new data anyhow…so there’s no way to skip the data load/sample step?
Thanks, Dustin
Issue Analytics
- Created 5 years ago
- Reactions:9
- Comments:7
Top GitHub Comments
The StaticDedupe and the other Static* classes do not have the attribute ‘train’. So if you want to do any additional training, active labeling, etc., you have to reload and resample the data into the Deduper/RecordLinker/Gazetteer, write those settings and labeled examples out, and then reload them later using the Static* class. See #679. This is a little annoying, though. Does anyone know a way to just add labeled examples to the model without having to reload/resample all the data?
Hi, I’ve just created a pull request to the pandas-dedupe library, a wrapper around dedupe. The changes allow users to update an existing model rather than training it from scratch each time. See pandas-dedupe/pull/26
Note: the pull request has now been merged to master. Updating an existing model is now the default in pandas_dedupe.
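For plain dedupe, the reload-and-resample workaround described in the first comment might look roughly like the sketch below. It assumes the snake_case method names of dedupe 2.x (`prepare_training`, `mark_pairs`, `train`, `write_training`, `write_settings`) rather than the camelCase 1.x names used in the question. The function name `retrain_with_new_pairs` and its parameters are hypothetical, and the deduper is passed in so the function itself does not import dedupe.

```python
import os


def retrain_with_new_pairs(deduper, data, new_pairs, training_path, settings_path):
    """Resample `data`, fold in old and new labels, retrain, and save.

    `deduper` is assumed to be a freshly constructed dedupe.Dedupe instance
    (dedupe 2.x method names). The data still has to be sampled again;
    this only keeps the previously labeled pairs from being lost.
    """
    if os.path.exists(training_path):
        # Resample the data and reload earlier labeled pairs in one call.
        with open(training_path) as f:
            deduper.prepare_training(data, training_file=f)
    else:
        deduper.prepare_training(data)

    # new_pairs has the form {'match': [(rec, rec), ...], 'distinct': [...]}.
    deduper.mark_pairs(new_pairs)
    deduper.train()

    # Persist the combined labels (text) and the learned settings (binary).
    with open(training_path, "w") as f:
        deduper.write_training(f)
    with open(settings_path, "wb") as f:
        deduper.write_settings(f)
```

The settings file written at the end is what a Static* class would later load for prediction only. Any further training means running this again with the data available.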