Parallelize blocking (Fingerprinter)
AFAIK, Fingerprinter.__call__ is embarrassingly parallel: you just need to partition your records by the number of CPUs you have, call Fingerprinter.__call__ on each partition, then reduce the results into a single blocking_map table.
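A minimal sketch of that partition/call/reduce flow, assuming a trained dedupe.Dedupe instance whose fingerprinter accepts an iterable of (record_id, record) pairs and yields (block_key, record_id) pairs; the helper names, the round-robin partitioning, and the module-level `deduper` are illustrative assumptions, not library API:

```python
# Sketch only: `deduper` stands for an already-trained dedupe.Dedupe instance
# that the worker processes inherit from the parent (e.g. via fork on Linux).
import itertools
from multiprocessing import Pool

NUM_CORES = 4

def fingerprint_partition(partition):
    # Each worker runs the Fingerprinter over its own slice of records and
    # returns the (block_key, record_id) pairs it produced.
    return list(deduper.fingerprinter(partition))

def parallel_blocking(record_items, num_cores=NUM_CORES):
    # Round-robin the (record_id, record) pairs into one partition per core.
    partitions = [[] for _ in range(num_cores)]
    for i, item in enumerate(record_items):
        partitions[i % num_cores].append(item)

    # Map over the partitions in parallel, then reduce everything into the
    # single list of rows destined for the blocking_map table.
    with Pool(num_cores) as pool:
        results = pool.map(fingerprint_partition, partitions)
    return list(itertools.chain.from_iterable(results))
```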
Currently that’s left to the implementer. Isn’t this something the library could do itself, considering it already has a num_cores parameter? I could help with this.
I’ve found https://github.com/dedupeio/dedupe/issues/305, but it’s quite old. That issue mentions message-passing costs, but for DB-based “big dedupe” applications that isn’t a problem, since the data isn’t in main memory: each worker process can read its own partition of the data from the DB.
Even if we decide the library won’t do this by default, maybe we should update the DB-based “big dedupe” examples, like pgsql_big_dedupe_example.py, to parallelize blocking?
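And a hedged sketch of that DB-based variant, in the spirit of pgsql_big_dedupe_example.py: each worker opens its own connections, streams only its slice of a processed_donors table (picked here with a simple modulo over the primary key, purely for illustration), fingerprints it, and appends to blocking_map. The table and column names, the connection string, and the inherited `deduper` are all assumptions, not the example’s actual code:

```python
# Sketch only: `deduper` is a trained dedupe.Dedupe instance inherited by the
# workers; table/column names mirror pgsql_big_dedupe_example.py but are
# assumptions here, as is the DSN.
from multiprocessing import Pool

import psycopg2
import psycopg2.extras

DSN = "dbname=dedupe"     # placeholder connection string
NUM_WORKERS = 4
FIELDS = ("city", "name", "zip", "state", "address")

def block_partition(worker_id):
    read_con = psycopg2.connect(DSN)
    write_con = psycopg2.connect(DSN)
    # Server-side cursor so the partition is streamed, not loaded into memory.
    with read_con.cursor(name="donor_select") as read_cur:
        read_cur.execute(
            "SELECT donor_id, city, name, zip, state, address "
            "FROM processed_donors WHERE donor_id %% %s = %s",
            (NUM_WORKERS, worker_id))
        records = ((row[0], dict(zip(FIELDS, row[1:]))) for row in read_cur)
        # (block_key, donor_id) pairs for this worker's partition.
        pairs = deduper.fingerprinter(records)
        with write_con.cursor() as write_cur:
            psycopg2.extras.execute_values(
                write_cur, "INSERT INTO blocking_map VALUES %s", pairs)
    write_con.commit()
    read_con.close()
    write_con.close()

if __name__ == "__main__":
    with Pool(NUM_WORKERS) as pool:
        pool.map(block_partition, range(NUM_WORKERS))
```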
Top GitHub Comments
Anyway, as a next step, your plan makes sense, Flávio.
#856 would be a good way around that.