The dbm solution seems to make the blocking process extremely slow
I have about 30K records to match against, and with the default dbm approach it takes more than 10 minutes to match, for example, just 2 records against them. During matching you can see output like this:
[2017-06-29 20:43:56,841: INFO/PoolWorker-1] 10000, 182.5327482 seconds
[2017-06-29 20:53:40,909: INFO/PoolWorker-1] 20000, 758.9884932 seconds
which I believe is the output from https://github.com/dedupeio/dedupe/blob/master/dedupe/blocking.py#L42
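To put numbers on the slowdown, here is a quick calculation from the two log lines above. It assumes the logged seconds are cumulative elapsed time since blocking started, which is consistent with the roughly ten-minute gap between the two log timestamps. The variable names are mine, not dedupe's.

```python
# Figures copied from the two log lines above:
# (records processed so far, elapsed seconds, assumed cumulative).
checkpoints = [(10000, 182.5327482), (20000, 758.9884932)]

prev_count, prev_elapsed = 0, 0.0
rates = []
for count, elapsed in checkpoints:
    batch_seconds = elapsed - prev_elapsed
    rate = (count - prev_count) / batch_seconds
    rates.append(rate)
    print(f"records {prev_count}-{count}: {batch_seconds:.1f} s, {rate:.1f} records/s")
    prev_count, prev_elapsed = count, elapsed

# How many times slower the second batch of 10000 was than the first.
slowdown = rates[0] / rates[1]
print(f"second batch is about {slowdown:.1f}x slower than the first")
```

So the per-record cost is not constant: it grows as the shelf fills up.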
Since we have enough memory, I changed the code here so that blocking happens in an in-memory dictionary: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L1072
Basically, instead of returning shelf, return an empty Python dictionary:
def _temp_shelve():
    fd, file_path = tempfile.mkstemp()
    os.close(fd)
    try:
        shelf = shelve.open(file_path, 'n',
                            protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        if 'db type could not be determined' in str(e):
            os.remove(file_path)
            shelf = shelve.open(file_path, 'n',
                                protocol=pickle.HIGHEST_PROTOCOL)
        else:
            raise
    return {}, file_path  # return python dictionary instead of shelf
This makes the blocking and matching process use a lot of memory, but it can finish matching 2 records against 30K records in a few seconds.
Does this look normal?
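For anyone who wants to compare the two on their own machine, here is a minimal timing sketch. It is not dedupe's code: `fill`, `time_dict` and `time_shelf` are hypothetical helpers, and the record tuples are dummy data. It only mimics the `shelf[k] += [...]` pattern from the traceback further down. With a plain `shelve` shelf (no writeback), every such update unpickles the whole list stored under the key and pickles it back, so long block lists get more expensive with each append. That may be part of the slowdown, though I have not profiled dedupe itself.

```python
import os
import shelve
import tempfile
import time


def fill(store, n_records, n_keys):
    """Append n_records small tuples, spread over n_keys block keys."""
    for i in range(n_records):
        key = str(i % n_keys)
        # Read-modify-write, like `shelf[k] += [...]`: on a shelf this
        # unpickles the whole list and then pickles the longer list back.
        store[key] = store.get(key, []) + [(i, "record")]
    return store


def time_dict(n_records, n_keys):
    """Time the fill against a plain in-memory dictionary."""
    start = time.perf_counter()
    store = fill({}, n_records, n_keys)
    return time.perf_counter() - start, store


def time_shelf(n_records, n_keys):
    """Time the same fill against a shelf in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        start = time.perf_counter()
        with shelve.open(os.path.join(tmp, "blocks"), "n") as shelf:
            fill(shelf, n_records, n_keys)
            snapshot = dict(shelf)  # copy out before the shelf is closed
        return time.perf_counter() - start, snapshot


if __name__ == "__main__":
    dict_seconds, _ = time_dict(20000, 50)
    shelf_seconds, _ = time_shelf(20000, 50)
    print(f"dict:  {dict_seconds:.3f} s")
    print(f"shelf: {shelf_seconds:.3f} s")
```

The absolute numbers depend on which dbm backend Python picks, so treat the output as a rough comparison only.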
Also, dbm does not work for large data sets on macOS: by default there is no gdbm available for Python 3 on macOS (not exactly sure why), and this causes errors like this:
HASH: Out of overflow pages. Increase page size
Traceback (most recent call last):
  File "/Users/tendres/PycharmProjects/dedupe/tests/test_shelve.py", line 25, in <module>
    shelf[k] += [(i, record, ids)]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/shelve.py", line 125, in __setitem__
    self.dict[key.encode(self.keyencoding)] = f.getvalue()
_dbm.error: cannot add item to database
Process finished with exit code 1
This is also mentioned here: https://github.com/dedupeio/csvdedupe/issues/67
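To check which dbm backends a given Python install actually has, something like the following should work. `available_dbm_backends` is a hypothetical helper of mine; the module names are the standard-library ones (`dbm.gnu`, `dbm.ndbm`, `dbm.dumb`), and newer Python versions may ship additional backends not listed here.

```python
import importlib

# Standard-library dbm backends to probe for. dbm.dumb is pure Python,
# so it should always import; the other two depend on how Python was built.
CANDIDATES = ("dbm.gnu", "dbm.ndbm", "dbm.dumb")


def available_dbm_backends():
    """Return the names of the candidate dbm modules that can be imported."""
    found = []
    for name in CANDIDATES:
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        found.append(name)
    return found


if __name__ == "__main__":
    print(available_dbm_backends())
```

If `dbm.gnu` is missing from the list, `shelve` will be using one of the other backends, which would fit the `_dbm.error` in the traceback above.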
It would also be nice to have an option in the matching API to choose whether to use shelve (dbm) or not.
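Roughly what I have in mind, as a sketch only. `open_block_store`, `close_block_store` and the `in_memory` flag are made-up names, not part of dedupe's API. Both stores support the same dictionary-style access, so the blocking code would not need to care which one it gets.

```python
import os
import pickle
import shelve
import shutil
import tempfile


def open_block_store(in_memory=False):
    """Return (store, tmp_dir). tmp_dir is None for the in-memory store."""
    if in_memory:
        # Fast, but everything lives in RAM.
        return {}, None
    # Disk-backed: create the shelf inside a fresh temporary directory so
    # that whatever files the dbm backend creates can be removed together.
    tmp_dir = tempfile.mkdtemp()
    shelf = shelve.open(os.path.join(tmp_dir, "blocks"), "n",
                        protocol=pickle.HIGHEST_PROTOCOL)
    return shelf, tmp_dir


def close_block_store(store, tmp_dir):
    """Close a disk-backed store and delete its temporary directory."""
    if tmp_dir is not None:
        store.close()
        shutil.rmtree(tmp_dir, ignore_errors=True)


if __name__ == "__main__":
    store, tmp_dir = open_block_store(in_memory=True)
    store["some-block-key"] = [(1, {"name": "example"})]
    print(store["some-block-key"])
    close_block_store(store, tmp_dir)
```

The default stays disk-backed, so people with data sets larger than memory would not see any change.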
Top GitHub Comments
@betocollin worked great! Thanks for the tip.
@fgregg looks like a bug in 1.7 on Mac OSX
If you do end up testing 1.7.0, have the larger dataset be the second one.