The dbm solution seems to make the blocking process extremely slow

I have about 30K records to match against, and with the default dbm approach it takes more than 10 minutes to match, for example, just 2 records against them. During matching you can see output like this:

[2017-06-29 20:43:56,841: INFO/PoolWorker-1] 10000, 182.5327482 seconds
[2017-06-29 20:53:40,909: INFO/PoolWorker-1] 20000, 758.9884932 seconds

which I believe is the output from https://github.com/dedupeio/dedupe/blob/master/dedupe/blocking.py#L42

Since we have enough memory for now, I changed the code here so that blocking happens in an in-memory dictionary: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L1072

Basically, instead of returning the shelf, return an empty Python dictionary:

import os
import pickle
import shelve
import tempfile


def _temp_shelve():
    fd, file_path = tempfile.mkstemp()
    os.close(fd)

    try:
        shelf = shelve.open(file_path, 'n',
                            protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        if 'db type could not be determined' in str(e):
            os.remove(file_path)
            shelf = shelve.open(file_path, 'n',
                                protocol=pickle.HIGHEST_PROTOCOL)
        else:
            raise

    return {}, file_path  # return a Python dictionary instead of the shelf

This makes the blocking and matching process use a lot of memory, but matching 2 records against the 30K records now finishes in a few seconds.

Does this look normal?
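To make the comparison concrete, here is a minimal sketch that fills a plain dict and a shelf with the same access pattern and times both. I have not profiled dedupe itself, so this only assumes that blocking grows a list per block key with a read-modify-write; the names `fill` and `compare`, the key format, and the sizes are all made up for illustration.

```python
import os
import shelve
import shutil
import tempfile
import time


def fill(store, n_records=2000, n_blocks=50):
    """Append each record id to its block's list and return elapsed seconds."""
    start = time.perf_counter()
    for i in range(n_records):
        key = "block-%d" % (i % n_blocks)
        # On a shelf this unpickles the stored list, builds a longer one,
        # and pickles it back; on a dict it never leaves memory.
        store[key] = store.get(key, []) + [i]
    return time.perf_counter() - start


def compare():
    """Run the same workload against a dict and a temporary shelf."""
    in_memory = {}
    dict_seconds = fill(in_memory)

    tmp_dir = tempfile.mkdtemp()
    try:
        with shelve.open(os.path.join(tmp_dir, "blocks"), "n") as shelf:
            shelf_seconds = fill(shelf)
            on_disk = dict(shelf)
    finally:
        shutil.rmtree(tmp_dir)

    # Both stores should end up with identical contents.
    return dict_seconds, shelf_seconds, in_memory == on_disk


if __name__ == "__main__":
    dict_seconds, shelf_seconds, same = compare()
    print("dict: %.4fs, shelf: %.4fs, same contents: %s"
          % (dict_seconds, shelf_seconds, same))
```

The actual timings depend on which dbm backend shelve picks on the machine, so I would not read too much into the absolute numbers.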

Also, the dbm approach does not work for large data sets on macOS, because by default there is no gdbm available for Python 3 on macOS (I am not exactly sure why), and it causes errors like this:

HASH: Out of overflow pages.  Increase page size
Traceback (most recent call last):
  File "/Users/tendres/PycharmProjects/dedupe/tests/test_shelve.py", line 25, in <module>
    shelf[k] += [(i, record, ids)]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/shelve.py", line 125, in __setitem__
    self.dict[key.encode(self.keyencoding)] = f.getvalue()
_dbm.error: cannot add item to database

Process finished with exit code 1

This is also mentioned here: https://github.com/dedupeio/csvdedupe/issues/67
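For anyone who wants to check their own setup, this is a small sketch that lists which of the standard dbm backends can be imported. The helper name `available_dbm_backends` is mine; the module names are the standard library ones (newer Python versions may ship additional backends that this does not check).

```python
import importlib


def available_dbm_backends():
    """Return the names of the standard dbm submodules that import cleanly."""
    found = []
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            importlib.import_module(name)
        except ImportError:
            # The backend's C library was not available when Python was built.
            continue
        found.append(name)
    return found


if __name__ == "__main__":
    print(available_dbm_backends())
```

If `dbm.gnu` is missing from the list, shelve has to fall back to one of the other backends.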

It would be nice to have an option on the matching API to choose whether or not to use shelve (or dbm), I suppose.
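Something along these lines is what I have in mind. This is only a rough sketch and not dedupe's API: `temp_block_store`, `close_block_store`, and the `in_memory` flag are hypothetical names, and I use a temporary directory instead of `mkstemp` so that cleanup works whatever files the dbm backend creates.

```python
import os
import pickle
import shelve
import shutil
import tempfile


def temp_block_store(in_memory=False):
    """Return (store, tmp_dir); tmp_dir is None for the in-memory case."""
    if in_memory:
        # Fast, but everything has to fit in RAM.
        return {}, None

    tmp_dir = tempfile.mkdtemp()
    shelf = shelve.open(os.path.join(tmp_dir, "blocks"), "n",
                        protocol=pickle.HIGHEST_PROTOCOL)
    return shelf, tmp_dir


def close_block_store(store, tmp_dir):
    """Close a shelf-backed store and delete its temporary directory."""
    if tmp_dir is not None:
        store.close()
        shutil.rmtree(tmp_dir)


if __name__ == "__main__":
    for flag in (True, False):
        store, tmp_dir = temp_block_store(in_memory=flag)
        store["some-block-key"] = [(0, {"name": "example"}, set())]
        print(flag, store["some-block-key"])
        close_block_store(store, tmp_dir)
```

Both branches return a mapping with the same get/set interface, so the calling code would not need to know which one it received.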