The dbm solution seems to make the blocking process extremely slow

I have 30K records to match against. With the default dbm approach, matching even a 2-record input takes more than 10 minutes. During matching you can see output like this:

[2017-06-29 20:43:56,841: INFO/PoolWorker-1] 10000, 182.5327482 seconds
[2017-06-29 20:53:40,909: INFO/PoolWorker-1] 20000, 758.9884932 seconds

which I believe is the output from https://github.com/dedupeio/dedupe/blob/master/dedupe/blocking.py#L42

Since we have enough memory for now, I changed the code here so that blocking happens in an in-memory dictionary: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L1072

Basically, instead of returning the shelf, it returns an empty Python dictionary:

import os
import pickle
import shelve
import tempfile


def _temp_shelve():
    fd, file_path = tempfile.mkstemp()
    os.close(fd)

    # The shelf is still created here, but it is no longer returned or used.
    try:
        shelf = shelve.open(file_path, 'n',
                            protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        if 'db type could not be determined' in str(e):
            os.remove(file_path)
            shelf = shelve.open(file_path, 'n',
                                protocol=pickle.HIGHEST_PROTOCOL)
        else:
            raise

    return {}, file_path  # return a Python dictionary instead of the shelf

This makes blocking and matching use a lot of memory, but matching 2 records against 30K records finishes in a few seconds.

Does this look normal?
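
To compare the two approaches outside of dedupe, here is a minimal sketch that fills a shelf and a plain dictionary with the same read-modify-write pattern as the `shelf[k] += [...]` line in the traceback further down. The `fill` and `timed` helpers, the key count, and the record contents are made up for illustration; this is not dedupe's blocking code, and the timings will vary with the dbm backend in use.

```python
import os
import pickle
import shelve
import tempfile
import time


def fill(store, n_keys=50, appends_per_key=20):
    """Append small tuples to per-key lists in any dict-like store."""
    for i in range(n_keys * appends_per_key):
        key = str(i % n_keys)
        # On a shelf this unpickles the stored list, extends it,
        # and pickles the whole list back on every single append.
        store[key] = store.get(key, []) + [(i, "record")]
    return store


def timed(store):
    """Return the seconds spent filling the given store."""
    start = time.perf_counter()
    fill(store)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    shelf = shelve.open(os.path.join(tmp, "blocks"), "n",
                        protocol=pickle.HIGHEST_PROTOCOL)
    try:
        shelf_seconds = timed(shelf)
        shelf_contents = dict(shelf)  # copy out before the file goes away
    finally:
        shelf.close()

memory = {}
dict_seconds = timed(memory)
print("shelf: %.4fs, dict: %.4fs" % (shelf_seconds, dict_seconds))
```

Both stores end up with the same contents; only the cost per append differs, because the dictionary never has to serialize anything or touch the disk.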


Also, the dbm approach does not work for large data sets on macOS: by default there is no gdbm available for Python 3 on macOS (I'm not exactly sure why), which causes errors like this:

HASH: Out of overflow pages.  Increase page size
Traceback (most recent call last):
  File "/Users/tendres/PycharmProjects/dedupe/tests/test_shelve.py", line 25, in <module>
    shelf[k] += [(i, record, ids)]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/shelve.py", line 125, in __setitem__
    self.dict[key.encode(self.keyencoding)] = f.getvalue()
_dbm.error: cannot add item to database

Process finished with exit code 1

This is also mentioned here: https://github.com/dedupeio/csvdedupe/issues/67
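
A quick way to check whether gdbm is present on a given interpreter is to try importing the standard-library dbm submodules. The sketch below assumes nothing about dedupe; `available_dbm_backends` is a hypothetical helper name.

```python
import importlib


def available_dbm_backends():
    """Return the names of the dbm submodules that import on this interpreter."""
    found = []
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            importlib.import_module(name)
        except ImportError:
            # The C extension behind this backend is not installed.
            continue
        found.append(name)
    return found


print(available_dbm_backends())
```

If `dbm.gnu` is missing from the printed list, shelve has to fall back to one of the other backends on that machine.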


It would be nice to have an option in the matching API to choose whether to use shelve (or dbm), I suppose.
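
Such an option could look roughly like the sketch below. The `temp_store` function and its `in_memory` flag are hypothetical and are not part of dedupe's API. Unlike `_temp_shelve` above, this version puts the shelf file in a fresh temporary directory rather than reusing an empty file from `mkstemp`, which is an assumption made here to sidestep the 'db type could not be determined' retry.

```python
import os
import pickle
import shelve
import tempfile


def temp_store(in_memory=False):
    """Hypothetical helper: return (store, file_path).

    in_memory=True  -> plain dict, file_path is None
    in_memory=False -> shelf backed by a file in a new temporary directory
    """
    if in_memory:
        return {}, None
    tmp_dir = tempfile.mkdtemp()
    file_path = os.path.join(tmp_dir, "blocks")
    shelf = shelve.open(file_path, "n", protocol=pickle.HIGHEST_PROTOCOL)
    return shelf, file_path


# Usage: both stores support the same dict-style access.
store, path = temp_store(in_memory=True)
store["block-key"] = [(0, "record")]
```

The caller would still be responsible for closing a returned shelf and removing its temporary directory when matching is done.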

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 16 (6 by maintainers)

Top GitHub Comments

2 reactions
jimclouse commented, Dec 22, 2017

@betocollin worked great! Thanks for the tip.

@fgregg looks like a bug in 1.7 on Mac OSX

1 reaction
fgregg commented, Jul 11, 2017

If you do end up testing 1.7.0, have the larger dataset be the second one.
