
slow memory retrieval (significantly slower than simple pickle)

See original GitHub issue

Hi,

I’m a little confused about why reading from and writing to the (file-based) “memory” cache takes such an enormous amount of time compared to bare pickling/unpickling.

In my case, func() is a tiny memoized function that takes a short string argument and returns a (short) dict with (long) lists of fairly complex objects. For some reason, retrieving the function’s result from the cache takes significantly more time than just unpickling the file. The resulting file is approximately 70 MB.

I observe the same thing for any other function.

%prun func(some_str)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.436   12.436   52.011   52.011 pickle.py:1014(load)
 41531482    7.665    0.000   11.931    0.000 pickle.py:226(read)
  1922386    5.547    0.000    7.339    0.000 pickle.py:1504(load_build)
 41531483    4.266    0.000    4.266    0.000 {method 'read' of '_io.BufferedReader' objects}
  6490284    3.753    0.000    6.666    0.000 pickle.py:1439(load_long_binput)
  2645763    2.666    0.000    4.764    0.000 pickle.py:1192(load_binunicode)
 30070039    2.403    0.000    2.403    0.000 {built-in method builtins.isinstance}
  4140172    1.870    0.000    3.225    0.000 pickle.py:1415(load_binget)
  1922386    1.369    0.000    2.049    0.000 pickle.py:1316(load_newobj)
  9196954    1.359    0.000    1.359    0.000 {built-in method _struct.unpack}
  1922386    1.114    0.000    8.724    0.000 numpy_pickle.py:319(load_build)
 10857316    0.962    0.000    0.962    0.000 {method 'pop' of 'list' objects}
 14536246    0.873    0.000    0.873    0.000 {method 'append' of 'list' objects}
  1922386    0.873    0.000    1.218    0.000 pickle.py:1472(load_setitem)
  1922393    0.816    0.000    0.816    0.000 {built-in method builtins.getattr}
   676815    0.765    0.000    1.384    0.000 pickle.py:1458(load_appends)
  1922387    0.730    0.000    0.832    0.000 pickle.py:1257(load_empty_dictionary)
        1    0.715    0.715   53.099   53.099 <string>:1(<module>)
  1245385    0.559    0.000    0.848    0.000 pickle.py:1451(load_append)
...

%prun len(pickle.load(open("..file..", 'rb')))

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    4.587    4.587    4.587    4.587 {built-in method _pickle.load}
        1    0.553    0.553    5.140    5.140 <string>:1(<module>)
        1    0.000    0.000    5.140    5.140 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
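
For reference, a minimal sketch of the setup being profiled, run in an IPython session. The original func is not shown in the issue, so the body below is only a stand-in that returns a dict of long lists of small objects, roughly the shape described above; the cache directory name is illustrative.

import pickle
from joblib import Memory

memory = Memory('./cachedir', verbose=0)

@memory.cache
def func(some_str):
    # Stand-in body: a short dict whose values are long lists of small objects.
    return {some_str: [{'idx': i, 'label': str(i)} for i in range(1000000)]}

func('key')  # first call computes the value and writes output.pkl into the cache

# Cached retrieval goes through joblib's numpy_pickle loader (first profile above):
%prun func('key')

# Loading the same output.pkl with the stdlib pickle uses the C _pickle.load
# (second profile above); the actual path is elided here, as in the issue.
# %prun len(pickle.load(open('..file..', 'rb')))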

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments: 36 (29 by maintainers)

Top GitHub Comments

1 reaction
ghost commented on Jan 9, 2018

I think I got the following patch to memory.py to work:

index 14d7552..536826e 100644
--- a/../joblib/joblib/memory.py
+++ b/joblib/memory.py
@@ -34,7 +34,7 @@ from .func_inspect import format_call
 from .func_inspect import format_signature
 from ._memory_helpers import open_py_source
 from .logger import Logger, format_time, pformat
-from . import numpy_pickle
+import pickle
 from .disk import mkdirp, rm_subdirs, memstr_to_bytes
 from ._compat import _basestring, PY3_OR_LATER
 from .backports import concurrency_safe_rename
@@ -134,7 +134,7 @@ def _load_output(output_dir, func_name, timestamp=None, metadata=None,
         raise KeyError(
             "Non-existing cache value (may have been cleared).\n"
             "File %s does not exist" % filename)
-    result = numpy_pickle.load(filename, mmap_mode=mmap_mode)
+    result = pickle.load(open(filename, "rb"))
 
     return result
 
@@ -208,7 +208,7 @@ def concurrency_safe_write(to_write, filename, write_func):
     thread_id = id(threading.current_thread())
     temporary_filename = '{}.thread-{}-pid-{}'.format(
         filename, thread_id, os.getpid())
-    write_func(to_write, temporary_filename)
+    write_func(to_write, open(temporary_filename,"wb"))
     concurrency_safe_rename(temporary_filename, filename)
 
 
@@ -759,8 +759,7 @@ class MemorizedFunc(Logger):
         try:
             filename = os.path.join(dir, 'output.pkl')
             mkdirp(dir)
-            write_func = functools.partial(numpy_pickle.dump,
-                                           compress=self.compress)
+            write_func = pickle.dump
             concurrency_safe_write(output, filename, write_func)
             if self._verbose > 10:
                 print('Persisting in %s' % dir)

Of course it’s a huge hack that just bypasses everything. I wonder if it breaks anything.
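
For comparison, a minimal sketch of a less invasive workaround that leaves joblib itself untouched: a small file-based cache decorator that always goes through the stdlib’s C pickle. Everything here (disk_cache, CACHE_DIR, the hashing scheme) is illustrative and not part of joblib.

import functools
import hashlib
import os
import pickle

CACHE_DIR = './plain_pickle_cache'

def disk_cache(func):
    # Cache func's return value on disk using only the stdlib pickle.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        os.makedirs(CACHE_DIR, exist_ok=True)
        key = hashlib.sha1(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = os.path.join(CACHE_DIR, key + '.pkl')
        if os.path.exists(path):
            with open(path, 'rb') as f:
                return pickle.load(f)  # C unpickler, as in the fast profile above
        result = func(*args, **kwargs)
        tmp = '{}.pid-{}'.format(path, os.getpid())
        with open(tmp, 'wb') as f:
            pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, path)  # rename into place, similar in spirit to concurrency_safe_rename
        return result
    return wrapper

Like the patch above, this gives up what numpy_pickle is there for in the first place (memory-mapped and optionally compressed storage of large numpy arrays), so it only makes sense for return values like the dict of lists described in this issue.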

1 reaction
lesteve commented on Jan 9, 2017

Actually thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling (for lack of a better name) argument to Memory, which should be True by default.
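
No such argument exists in joblib today; purely to illustrate the proposal, the call site might look something like the sketch below (use_joblib_pickling is hypothetical).

from joblib import Memory

# Today's behaviour: cached results go through joblib's numpy_pickle.
memory = Memory('./cachedir', verbose=0)

# Hypothetical: opting out would fall back to the plain (C) pickle for
# functions that do not return large numpy arrays.
# memory = Memory('./cachedir', use_joblib_pickling=False)

@memory.cache
def func(some_str):
    return {some_str: []}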

Read more comments on GitHub >

Top Results From Across the Web

Why pickle eat memory? - python - Stack Overflow
Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object...
Read more >
Advanced Pandas: Optimize speed and memory - Medium
When retrieving a single value, using .at[] is faster than using .loc[] ... then converting to numpy arrays and lastly by using some...
Read more >
What are the differences between long-term, short-term ... - NCBI
In the recent literature there has been considerable confusion about the three types of memory: long-term, short-term, and working memory.
Read more >
Python mmap: Improved File I/O With Memory Mapping
In this tutorial, you'll learn how to use Python's mmap module to improve your code's performance when you're working with files. You'll get...
Read more >
Tutorial — zarr 2.13.3 documentation - Read the Docs
If you are already familiar with HDF5 then Zarr arrays provide similar ... can be significantly slower than retrieving data from a local...
Read more >
