
slow memory retrieval (significantly slower than simple pickle)

See original GitHub issue

Hi,

I’m a little confused about why reading from and writing to the (file-based) “memory” cache takes such an enormous amount of time compared to bare pickling/unpickling.

In my case, func() is a tiny memoized function that takes a short string argument and returns a (short) dict with (long) lists of fairly complex objects. For some reason, retrieving the function’s result from the cache takes significantly more time than just unpickling the file. The resulting file is approximately 70 MB.

I observe the same thing for any other function.

%prun func(some_str)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.436   12.436   52.011   52.011 pickle.py:1014(load)
 41531482    7.665    0.000   11.931    0.000 pickle.py:226(read)
  1922386    5.547    0.000    7.339    0.000 pickle.py:1504(load_build)
 41531483    4.266    0.000    4.266    0.000 {method 'read' of '_io.BufferedReader' objects}
  6490284    3.753    0.000    6.666    0.000 pickle.py:1439(load_long_binput)
  2645763    2.666    0.000    4.764    0.000 pickle.py:1192(load_binunicode)
 30070039    2.403    0.000    2.403    0.000 {built-in method builtins.isinstance}
  4140172    1.870    0.000    3.225    0.000 pickle.py:1415(load_binget)
  1922386    1.369    0.000    2.049    0.000 pickle.py:1316(load_newobj)
  9196954    1.359    0.000    1.359    0.000 {built-in method _struct.unpack}
  1922386    1.114    0.000    8.724    0.000 numpy_pickle.py:319(load_build)
 10857316    0.962    0.000    0.962    0.000 {method 'pop' of 'list' objects}
 14536246    0.873    0.000    0.873    0.000 {method 'append' of 'list' objects}
  1922386    0.873    0.000    1.218    0.000 pickle.py:1472(load_setitem)
  1922393    0.816    0.000    0.816    0.000 {built-in method builtins.getattr}
   676815    0.765    0.000    1.384    0.000 pickle.py:1458(load_appends)
  1922387    0.730    0.000    0.832    0.000 pickle.py:1257(load_empty_dictionary)
        1    0.715    0.715   53.099   53.099 <string>:1(<module>)
  1245385    0.559    0.000    0.848    0.000 pickle.py:1451(load_append)
...

%prun len(pickle.load(open("..file..", 'rb')))

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    4.587    4.587    4.587    4.587 {built-in method _pickle.load}
        1    0.553    0.553    5.140    5.140 <string>:1(<module>)
        1    0.000    0.000    5.140    5.140 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
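
For reference, a minimal sketch of the setup being profiled, run in an IPython session. The original func is not shown in the issue, so the body below is only a stand-in that returns a dict of long lists of small objects, roughly the shape described above; the cache directory name is illustrative.

import pickle
from joblib import Memory

memory = Memory('./cachedir', verbose=0)

@memory.cache
def func(some_str):
    # Stand-in body: a short dict whose values are long lists of small objects.
    return {some_str: [{'idx': i, 'label': str(i)} for i in range(1000000)]}

func('key')  # first call computes the value and writes output.pkl into the cache

# Cached retrieval goes through joblib's numpy_pickle loader (first profile above):
%prun func('key')

# Loading the same output.pkl with the stdlib pickle uses the C _pickle.load
# (second profile above); the actual path is elided here, as in the issue.
# %prun len(pickle.load(open('..file..', 'rb')))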

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments: 36 (29 by maintainers)

Top GitHub Comments

1 reaction
ghost commented on Jan 9, 2018

I think I got the following patch to memory.py to work:

index 14d7552..536826e 100644
--- a/../joblib/joblib/memory.py
+++ b/joblib/memory.py
@@ -34,7 +34,7 @@ from .func_inspect import format_call
 from .func_inspect import format_signature
 from ._memory_helpers import open_py_source
 from .logger import Logger, format_time, pformat
-from . import numpy_pickle
+import pickle
 from .disk import mkdirp, rm_subdirs, memstr_to_bytes
 from ._compat import _basestring, PY3_OR_LATER
 from .backports import concurrency_safe_rename
@@ -134,7 +134,7 @@ def _load_output(output_dir, func_name, timestamp=None, metadata=None,
         raise KeyError(
             "Non-existing cache value (may have been cleared).\n"
             "File %s does not exist" % filename)
-    result = numpy_pickle.load(filename, mmap_mode=mmap_mode)
+    result = pickle.load(open(filename, "rb"))
 
     return result
 
@@ -208,7 +208,7 @@ def concurrency_safe_write(to_write, filename, write_func):
     thread_id = id(threading.current_thread())
     temporary_filename = '{}.thread-{}-pid-{}'.format(
         filename, thread_id, os.getpid())
-    write_func(to_write, temporary_filename)
+    write_func(to_write, open(temporary_filename,"wb"))
     concurrency_safe_rename(temporary_filename, filename)
 
 
@@ -759,8 +759,7 @@ class MemorizedFunc(Logger):
         try:
             filename = os.path.join(dir, 'output.pkl')
             mkdirp(dir)
-            write_func = functools.partial(numpy_pickle.dump,
-                                           compress=self.compress)
+            write_func = pickle.dump
             concurrency_safe_write(output, filename, write_func)
             if self._verbose > 10:
                 print('Persisting in %s' % dir)

Of course it’s a huge hack that just bypasses everything. I wonder if it breaks anything.
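
For comparison, a minimal sketch of a less invasive workaround that leaves joblib itself untouched: a small file-based cache decorator that always goes through the stdlib’s C pickle. Everything here (disk_cache, CACHE_DIR, the hashing scheme) is illustrative and not part of joblib.

import functools
import hashlib
import os
import pickle

CACHE_DIR = './plain_pickle_cache'

def disk_cache(func):
    # Cache func's return value on disk using only the stdlib pickle.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        os.makedirs(CACHE_DIR, exist_ok=True)
        key = hashlib.sha1(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = os.path.join(CACHE_DIR, key + '.pkl')
        if os.path.exists(path):
            with open(path, 'rb') as f:
                return pickle.load(f)  # C unpickler, as in the fast profile above
        result = func(*args, **kwargs)
        tmp = '{}.pid-{}'.format(path, os.getpid())
        with open(tmp, 'wb') as f:
            pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, path)  # rename into place, similar in spirit to concurrency_safe_rename
        return result
    return wrapper

Like the patch above, this gives up what numpy_pickle is there for in the first place (memory-mapped and optionally compressed storage of large numpy arrays), so it only makes sense for return values like the dict of lists described in this issue.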

1 reaction
lesteve commented on Jan 9, 2017

Actually thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling (for lack of a better name) argument to Memory, which should be True by default.
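
No such argument exists in joblib today; purely to illustrate the proposal, the call site might look something like the sketch below (use_joblib_pickling is hypothetical).

from joblib import Memory

# Today's behaviour: cached results go through joblib's numpy_pickle.
memory = Memory('./cachedir', verbose=0)

# Hypothetical: opting out would fall back to the plain (C) pickle for
# functions that do not return large numpy arrays.
# memory = Memory('./cachedir', use_joblib_pickling=False)

@memory.cache
def func(some_str):
    return {some_str: []}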

Read more comments on GitHub >

Top Results From Across the Web

Why pickle eat memory? - python - Stack Overflow
Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object...
Read more >
Advanced Pandas: Optimize speed and memory - Medium
When retrieving a single value, using .at[] is faster than using .loc[] ... then converting to numpy arrays and lastly by using some...
Read more >
What are the differences between long-term, short-term ... - NCBI
In the recent literature there has been considerable confusion about the three types of memory: long-term, short-term, and working memory.
Read more >
Python mmap: Improved File I/O With Memory Mapping
In this tutorial, you'll learn how to use Python's mmap module to improve your code's performance when you're working with files. You'll get...
Read more >
Tutorial — zarr 2.13.3 documentation - Read the Docs
If you are already familiar with HDF5 then Zarr arrays provide similar ... can be significantly slower than retrieving data from a local...
Read more >
