How to run MultiZarrToZarr efficiently?
I have observed that I only get about 3-10% CPU use per Python process while running MultiZarrToZarr. Here is `top` while a job is running:
```
Tasks:  10 total,  1 running,  9 sleeping,  0 stopped,  0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   5943.5 total,   2256.4 free,   1163.2 used,   2524.0 buff/cache
MiB Swap:   1024.0 total,    785.8 free,    238.2 used.   4332.5 avail Mem

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU  %MEM   TIME+  COMMAND
10393 dstuebe  20   0  671772 145984 36432 S   2.7   2.4  0:05.08 python3
  340 dstuebe  20   0 7275124 676712 18412 S   0.3  11.1  3:13.20 java
10850 dstuebe  20   0    7188   3448  2896 R   0.3   0.1  0:00.02 top
    1 dstuebe  20   0    1020      4     0 S   0.0   0.0  0:01.45 docker-init
```
Looking over the MultiZarrToZarr code and watching its behavior in a profiler, I am a bit stumped about how to speed things up (Full logs):
```
2022-07-19T21:44:11.973Z P:MainProcess T:MainThread INFO:service.py:transform:All done aggregations!
         2150132 function calls (2148840 primitive calls) in 183.545 seconds

   Ordered by: internal time
   List reduced from 869 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2588  154.792    0.060  154.792    0.060 {method 'acquire' of '_thread.lock' objects}
        6    2.850    0.475    2.850    0.475 {method 'read' of '_io.BufferedReader' objects}
    18379    1.509    0.000    5.152    0.000 /Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/logging/__init__.py:283(__init__)
   245265    1.342    0.000    2.643    0.000 {built-in method builtins.isinstance}
```
It seems like the `for` loops over the files in each pass of `translate` are intrinsically serial, updating a stateful object in order. I am not complaining; I just want to make sure I am not missing an opportunity to run this more efficiently.
With #194 working better (at least the tmp files are gone; I still need to assess the memory usage), I should be able to run a large number of processes at once using dask, as sketched below; each one will just take a long time. Is that the best way forward here for the foreseeable future?
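For concreteness, here is a minimal sketch of that fan-out, assuming several independent aggregations; the `aggregations` mapping and `concat_dims` value are hypothetical:

```python
import dask
from kerchunk.combine import MultiZarrToZarr

@dask.delayed
def combine_one(name, ref_paths):
    """One serial MultiZarrToZarr run; many of these execute side by side."""
    mzz = MultiZarrToZarr(ref_paths, concat_dims=["time"])  # dims are illustrative
    return name, mzz.translate()

# aggregations: hypothetical {name: [paths to single-file reference JSONs]}
jobs = [combine_one(name, paths) for name, paths in aggregations.items()]
results = dict(dask.compute(*jobs))  # {name: combined reference dict}
```

Each job stays serial internally, but with enough jobs the workers at least stay busy.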
Top GitHub Comments
If the CPU use is low, then the time is spent in IO. It might well be worthwhile to `fs.cat` all of the reference files in `fss()` rather than create them one-by-one in the list comprehension.
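A minimal sketch of that idea from the caller's side, assuming the inputs are reference JSON files on S3; the bucket path and `concat_dims` are illustrative:

```python
import json

import fsspec
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3")
paths = fs.glob("my-bucket/refs/*.json")  # hypothetical location of the reference files

# Given a list of paths, fs.cat fetches them concurrently (a single async
# gather for s3fs) instead of one blocking round-trip per file.
raw = fs.cat(paths)
ref_dicts = [json.loads(raw[p]) for p in sorted(raw)]

# MultiZarrToZarr also accepts already-parsed reference dicts, so the JSON
# files are not re-opened one at a time during translate().
mzz = MultiZarrToZarr(
    ref_dicts,
    remote_protocol="s3",   # protocol of the underlying data files
    concat_dims=["time"],   # illustrative; depends on the dataset
)
combined = mzz.translate()
```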
Yes, you are quite right: much of what MultiZarrToZarr does could be parallel, or even daskified. The order of the references shouldn't matter, or can be sorted at the end.
What you can also do, though, is a “tree” combine, where you run MultiZarrToZarr on batches of inputs, and then again on the outputs of the batches. This we have successfully done before to save a lot of time.
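A minimal sketch of such a tree combine, assuming all of the single-file reference sets concatenate along one dimension; the batch size and `concat_dims` are illustrative:

```python
from kerchunk.combine import MultiZarrToZarr

def tree_combine(refs, batch_size=32):
    """Combine reference sets in batches, then combine the batch outputs,
    repeating until a single reference set remains."""
    while len(refs) > 1:
        batches = [refs[i:i + batch_size] for i in range(0, len(refs), batch_size)]
        # The batches within one level are independent, so this comprehension
        # could be mapped with dask.delayed or a process pool for parallelism.
        refs = [
            MultiZarrToZarr(batch, concat_dims=["time"]).translate()
            for batch in batches
        ]
    return refs[0]

combined = tree_combine(list_of_reference_dicts)  # hypothetical input list
```

Because translate() output is itself a valid MultiZarrToZarr input, each level shrinks the problem by the batch size, and only the final pass touches every reference.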