
How to run MultiZarrToZarr efficiently?

See original GitHub issue

I have observed that I only get about 3-10% CPU use per Python process while running MultiZarrToZarr.

Tasks:  10 total,   1 running,   9 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   5943.5 total,   2256.4 free,   1163.2 used,   2524.0 buff/cache
MiB Swap:   1024.0 total,    785.8 free,    238.2 used.   4332.5 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10393 dstuebe   20   0  671772 145984  36432 S   2.7   2.4   0:05.08 python3
  340 dstuebe   20   0 7275124 676712  18412 S   0.3  11.1   3:13.20 java
10850 dstuebe   20   0    7188   3448   2896 R   0.3   0.1   0:00.02 top
    1 dstuebe   20   0    1020      4      0 S   0.0   0.0   0:01.45 docker-init

Looking over the MultiZarrToZarr code and watching its behavior in a profiler, I am a bit stumped about how to speed things up (full logs):

2022-07-19T21:44:11.973Z P:MainProcess T:MainThread INFO:service.py:transform:All done aggregations!
         2150132 function calls (2148840 primitive calls) in 183.545 seconds

   Ordered by: internal time
   List reduced from 869 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2588  154.792    0.060  154.792    0.060 {method 'acquire' of '_thread.lock' objects}
        6    2.850    0.475    2.850    0.475 {method 'read' of '_io.BufferedReader' objects}
    18379    1.509    0.000    5.152    0.000 /Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/logging/__init__.py:283(__init__)
   245265    1.342    0.000    2.643    0.000 {built-in method builtins.isinstance}

It seems like the for loops over the files in each pass of translate are intrinsically serial, updating a stateful object in order.

I am not complaining, I just want to make sure I am not missing an opportunity to run this more efficiently.

With #194 working better (at least the tmp files are gone; I still need to assess the memory usage), I should be able to run a large number of processes at once using Dask; each one will just take a long time.
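That fan-out can be sketched with the standard library alone. This is a hypothetical illustration, not kerchunk code: `build_references` is a stand-in for a long-running `MultiZarrToZarr(...).translate()` call, and threads are used here since the real jobs appear to be IO-bound (hence the 3-10% CPU); with Dask you would submit the same per-group function to distributed workers instead.

```python
from concurrent.futures import ThreadPoolExecutor

def build_references(paths):
    # Hypothetical placeholder for one long-running kerchunk job,
    # e.g. MultiZarrToZarr(paths, concat_dims=["time"]).translate().
    # Here it just reports how many inputs it was handed.
    return len(paths)

def fan_out(groups, max_workers=4):
    # Run one independent MultiZarrToZarr-style job per group of files.
    # Each job is serial internally, but many can run side by side.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_references, groups))

if __name__ == "__main__":
    groups = [[f"file_{i}_{j}.json" for j in range(3)] for i in range(8)]
    print(fan_out(groups))  # one result (here, the group size) per group
```

With Dask the equivalent would be submitting `build_references` as delayed tasks, which also gives you a dashboard to confirm the jobs really are IO-bound.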

Is that the best way forward here for the foreseeable future?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

martindurant commented on Jul 20, 2022 (1 reaction)

If the CPU use is low, then the time is spent in IO. It might well be worthwhile to fs.cat all of the reference files in fss() rather than create them one by one in the list comprehension.
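The point of `fs.cat` here is that fsspec can fetch a whole list of paths in one call (returning a dict of path to bytes) and overlap the requests, instead of paying one round trip per file. A rough stdlib-only illustration of the same idea, with made-up local files standing in for remote JSON reference files:

```python
import json
import pathlib
import tempfile
from concurrent.futures import ThreadPoolExecutor

def cat_all(paths):
    # Fetch every file concurrently and return {path: bytes},
    # mimicking fsspec's fs.cat(list_of_paths). Overlapping the
    # requests matters most on high-latency stores like S3.
    with ThreadPoolExecutor() as pool:
        blobs = pool.map(lambda p: pathlib.Path(p).read_bytes(), paths)
        return dict(zip(paths, blobs))

# Demo with throwaway local files in place of remote reference sets.
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i in range(5):
        p = pathlib.Path(d) / f"ref_{i}.json"
        p.write_text(json.dumps({"id": i}))
        paths.append(str(p))
    refs = {p: json.loads(b) for p, b in cat_all(paths).items()}
    print(len(refs))  # 5
```

On local disk the gain is small; against object storage, batching the reads is usually the difference between latency-bound and bandwidth-bound.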

martindurant commented on Jul 20, 2022 (1 reaction)

Yes, you are quite right: much of what MultiZarrToZarr does could be parallel, or even daskified. The order of the references shouldn’t matter, or can be sorted at the end.

What you can also do, though, is a “tree” combine, where you run MultiZarrToZarr on batches of inputs, and then again on the outputs of the batches. This we have successfully done before to save a lot of time.
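The tree combine can be sketched generically: combine the inputs in batches, then combine the batch outputs, repeating until one result remains. In this sketch `merge_refs` is a toy stand-in; with kerchunk the combine step would be something like `MultiZarrToZarr(batch, ...).translate()` over each batch, and since the batches at each level are independent, they could be farmed out as Dask tasks.

```python
def tree_combine(items, combine, batch_size=4):
    # Repeatedly combine fixed-size batches until one item is left.
    # Each level's batches are independent of one another, so they
    # can run in parallel; only the tree depth is inherently serial.
    while len(items) > 1:
        items = [
            combine(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)
        ]
    return items[0]

# Toy combine step: merge reference-like dicts. With kerchunk this
# would instead run MultiZarrToZarr over the batch and translate it.
def merge_refs(batch):
    out = {}
    for refs in batch:
        out.update(refs)
    return out

refs = [{f"chunk/{i}": i} for i in range(10)]
combined = tree_combine(refs, merge_refs, batch_size=3)
print(len(combined))  # 10
```

With `batch_size=b` over `n` inputs the serial path shrinks from `n` combines to roughly `log_b(n)` levels, which is where the reported time savings come from.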
