How to run MultiZarrToZarr efficiently?
I have observed that I only get about 3-10% CPU use per Python process while running MultiZarrToZarr. Here is `top` while a job is running:
```
Tasks:  10 total,  1 running,  9 sleeping,  0 stopped,  0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   5943.5 total,   2256.4 free,   1163.2 used,   2524.0 buff/cache
MiB Swap:   1024.0 total,    785.8 free,    238.2 used.   4332.5 avail Mem

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU  %MEM   TIME+  COMMAND
10393 dstuebe  20   0  671772 145984 36432 S   2.7   2.4  0:05.08 python3
  340 dstuebe  20   0 7275124 676712 18412 S   0.3  11.1  3:13.20 java
10850 dstuebe  20   0    7188   3448  2896 R   0.3   0.1  0:00.02 top
    1 dstuebe  20   0    1020      4     0 S   0.0   0.0  0:01.45 docker-init
```
Looking over the MultiZarrToZarr code and watching its behavior in a profiler, I am a bit stumped about how to speed things up (Full logs):
```
2022-07-19T21:44:11.973Z P:MainProcess T:MainThread INFO:service.py:transform:All done aggregations!
         2150132 function calls (2148840 primitive calls) in 183.545 seconds

   Ordered by: internal time
   List reduced from 869 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2588  154.792    0.060  154.792    0.060 {method 'acquire' of '_thread.lock' objects}
        6    2.850    0.475    2.850    0.475 {method 'read' of '_io.BufferedReader' objects}
    18379    1.509    0.000    5.152    0.000 /Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/logging/__init__.py:283(__init__)
   245265    1.342    0.000    2.643    0.000 {built-in method builtins.isinstance}
```
It seems like the `for` loops over the files in each pass of `translate` are intrinsically serial, updating a stateful object in order. I am not complaining; I just want to make sure I am not missing an opportunity to run this more efficiently.
With #194 working better (at least the tmp files are gone; I still need to assess the memory usage), I should be able to run a large number of processes at once using dask, as sketched below; each one will just take a long time. Is that the best way forward here for the foreseeable future?
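For concreteness, here is a minimal sketch of that fan-out, assuming several independent aggregations; the `aggregations` mapping and `concat_dims` value are hypothetical:

```python
import dask
from kerchunk.combine import MultiZarrToZarr

@dask.delayed
def combine_one(name, ref_paths):
    """One serial MultiZarrToZarr run; many of these execute side by side."""
    mzz = MultiZarrToZarr(ref_paths, concat_dims=["time"])  # dims are illustrative
    return name, mzz.translate()

# aggregations: hypothetical {name: [paths to single-file reference JSONs]}
jobs = [combine_one(name, paths) for name, paths in aggregations.items()]
results = dict(dask.compute(*jobs))  # {name: combined reference dict}
```

Each job stays serial internally, but with enough jobs the workers at least stay busy.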
Top GitHub Comments
If the CPU use is low, then the time is spent in IO. It might well be worthwhile to `fs.cat` all of the reference files in `fss()` rather than create them one-by-one in the list comprehension.
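A minimal sketch of that idea from the caller's side, assuming the inputs are reference JSON files on S3; the bucket path and `concat_dims` are illustrative:

```python
import json

import fsspec
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3")
paths = fs.glob("my-bucket/refs/*.json")  # hypothetical location of the reference files

# Given a list of paths, fs.cat fetches them concurrently (a single async
# gather for s3fs) instead of one blocking round-trip per file.
raw = fs.cat(paths)
ref_dicts = [json.loads(raw[p]) for p in sorted(raw)]

# MultiZarrToZarr also accepts already-parsed reference dicts, so the JSON
# files are not re-opened one at a time during translate().
mzz = MultiZarrToZarr(
    ref_dicts,
    remote_protocol="s3",   # protocol of the underlying data files
    concat_dims=["time"],   # illustrative; depends on the dataset
)
combined = mzz.translate()
```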
Yes, you are quite right: much of what MultiZarrToZarr does could be parallel, or even daskified. The order of the references shouldn't matter, or can be sorted at the end.
What you can also do, though, is a “tree” combine, where you run MultiZarrToZarr on batches of inputs, and then again on the outputs of the batches. This we have successfully done before to save a lot of time.
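A minimal sketch of such a tree combine, assuming all of the single-file reference sets concatenate along one dimension; the batch size and `concat_dims` are illustrative:

```python
from kerchunk.combine import MultiZarrToZarr

def tree_combine(refs, batch_size=32):
    """Combine reference sets in batches, then combine the batch outputs,
    repeating until a single reference set remains."""
    while len(refs) > 1:
        batches = [refs[i:i + batch_size] for i in range(0, len(refs), batch_size)]
        # The batches within one level are independent, so this comprehension
        # could be mapped with dask.delayed or a process pool for parallelism.
        refs = [
            MultiZarrToZarr(batch, concat_dims=["time"]).translate()
            for batch in batches
        ]
    return refs[0]

combined = tree_combine(list_of_reference_dicts)  # hypothetical input list
```

Because translate() output is itself a valid MultiZarrToZarr input, each level shrinks the problem by the batch size, and only the final pass touches every reference.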