Parallelize storage reading/writing
I'm planning to make some changes in the `Reporter` to split the two monolithic netcdf files into more manageable chunks. This is what I'm thinking now:

- Solute trajectory: one `xtc` file per replica.
- Checkpoint trajectory: one `xtc` file per replica.
- Thermodynamic states: YAML (for thermodynamic parameters) and XML (for standard `System`) files.
- MCMC moves: one YAML file per move.
- Metadata: one or many YAML files.
- Constructor options: one or many YAML files.
- Everything else (energies, mixing statistics, logZ, `any_numeric_array_of_variable_dimension`): a single netcdf file for all.
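To make the plan above concrete, here is a minimal sketch of how a reporter could map each kind of data to its own file. Everything here is hypothetical (the `MultiFileLayout` class name and the directory scheme are illustrations, not the actual openmmtools API):

```python
import os

class MultiFileLayout:
    """Hypothetical path scheme for a split-file reporter (illustration only)."""

    def __init__(self, root):
        self.root = root

    def solute_trajectory(self, replica):
        # One xtc file per replica for the solute-only trajectory.
        return os.path.join(self.root, "solute", f"replica_{replica:04d}.xtc")

    def checkpoint_trajectory(self, replica):
        # One xtc file per replica for the full checkpoint trajectory.
        return os.path.join(self.root, "checkpoint", f"replica_{replica:04d}.xtc")

    def thermodynamic_state(self, state):
        # YAML for the thermodynamic parameters plus XML for the standard System.
        return (os.path.join(self.root, "states", f"state_{state}.yaml"),
                os.path.join(self.root, "states", f"state_{state}_system.xml"))

    def mcmc_move(self, move_index):
        # One YAML file per MCMC move.
        return os.path.join(self.root, "moves", f"move_{move_index}.yaml")

    def analysis_arrays(self):
        # Energies, mixing statistics, logZ, etc. stay in one netcdf file.
        return os.path.join(self.root, "analysis.nc")

layout = MultiFileLayout("experiment_storage")
print(layout.solute_trajectory(3))  # experiment_storage/solute/replica_0003.xtc on POSIX
```

The `Reporter` would hide this whole tree behind its existing API, so callers never see the individual paths.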
I think splitting the data over multiple small files (whose directory structure is hidden by the `Reporter` class) means reading operations will be faster. Moreover, we'll be able to parallelize writing to disk, which is currently a big bottleneck for multi-replica methods.
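The parallel-writing idea can be sketched with the standard library alone: once each replica owns its own file, workers never contend for a file handle and every iteration's frames can be flushed concurrently. This is a toy model, not the real reporter; the byte payloads stand in for XTC frames, and the function names are made up for illustration:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_replica_frame(path, frame_bytes):
    """Append one frame's worth of data to a single replica's file.

    Stands in for an XTC write; because each replica owns its own file,
    no two workers ever touch the same file handle.
    """
    with open(path, "ab") as f:
        f.write(frame_bytes)
    return path

def write_iteration(storage_dir, frames_by_replica):
    """Write one iteration's frames for all replicas in parallel."""
    paths = [os.path.join(storage_dir, f"replica_{i:04d}.bin")
             for i in range(len(frames_by_replica))]
    with ThreadPoolExecutor() as pool:
        # One worker per replica file; writes proceed concurrently.
        list(pool.map(write_replica_frame, paths, frames_by_replica))
    return paths

storage = tempfile.mkdtemp()
# Two iterations, four replicas, fake 8-byte per-replica frame payloads.
for iteration in range(2):
    paths = write_iteration(storage, [bytes([iteration]) * 8 for _ in range(4)])
print([os.path.getsize(p) for p in paths])  # [16, 16, 16, 16] after two iterations
```

With a single shared netcdf file, by contrast, writers must serialize on the one file, which is the bottleneck described above.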
Question: should we keep the old reporter around (maybe renamed `NetCDFReporter`) so we can still read data generated with previous versions, or do we anticipate that installing a previous version of openmmtools/yank will suffice for our needs?
Issue Analytics
- Created: 4 years ago
- Reactions: 1
- Comments: 19 (18 by maintainers)
Top GitHub Comments
@jaimergp implemented the parallel XTC files and we have now merged them in a separate feature branch (`parallel-writing`). If I remember correctly, it's not in master right now because we didn't observe a substantial speedup over netcdf (see Jaime's timings in #434), although we still need to test it on real calculations and there is much margin for improvement. For example, I think we're essentially writing to disk twice now because the MDTraj XTC file does not support append. If you want to try the current state, let me know and I can update it with the new code from `master`.

For this, the bulk of the calculation is in imaging the trajectory, I believe. Having parallel xtc files means we'll have to penalize reading the trajectory along a state in favor of replica trajectories. The netcdf file instead allows blocking the file by frame, which means reading state or replica trajectories will be roughly equally expensive, so this issue may turn out to be quite complicated performance-wise.
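That read-pattern trade-off can be illustrated with a toy model (pure Python, hypothetical names): with per-replica files, a replica trajectory is one sequential read from one file, while following a single thermodynamic state has to pull one frame from a different replica's file at almost every iteration:

```python
def replica_trajectory(replica, n_iterations):
    """Frames to read for one replica: all from a single file, sequentially."""
    return [(replica, iteration) for iteration in range(n_iterations)]

def state_trajectory(state, state_indices):
    """Frames to read for one thermodynamic state.

    state_indices[iteration][replica] is the state each replica visited
    at that iteration; following a state hops across replica files.
    """
    frames = []
    for iteration, states in enumerate(state_indices):
        replica = states.index(state)  # which replica visited this state
        frames.append((replica, iteration))
    return frames

# Toy replica-exchange history: 3 replicas, 4 iterations.
state_indices = [
    [0, 1, 2],
    [1, 0, 2],
    [1, 2, 0],
    [0, 2, 1],
]
print(replica_trajectory(0, 4))            # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(state_trajectory(0, state_indices))  # [(0, 0), (1, 1), (2, 2), (0, 3)]
files_touched = {replica for replica, _ in state_trajectory(0, state_indices)}
print(len(files_touched))                  # 3 files touched for one state trajectory
```

With a single netcdf file chunked by frame, both access patterns read the same kind of blocks, which is the "roughly equally expensive" point made above.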
I totally agree about the current pain of using the single NetCDF file, and am hoping we can split both the checkpoint files and solute-only files into separate XTC or DCD files, leaving only the smaller numpy arrays in the NetCDF file, without too much pain.
Longer term, we would love to switch to some sort of distributed database that can handle multiple calculations streaming to it at once, but we haven’t started to design this yet.