Parallelize storage reading/writing
I'm planning to make some changes in the `Reporter` to split the two monolithic netcdf files into more manageable chunks. This is what I'm thinking now:

- Solute trajectory: one `xtc` file per replica.
- Checkpoint trajectory: one `xtc` file per replica.
- Thermodynamic states: YAML (for thermodynamic parameters) and XML (for standard `System`) files.
- MCMC moves: one YAML file per move.
- Metadata: one or many YAML files.
- Constructor options: one or many YAML files.
- Everything else (energies, mixing statistics, logZ, `any_numeric_array_of_variable_dimension`): a single netcdf file for all.
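To make the plan above concrete, here is a minimal sketch of how a reporter could map each kind of data to its own file. Everything here is hypothetical (the `MultiFileLayout` class name and the directory scheme are illustrations, not the actual openmmtools API):

```python
import os

class MultiFileLayout:
    """Hypothetical path scheme for a split-file reporter (illustration only)."""

    def __init__(self, root):
        self.root = root

    def solute_trajectory(self, replica):
        # One xtc file per replica for the solute-only trajectory.
        return os.path.join(self.root, "solute", f"replica_{replica:04d}.xtc")

    def checkpoint_trajectory(self, replica):
        # One xtc file per replica for the full checkpoint trajectory.
        return os.path.join(self.root, "checkpoint", f"replica_{replica:04d}.xtc")

    def thermodynamic_state(self, state):
        # YAML for the thermodynamic parameters plus XML for the standard System.
        return (os.path.join(self.root, "states", f"state_{state}.yaml"),
                os.path.join(self.root, "states", f"state_{state}_system.xml"))

    def mcmc_move(self, move_index):
        # One YAML file per MCMC move.
        return os.path.join(self.root, "moves", f"move_{move_index}.yaml")

    def analysis_arrays(self):
        # Energies, mixing statistics, logZ, etc. stay in one netcdf file.
        return os.path.join(self.root, "analysis.nc")

layout = MultiFileLayout("experiment_storage")
print(layout.solute_trajectory(3))  # experiment_storage/solute/replica_0003.xtc on POSIX
```

The `Reporter` would hide this whole tree behind its existing API, so callers never see the individual paths.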
I think splitting the data over multiple small files (whose directory structure is hidden by the `Reporter` class) means reading operations will be faster. Moreover, we'll be able to parallelize writing to disk, which is currently a big bottleneck for multi-replica methods.
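The parallel-writing idea can be sketched with the standard library alone: once each replica owns its own file, workers never contend for a file handle and every iteration's frames can be flushed concurrently. This is a toy model, not the real reporter; the byte payloads stand in for XTC frames, and the function names are made up for illustration:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_replica_frame(path, frame_bytes):
    """Append one frame's worth of data to a single replica's file.

    Stands in for an XTC write; because each replica owns its own file,
    no two workers ever touch the same file handle.
    """
    with open(path, "ab") as f:
        f.write(frame_bytes)
    return path

def write_iteration(storage_dir, frames_by_replica):
    """Write one iteration's frames for all replicas in parallel."""
    paths = [os.path.join(storage_dir, f"replica_{i:04d}.bin")
             for i in range(len(frames_by_replica))]
    with ThreadPoolExecutor() as pool:
        # One worker per replica file; writes proceed concurrently.
        list(pool.map(write_replica_frame, paths, frames_by_replica))
    return paths

storage = tempfile.mkdtemp()
# Two iterations, four replicas, fake 8-byte per-replica frame payloads.
for iteration in range(2):
    paths = write_iteration(storage, [bytes([iteration]) * 8 for _ in range(4)])
print([os.path.getsize(p) for p in paths])  # [16, 16, 16, 16] after two iterations
```

With a single shared netcdf file, by contrast, writers must serialize on the one file, which is the bottleneck described above.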
Question: should we keep the old reporter around (maybe renamed `NetCDFReporter`) so we can still read data generated with previous versions, or do we anticipate that installing a previous version of openmmtools/yank will suffice for our needs?
Issue Analytics
- Created: 4 years ago
- Reactions: 1
- Comments: 19 (18 by maintainers)
Top GitHub Comments
@jaimergp implemented the parallel XTC files and we have now merged them in a separate feature branch (`parallel-writing`). If I remember correctly, it's not in master right now because we didn't observe a substantial speedup over netcdf (see Jaime's timings in #434), although we still need to test it on real calculations and there is much margin for improvement. For example, I think we're essentially writing to disk twice now because the MDTraj XTC file does not support append. If you want to try the current state, let me know and I can update it with the new code from `master`.

For this, the bulk of the calculation is in imaging the trajectory, I believe. Having parallel xtc files means we'll have to penalize reading the trajectory along a state in favor of replica trajectories. The netcdf file instead allows blocking the file by frame, which means reading state or replica trajectories will be roughly equally expensive, so this issue may turn out to be quite complicated performance-wise.
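That read-pattern trade-off can be illustrated with a toy model (pure Python, hypothetical names): with per-replica files, a replica trajectory is one sequential read from one file, while following a single thermodynamic state has to pull one frame from a different replica's file at almost every iteration:

```python
def replica_trajectory(replica, n_iterations):
    """Frames to read for one replica: all from a single file, sequentially."""
    return [(replica, iteration) for iteration in range(n_iterations)]

def state_trajectory(state, state_indices):
    """Frames to read for one thermodynamic state.

    state_indices[iteration][replica] is the state each replica visited
    at that iteration; following a state hops across replica files.
    """
    frames = []
    for iteration, states in enumerate(state_indices):
        replica = states.index(state)  # which replica visited this state
        frames.append((replica, iteration))
    return frames

# Toy replica-exchange history: 3 replicas, 4 iterations.
state_indices = [
    [0, 1, 2],
    [1, 0, 2],
    [1, 2, 0],
    [0, 2, 1],
]
print(replica_trajectory(0, 4))            # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(state_trajectory(0, state_indices))  # [(0, 0), (1, 1), (2, 2), (0, 3)]
files_touched = {replica for replica, _ in state_trajectory(0, state_indices)}
print(len(files_touched))                  # 3 files touched for one state trajectory
```

With a single netcdf file chunked by frame, both access patterns read the same kind of blocks, which is the "roughly equally expensive" point made above.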
I totally agree about the current pain of using the single NetCDF file, and am hoping we can split both the checkpoint files and solute-only files into separate XTC or DCD files, leaving only the smaller numpy arrays in the NetCDF file, without too much pain.
Longer term, we would love to switch to some sort of distributed database that can handle multiple calculations streaming to it at once, but we haven’t started to design this yet.