
Parallelize storage reading/writing

See original GitHub issue

I’m planning to make some changes in the Reporter to split the two monolithic NetCDF files into more manageable chunks. This is what I’m thinking now:

  • Solute trajectory: one XTC file per replica.
  • Checkpoint trajectory: one XTC file per replica.
  • Thermodynamic states: YAML files (for thermodynamic parameters) and XML files (for the standard System).
  • MCMC moves: one YAML file per move.
  • Metadata: one or more YAML files.
  • Constructor options: one or more YAML files.
  • Everything else (energies, mixing statistics, logZ, any numeric array of variable dimension): a single NetCDF file for all.

I think splitting the data over multiple small files (with the directory structure hidden behind the Reporter class) will make read operations faster. Moreover, we’ll be able to parallelize writes to disk, which is currently a big bottleneck for multi-replica methods.
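As a rough illustration of the proposed layout (hypothetical file names, and JSON as a dependency-free stand-in for the XTC format), one file per replica means the Reporter can dispatch all the writes concurrently instead of funneling them through a single shared file:

```python
import concurrent.futures
import json
import pathlib
import tempfile

def write_replica(root, replica_id, frames):
    # Stand-in for an XTC writer: one trajectory file per replica,
    # so there is no shared file to serialize the writes on.
    path = root / f"replica_{replica_id}.json"
    path.write_text(json.dumps(frames))
    return path

def write_checkpoint(root, frames_by_replica):
    # Dispatch all per-replica writes at once; with a monolithic
    # NetCDF file these would have to go through a single writer.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(write_replica, root, rid, frames)
            for rid, frames in frames_by_replica.items()
        ]
        return [f.result() for f in futures]

root = pathlib.Path(tempfile.mkdtemp())
paths = write_checkpoint(root, {0: [[0.0, 0.1]], 1: [[1.0, 1.1]]})
```

This is only a sketch of the concurrency pattern, not the actual Reporter API; the real implementation would also need the YAML/XML side files described above.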

Question: should we keep the old reporter around (perhaps renamed NetCDFReporter) to allow reading data generated with previous versions, or do we anticipate that installing a previous version of openmmtools/yank will suffice for our needs?

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 19 (18 by maintainers)

Top GitHub Comments

1 reaction
andrrizzi commented, Oct 31, 2019

@jaimergp implemented the parallel XTC files and we have now merged them into a separate feature branch (parallel-writing). If I remember correctly, it’s not in master right now because we didn’t observe a substantial speedup over NetCDF (see Jaime’s timings in #434), although we still need to test it on real calculations and there is a lot of room for improvement. For example, I think we’re essentially writing to disk twice right now because MDTraj’s XTC writer does not support appending.
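One possible way around the no-append limitation (an assumption on my part, not necessarily what the parallel-writing branch does) is to write each checkpoint interval to its own chunk file, so previously written data is never rewritten, and stitch the chunks together only on read. A minimal sketch with plain text files standing in for XTC chunks:

```python
import glob
import os
import tempfile

class ChunkedWriter:
    """Sketch of a no-append workaround: every flush goes to a fresh
    chunk file, so nothing already on disk is ever rewritten."""

    def __init__(self, directory, replica_id):
        self.directory = directory
        self.replica_id = replica_id
        self.chunk = 0

    def write_chunk(self, frames):
        # One file per checkpoint interval; zero-padded index so a
        # lexicographic sort recovers the frame order.
        path = os.path.join(
            self.directory, f"replica{self.replica_id}_{self.chunk:06d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(frames))
        self.chunk += 1

    def read_all(self):
        # Stitch the chunks back together only when the full
        # trajectory is actually requested.
        frames = []
        pattern = os.path.join(self.directory, f"replica{self.replica_id}_*.txt")
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                frames.extend(f.read().splitlines())
        return frames

tmp = tempfile.mkdtemp()
writer = ChunkedWriter(tmp, replica_id=0)
writer.write_chunk(["frame0", "frame1"])
writer.write_chunk(["frame2"])
```

The cost is more files and a concatenation step on read, but each write is O(new frames) instead of O(whole trajectory).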

If you want to try the current state, let me know and I can update it with the new code from master.

Extracting trajectory information from the storage file takes hours.

For this, the bulk of the cost is in imaging the trajectory, I believe. Having parallel XTC files means we’d be penalizing reads of a trajectory along a thermodynamic state in favor of replica trajectories. The NetCDF file, by contrast, lets us chunk the file by frame, which makes reading state and replica trajectories roughly equally expensive, so this issue may turn out to be quite complicated performance-wise.
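To make that tradeoff concrete, here is a minimal sketch (hypothetical names; the real Reporter layout differs) of why a state trajectory is awkward to extract from per-replica files: in replica exchange, the replica occupying a given thermodynamic state changes from iteration to iteration, so a state trajectory interleaves frames from many files.

```python
def state_trajectory(frames_by_replica, state_per_replica_per_iter, state):
    """Collect the frames visited by one thermodynamic state.

    frames_by_replica[r][it] is the frame replica r wrote at iteration it
    (in a per-replica layout, each r is a separate file on disk).
    state_per_replica_per_iter[it][r] is the state replica r held then.
    """
    traj = []
    for it, states in enumerate(state_per_replica_per_iter):
        replica = states.index(state)  # which replica held `state` at it
        traj.append(frames_by_replica[replica][it])
    return traj

# Two replicas that swap states at iteration 1: the state-0 trajectory
# has to touch both replica files.
frames = {0: ["r0f0", "r0f1"], 1: ["r1f0", "r1f1"]}
permutations = [[0, 1], [1, 0]]  # iteration 0: identity; iteration 1: swapped
# state_trajectory(frames, permutations, 0) → ["r0f0", "r1f1"]
```

With per-replica files every state trajectory can touch every file, whereas a frame-chunked NetCDF layout serves both access patterns from contiguous blocks.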

1 reaction
jchodera commented, Oct 30, 2019

I totally agree about the current pain of using the single NetCDF file, and am hoping we can split both the checkpoint files and solute-only files into separate XTC or DCD files, leaving only the smaller numpy arrays in the NetCDF file, without too much pain.

Longer term, we would love to switch to some sort of distributed database that can handle multiple calculations streaming to it at once, but we haven’t started to design this yet.
