Possibility of including a feature for dealing with dhdl files from extended simulations

Dear alchemlyb developers,
First, I want to thank you all for your hard work in developing this very user-friendly package. Today I was using alchemlyb to analyze the dhdl files of a replica-exchange simulation. Since I was running long simulations, I extended the simulation of each replica several times. However, I found that this can cause two problems when parsing the GROMACS dhdl files.

Specifically, when parsing one of the files, I got the error shown below. This happened because the last line of the file was incomplete: the simulation was killed by a timeout mid-write, so the line ended with `-1.5258789e-` instead of `-1.5258789e-5`, raising a `ValueError` when the last string of the line was converted to a float with `dtype` specified as `np.float64`. (See line 265 in `_extract_dataframe`.)
```
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/alchemlyb/parsing/gmx.py", line 133, in extract_dHdl
    df = _extract_dataframe(xvg, headers)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/alchemlyb/parsing/gmx.py", line 267, in _extract_dataframe
    float_precision='high')
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1197, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: '-1.5258789e-'
```
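For the first problem, a minimal external workaround is to trim the truncated final data line before parsing. The helper below, `trim_incomplete_last_line`, is hypothetical (not part of alchemlyb's API); it assumes that every complete data line in a dhdl .xvg file has the same number of float fields:

```python
# Hypothetical helper (not part of alchemlyb): drop a truncated final data
# line from a GROMACS dhdl .xvg file before handing it to the parser.
# Assumes every complete data line has the same number of float fields.
def trim_incomplete_last_line(path, out_path):
    with open(path) as f:
        lines = f.readlines()
    # data lines are everything that is not an xvg comment/header line
    data = [ln for ln in lines if not ln.startswith(('#', '@'))]

    def is_complete(line, n_fields):
        fields = line.split()
        if len(fields) != n_fields:
            return False
        try:
            [float(x) for x in fields]  # e.g. '-1.5258789e-' fails here
        except ValueError:
            return False
        return True

    if len(data) >= 2 and not is_complete(data[-1], len(data[-2].split())):
        # the truncated line sits at the end of the file, so search backwards
        for i in range(len(lines) - 1, -1, -1):
            if lines[i] == data[-1]:
                del lines[i]
                break
    with open(out_path, 'w') as f:
        f.writelines(lines)
```

The cleaned file can then be passed to `extract_dHdl` or `extract_u_nk` as usual.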
In addition, the GROMACS parser currently does not seem able to deal with the overlapping time frames produced when a simulation is extended. Say the simulation of the first replica was killed by a timeout and the last time frame in `system_dhdl.xvg` was 1592 ps, but the corresponding `.cpt` file had only been updated up to 1562 ps, since the `.cpt` file is written only every 15 minutes. If we then run `gmx mdrun` with the `-cpi` option to extend the simulation, the dhdl file of the extended run, `system_dhdl.part0002.xvg`, will start from 1562 ps rather than 1592 ps. In this situation, when we use `dHdl_coul = pd.concat([extract_dHdl(xvg, T=300) for xvg in files['Coulomb']])` or `u_nk_coul = pd.concat([extract_u_nk(xvg, T=300) for xvg in files['Coulomb']])`, `extract_dHdl` and `extract_u_nk` are not able to discard the data for the overlapping time frames (1562 ps to 1592 ps) in `system_dhdl.xvg` and adopt the data for those frames from `system_dhdl.part0002.xvg`.
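For the overlap problem, an external workaround is to deduplicate the concatenated DataFrames on their index, keeping the row that comes from the later part file. The `concat_drop_overlap` function below is a hypothetical sketch, assuming the alchemlyb convention that the time is the first level of each frame's MultiIndex:

```python
import pandas as pd

# Hypothetical sketch (not alchemlyb API): concatenate per-part dHdl/u_nk
# DataFrames and discard rows whose full index (time, lambdas, ...) repeats.
# keep='last' retains the copy from the later part file, matching the
# behaviour of appending runs extended with gmx mdrun -cpi.
def concat_drop_overlap(frames):
    df = pd.concat(frames)
    return df[~df.index.duplicated(keep='last')]
```

With this, `concat_drop_overlap([extract_dHdl(xvg, T=300) for xvg in files['Coulomb']])` would replace the plain `pd.concat` call, on the assumption that the part files are passed in chronological order.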
While both problems can of course be solved externally with another Python script that modifies the dhdl files to discard the incomplete lines and the duplicated time frames, I'm wondering whether it would be worthwhile to address these issues inside alchemlyb instead. After all, this situation comes up a lot when users extend their simulations.
Thanks a lot in advance!
Issue Analytics
- Created: 3 years ago
- Comments: 8 (8 by maintainers)
We can add this as a preprocessor, yes. I quite like the philosophy of making these things easy for our data structures, which double as reference implementations for some pandas-fu.

(I tagged it "invalid" because we don't have a "wontfix" tag; it does not mean that it wasn't a valid question.)