Possibility of including a feature for dealing with dhdl files from extended simulations

Dear alchemlyb developers,
First, I want to thank you all for your hard work in developing this very user-friendly package. Today I was using alchemlyb to analyze the dhdl files of a replica-exchange simulation. Since I was running long simulations, I extended the simulation of each replica several times. However, I found that this can cause two problems when parsing the GROMACS dhdl files.

Specifically, when parsing one of the files, I got the error shown below. This happened because the last line of the file was incomplete: the simulation was killed by a timeout mid-write, so the line ended with `-1.5258789e-` instead of `-1.5258789e-5`, raising a `ValueError` when the last string of the line was converted to a float with `dtype` specified as `np.float64`. (See line 265 in `_extract_dataframe`.)
```
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/alchemlyb/parsing/gmx.py", line 133, in extract_dHdl
    df = _extract_dataframe(xvg, headers)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/alchemlyb/parsing/gmx.py", line 267, in _extract_dataframe
    float_precision='high')
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/home/wehs7661/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1197, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: '-1.5258789e-'
```
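For the first problem, a minimal external workaround is to trim the truncated final data line before parsing. The helper below, `trim_incomplete_last_line`, is hypothetical (not part of alchemlyb's API); it assumes that every complete data line in a dhdl .xvg file has the same number of float fields:

```python
# Hypothetical helper (not part of alchemlyb): drop a truncated final data
# line from a GROMACS dhdl .xvg file before handing it to the parser.
# Assumes every complete data line has the same number of float fields.
def trim_incomplete_last_line(path, out_path):
    with open(path) as f:
        lines = f.readlines()
    # data lines are everything that is not an xvg comment/header line
    data = [ln for ln in lines if not ln.startswith(('#', '@'))]

    def is_complete(line, n_fields):
        fields = line.split()
        if len(fields) != n_fields:
            return False
        try:
            [float(x) for x in fields]  # e.g. '-1.5258789e-' fails here
        except ValueError:
            return False
        return True

    if len(data) >= 2 and not is_complete(data[-1], len(data[-2].split())):
        # the truncated line sits at the end of the file, so search backwards
        for i in range(len(lines) - 1, -1, -1):
            if lines[i] == data[-1]:
                del lines[i]
                break
    with open(out_path, 'w') as f:
        f.writelines(lines)
```

The cleaned file can then be passed to `extract_dHdl` or `extract_u_nk` as usual.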
In addition, the GROMACS parser currently does not seem able to deal with the overlapping time frames produced when a simulation is extended. Say the simulation of the first replica was killed by a timeout and the last time frame in `system_dhdl.xvg` was 1592 ps, but the corresponding `.cpt` file had only been updated up to 1562 ps, since the `.cpt` file is written only every 15 minutes. If we then run `gmx mdrun` with the `-cpi` option to extend the simulation, the dhdl file of the extended run, `system_dhdl.part0002.xvg`, will start from 1562 ps rather than 1592 ps. In this situation, when we use `dHdl_coul = pd.concat([extract_dHdl(xvg, T=300) for xvg in files['Coulomb']])` or `u_nk_coul = pd.concat([extract_u_nk(xvg, T=300) for xvg in files['Coulomb']])`, `extract_dHdl` and `extract_u_nk` are not able to discard the data for the overlapping time frames (1562 ps to 1592 ps) in `system_dhdl.xvg` and adopt the data for those frames from `system_dhdl.part0002.xvg`.
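For the overlap problem, an external workaround is to deduplicate the concatenated DataFrames on their index, keeping the row that comes from the later part file. The `concat_drop_overlap` function below is a hypothetical sketch, assuming the alchemlyb convention that the time is the first level of each frame's MultiIndex:

```python
import pandas as pd

# Hypothetical sketch (not alchemlyb API): concatenate per-part dHdl/u_nk
# DataFrames and discard rows whose full index (time, lambdas, ...) repeats.
# keep='last' retains the copy from the later part file, matching the
# behaviour of appending runs extended with gmx mdrun -cpi.
def concat_drop_overlap(frames):
    df = pd.concat(frames)
    return df[~df.index.duplicated(keep='last')]
```

With this, `concat_drop_overlap([extract_dHdl(xvg, T=300) for xvg in files['Coulomb']])` would replace the plain `pd.concat` call, on the assumption that the part files are passed in chronological order.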
While both problems can of course be solved externally with another Python script that modifies the dhdl files to discard the incomplete lines and the duplicated time frames, I'm wondering whether it would be worthwhile to address these issues inside alchemlyb instead. After all, this situation comes up a lot when users extend their simulations.
Thanks a lot in advance!
Issue Analytics
- Created: 3 years ago
- Comments: 8 (8 by maintainers)
We can add this as a preprocessor, yes. I quite like the philosophy of making these things easy for our data structures, which double as reference implementations for some pandas-fu.

(I tagged it "invalid" because we don't have a "wontfix" tag; it does not mean that it wasn't a valid question.)