Help running on two nodes: YANK stops when log files are not found.
I am trying to run the tutorial examples/binding/t4-lysozyme
on two nodes with two GPUs each. I am using YANK 0.20.1 with the mpi4py package from conda.
With the following hostfile and configfile:

hostfile:
node01
node01
node02
node02

configfile:
-np 1 -env CUDA_VISIBLE_DEVICES 0 yank script --yaml=p-xylene-implicit.yaml
-np 1 -env CUDA_VISIBLE_DEVICES 1 yank script --yaml=p-xylene-implicit.yaml
-np 1 -env CUDA_VISIBLE_DEVICES 0 yank script --yaml=p-xylene-implicit.yaml
-np 1 -env CUDA_VISIBLE_DEVICES 1 yank script --yaml=p-xylene-implicit.yaml

And running mpirun directly (without SLURM):
mpirun -f hostfile -configfile configfile
I get the following error:
2018-03-06 22:11:02,233: Setting CUDA platform to use precision model 'mixed'.
2018-03-06 22:11:02,421: Node 1/4: executing <function ExperimentBuilder._check_resume at 0x2aab07c962f0>
2018-03-06 22:11:02,422: Node 1/4: waiting for barrier after <function ExperimentBuilder._check_resume at 0x2aab07c962f0>
2018-03-06 22:11:02,466: Group 1/4 Node 1/1: execute _setup_molecules(p-xylene)
2018-03-06 22:11:03,449: Fixing net charge from -2.000000000015878e-06 to 4.163336342344337e-17
2018-03-06 22:11:03,462: Node 1/4: waiting for barrier after _setup_molecules
2018-03-06 22:11:03,466: Group 1/4 Node 1/1: execute get_system(t4-xylene)
2018-03-06 22:11:03,469: Setting up the systems for t4-lysozyme, p-xylene and GBSA
2018-03-06 22:11:03,469: Setting up solvent phase
2018-03-06 22:11:04,220: Setting up complex phase
2018-03-06 22:11:05,534: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:11:05,534: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:11:05,535: Node 1/4: waiting for barrier after get_system
2018-03-06 22:11:05,537: Node 1/4: executing <function ExperimentBuilder._safe_makedirs at 0x2aab07c968c8>
2018-03-06 22:11:05,540: Node 1/4: waiting for barrier after <function ExperimentBuilder._safe_makedirs at 0x2aab07c968c8>
2018-03-06 22:11:05,541: Node 1/4: executing <function ExperimentBuilder._generate_yaml at 0x2aab07c96730>
2018-03-06 22:11:05,567: Node 1/4: waiting for barrier after <function ExperimentBuilder._generate_yaml at 0x2aab07c96730>
2018-03-06 22:11:05,567: Node 1/4: waiting for barrier after _generate_experiment_protocol
2018-03-06 22:11:05,579: DSL string for the ligand: "resname MOL"
2018-03-06 22:11:05,579: DSL string for the solvent: "auto"
2018-03-06 22:11:05,582: Reading phase complex
2018-03-06 22:11:05,582: prmtop: p-xylene-implicit-output/setup/systems/t4-xylene/complex.prmtop
2018-03-06 22:11:05,582: inpcrd: p-xylene-implicit-output/setup/systems/t4-xylene/complex.inpcrd
Traceback (most recent call last):
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/bin/yank", line 11, in <module>
load_entry_point('yank==0.20.1', 'console_scripts', 'yank')()
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/cli.py", line 72, in main
dispatched = getattr(commands, command).dispatch(command_args)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/commands/script.py", line 114, in dispatch
yaml_builder.run_experiments()
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 811, in run_experiments
completed[exp_index] = self._run_experiment(exp)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2900, in _run_experiment
built_experiment = self._build_experiment(experiment_path, experiment)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2669, in _build_experiment
utils.config_root_logger(self._options['verbose'], experiment_log_file_path)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/utils.py", line 161, in config_root_logger
file_handler = logging.FileHandler(log_file_path)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1030, in __init__
StreamHandler.__init__(self, self._open())
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1059, in _open
return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/NFS/home/diego/Projects/YANK/tutorials/examples/binding/t4-lysozyme/p-xylene-implicit-output/experiments/experiments_2.log'
2018-03-06 22:11:05,585: ERROR - yank.mpi - MPI node 3/4 raised exception.
NoneType: None
2018-03-06 22:11:05,585: CRITICAL - yank.mpi - MPI node 3/4 called Abort()!
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
Traceback (most recent call last):
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/bin/yank", line 11, in <module>
load_entry_point('yank==0.20.1', 'console_scripts', 'yank')()
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/cli.py", line 72, in main
dispatched = getattr(commands, command).dispatch(command_args)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/commands/script.py", line 114, in dispatch
yaml_builder.run_experiments()
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 811, in run_experiments
completed[exp_index] = self._run_experiment(exp)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2900, in _run_experiment
built_experiment = self._build_experiment(experiment_path, experiment)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2669, in _build_experiment
utils.config_root_logger(self._options['verbose'], experiment_log_file_path)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/utils.py", line 161, in config_root_logger
file_handler = logging.FileHandler(log_file_path)
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1030, in __init__
StreamHandler.__init__(self, self._open())
File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1059, in _open
return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/NFS/home/diego/Projects/YANK/tutorials/examples/binding/t4-lysozyme/p-xylene-implicit-output/experiments/experiments_3.log'
2018-03-06 22:11:05,585: ERROR - yank.mpi - MPI node 4/4 raised exception.
NoneType: None
2018-03-06 22:11:05,585: CRITICAL - yank.mpi - MPI node 4/4 called Abort()!
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
Neither experiments_2.log nor experiments_3.log was ever created. In p-xylene-implicit-output/experiments/experiments only experiments_1.log and experiments.log are found. It looks as if experiments.log was created three times: experiments.log and experiments_1.log by the two processes on node01, and experiments.log twice more by the two processes on node02.
I thought that n_jobs and job_ids in the config file might solve it, but then YANK hangs at the point shown below until I kill it:
2018-03-06 22:38:07,517: Setting CUDA platform to use precision model 'mixed'.
2018-03-06 22:38:07,697: Setting CUDA platform to use precision model 'mixed'.
2018-03-06 22:38:07,806: Node 1/4: executing <function ExperimentBuilder._check_resume at 0x2aab07c932f0>
2018-03-06 22:38:07,809: Node 1/4: waiting for barrier after <function ExperimentBuilder._check_resume at 0x2aab07c932f0>
2018-03-06 22:38:07,863: Group 1/4 Node 1/1: execute _setup_molecules(p-xylene)
2018-03-06 22:38:08,621: Fixing net charge from -2.000000000015878e-06 to 4.163336342344337e-17
2018-03-06 22:38:08,633: Node 1/4: waiting for barrier after _setup_molecules
2018-03-06 22:38:08,640: Group 1/4 Node 1/1: execute get_system(t4-xylene)
2018-03-06 22:38:08,642: Setting up the systems for t4-lysozyme, p-xylene and GBSA
2018-03-06 22:38:08,642: Setting up solvent phase
2018-03-06 22:38:08,984: Setting up complex phase
2018-03-06 22:38:10,234: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:38:10,235: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:38:10,236: Node 1/4: waiting for barrier after get_system
2018-03-06 22:38:10,237: Node 1/4: executing <function ExperimentBuilder._safe_makedirs at 0x2aab07c938c8>
2018-03-06 22:38:10,241: Node 1/4: waiting for barrier after <function ExperimentBuilder._safe_makedirs at 0x2aab07c938c8>
What am I doing wrong?
Any help would be much appreciated. Thanks a lot.
Top GitHub Comments
This is expected. The function config_root_logger “appends” the MPI process rank to the file name internally, but it expects the file name without the index number to be passed as an argument.

This is consistent with some sort of coordination problem between nodes rather than between MPI processes. Running YANK on multiple nodes with MPI works for me on our cluster, so this is likely a configuration problem rather than a bug. I’d try these:
1. Check that the mpi4py package installed with conda comes from the conda-forge channel.
2. Make sure that p-xylene-implicit-output/experiments/ exists before running YANK.
If 2. works, it may be that something weird happens when mpi.barrier() is called to sync the processes after the output directory is created by the MPI process with rank 0.

Glad it’s working! I’d still be careful and double-check that everything is running as expected. If this is a problem with MPI configuration, it’s possible there will be more problems later.
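
For readers who want to reproduce the failure mode discussed above, here is a minimal sketch of the two ideas in the comments: per-rank log file names, and a rank waiting until the output directory created by rank 0 is visible before opening its log. It is not YANK's actual code. The helper names (rank_log_path, wait_for_directory, open_rank_log) are hypothetical, and the naming rule is only inferred from the file names reported in this issue (experiments.log for rank 0, experiments_N.log for rank N).

```python
import logging
import os
import time


def rank_log_path(log_file_path, rank):
    """Return the per-rank log file name (hypothetical helper).

    Naming inferred from this issue: rank 0 keeps the base name
    (experiments.log), rank N > 0 gets a suffix (experiments_N.log).
    """
    if rank == 0:
        return log_file_path
    base, ext = os.path.splitext(log_file_path)
    return "{}_{}{}".format(base, rank, ext)


def wait_for_directory(path, timeout=30.0, poll_interval=0.5):
    """Poll until `path` is visible as a directory or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while not os.path.isdir(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def open_rank_log(log_file_path, rank, timeout=30.0):
    """Open a logging.FileHandler for this rank once its directory is visible."""
    path = rank_log_path(log_file_path, rank)
    directory = os.path.dirname(path) or "."
    if rank == 0:
        # Assumption: only rank 0 creates the output directory.
        os.makedirs(directory, exist_ok=True)
    elif not wait_for_directory(directory, timeout):
        # Other ranks fail with a clear message instead of a bare open() error.
        raise FileNotFoundError("directory not visible on this node: " + directory)
    return logging.FileHandler(path)
```

In an MPI run the rank would typically come from mpi4py's MPI.COMM_WORLD.Get_rank(). If the directory never becomes visible on the second node even after a generous timeout, that points to a shared-filesystem or MPI launch problem rather than to the logging code.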