Help to run in two nodes: Yank stops when log files are not found.

I am trying to run the tutorial examples/binding/t4-lysozyme on two nodes with two GPUs each. I am using YANK 0.20.1 with the mpi4py package from conda.

With the following hostfile and configfile:

hostfile:

node01
node01
node02
node02

configfile:

-np 1 -env CUDA_VISIBLE_DEVICES 0 yank script --yaml=p-xylene-implicit.yaml
-np 1 -env CUDA_VISIBLE_DEVICES 1 yank script --yaml=p-xylene-implicit.yaml
-np 1 -env CUDA_VISIBLE_DEVICES 0 yank script --yaml=p-xylene-implicit.yaml
-np 1 -env CUDA_VISIBLE_DEVICES 1 yank script --yaml=p-xylene-implicit.yaml

And running mpirun directly (without SLURM):

mpirun -f hostfile -configfile configfile

I get the following error:

2018-03-06 22:11:02,233: Setting CUDA platform to use precision model 'mixed'.
2018-03-06 22:11:02,421: Node 1/4: executing <function ExperimentBuilder._check_resume at 0x2aab07c962f0>
2018-03-06 22:11:02,422: Node 1/4: waiting for barrier after <function ExperimentBuilder._check_resume at 0x2aab07c962f0>
2018-03-06 22:11:02,466: Group 1/4 Node 1/1: execute _setup_molecules(p-xylene)
2018-03-06 22:11:03,449: Fixing net charge from -2.000000000015878e-06 to 4.163336342344337e-17
2018-03-06 22:11:03,462: Node 1/4: waiting for barrier after _setup_molecules
2018-03-06 22:11:03,466: Group 1/4 Node 1/1: execute get_system(t4-xylene)
2018-03-06 22:11:03,469: Setting up the systems for t4-lysozyme, p-xylene and GBSA
2018-03-06 22:11:03,469: Setting up solvent phase
2018-03-06 22:11:04,220: Setting up complex phase
2018-03-06 22:11:05,534: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:11:05,534: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:11:05,535: Node 1/4: waiting for barrier after get_system
2018-03-06 22:11:05,537: Node 1/4: executing <function ExperimentBuilder._safe_makedirs at 0x2aab07c968c8>
2018-03-06 22:11:05,540: Node 1/4: waiting for barrier after <function ExperimentBuilder._safe_makedirs at 0x2aab07c968c8>
2018-03-06 22:11:05,541: Node 1/4: executing <function ExperimentBuilder._generate_yaml at 0x2aab07c96730>
2018-03-06 22:11:05,567: Node 1/4: waiting for barrier after <function ExperimentBuilder._generate_yaml at 0x2aab07c96730>
2018-03-06 22:11:05,567: Node 1/4: waiting for barrier after _generate_experiment_protocol
2018-03-06 22:11:05,579: DSL string for the ligand: "resname MOL"
2018-03-06 22:11:05,579: DSL string for the solvent: "auto"
2018-03-06 22:11:05,582: Reading phase complex
2018-03-06 22:11:05,582: prmtop: p-xylene-implicit-output/setup/systems/t4-xylene/complex.prmtop
2018-03-06 22:11:05,582: inpcrd: p-xylene-implicit-output/setup/systems/t4-xylene/complex.inpcrd
Traceback (most recent call last):
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/bin/yank", line 11, in <module>
    load_entry_point('yank==0.20.1', 'console_scripts', 'yank')()
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/cli.py", line 72, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/commands/script.py", line 114, in dispatch
    yaml_builder.run_experiments()
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 811, in run_experiments
    completed[exp_index] = self._run_experiment(exp)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2900, in _run_experiment
    built_experiment = self._build_experiment(experiment_path, experiment)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2669, in _build_experiment
    utils.config_root_logger(self._options['verbose'], experiment_log_file_path)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/utils.py", line 161, in config_root_logger
    file_handler = logging.FileHandler(log_file_path)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1030, in __init__
    StreamHandler.__init__(self, self._open())
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1059, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/NFS/home/diego/Projects/YANK/tutorials/examples/binding/t4-lysozyme/p-xylene-implicit-output/experiments/experiments_2.log'
2018-03-06 22:11:05,585: ERROR - yank.mpi - MPI node 3/4 raised exception.
NoneType: None
2018-03-06 22:11:05,585: CRITICAL - yank.mpi - MPI node 3/4 called Abort()!
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
Traceback (most recent call last):
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/bin/yank", line 11, in <module>
    load_entry_point('yank==0.20.1', 'console_scripts', 'yank')()
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/cli.py", line 72, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/commands/script.py", line 114, in dispatch
    yaml_builder.run_experiments()
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 811, in run_experiments
    completed[exp_index] = self._run_experiment(exp)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2900, in _run_experiment
    built_experiment = self._build_experiment(experiment_path, experiment)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/experiment.py", line 2669, in _build_experiment
    utils.config_root_logger(self._options['verbose'], experiment_log_file_path)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/site-packages/yank/utils.py", line 161, in config_root_logger
    file_handler = logging.FileHandler(log_file_path)
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1030, in __init__
    StreamHandler.__init__(self, self._open())
  File "/opt/apps/conda/intel-2018.1.163_miniconda/envs/yank/lib/python3.6/logging/__init__.py", line 1059, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/NFS/home/diego/Projects/YANK/tutorials/examples/binding/t4-lysozyme/p-xylene-implicit-output/experiments/experiments_3.log'
2018-03-06 22:11:05,585: ERROR - yank.mpi - MPI node 4/4 raised exception.
NoneType: None
2018-03-06 22:11:05,585: CRITICAL - yank.mpi - MPI node 4/4 called Abort()!
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3

Neither experiments_2.log nor experiments_3.log was ever created. In p-xylene-implicit-output/experiments/experiments only experiments_1.log and experiments.log are found. It looks as if experiments.log was created three times: experiments.log and experiments_1.log by the two processes on node01, and experiments.log again by each of the two processes on node02.

I thought setting n_jobs and job_ids in the config file might solve it, but then YANK hangs after the following output (until I kill it):

2018-03-06 22:38:07,517: Setting CUDA platform to use precision model 'mixed'.
2018-03-06 22:38:07,697: Setting CUDA platform to use precision model 'mixed'.
2018-03-06 22:38:07,806: Node 1/4: executing <function ExperimentBuilder._check_resume at 0x2aab07c932f0>
2018-03-06 22:38:07,809: Node 1/4: waiting for barrier after <function ExperimentBuilder._check_resume at 0x2aab07c932f0>
2018-03-06 22:38:07,863: Group 1/4 Node 1/1: execute _setup_molecules(p-xylene)
2018-03-06 22:38:08,621: Fixing net charge from -2.000000000015878e-06 to 4.163336342344337e-17
2018-03-06 22:38:08,633: Node 1/4: waiting for barrier after _setup_molecules
2018-03-06 22:38:08,640: Group 1/4 Node 1/1: execute get_system(t4-xylene)
2018-03-06 22:38:08,642: Setting up the systems for t4-lysozyme, p-xylene and GBSA
2018-03-06 22:38:08,642: Setting up solvent phase
2018-03-06 22:38:08,984: Setting up complex phase
2018-03-06 22:38:10,234: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:38:10,235: WARNING - yank.pipeline - TLeap: The unperturbed charge of the unit: 8.000000 is not zero.
2018-03-06 22:38:10,236: Node 1/4: waiting for barrier after get_system
2018-03-06 22:38:10,237: Node 1/4: executing <function ExperimentBuilder._safe_makedirs at 0x2aab07c938c8>
2018-03-06 22:38:10,241: Node 1/4: waiting for barrier after <function ExperimentBuilder._safe_makedirs at 0x2aab07c938c8>

What am I doing wrong?

A bit of help would be much appreciated. Thanks a lot.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
andrrizzi commented, Mar 7, 2018

> The message is printed out twice with the name (without index-number): experiments.log.

This is expected. The function config_root_logger “appends” the MPI process rank to the file name internally, but it expects the file name without index number to be passed as an argument.
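
The naming scheme described here can be illustrated with a minimal sketch. This is not YANK's actual implementation: `rank_log_path` is a hypothetical helper, and the rule it applies (rank 0 keeps the base name, rank N gets an `_N` suffix before the extension) is only inferred from the file names that appear in the report above.

```python
import os


def rank_log_path(log_file_path, rank):
    """Return a per-process log path (hypothetical helper, not YANK's code).

    Rank 0 keeps the name unchanged; rank N > 0 gets "_N" inserted before
    the extension, matching the file names seen in the report.
    """
    if rank == 0:
        return log_file_path
    base, ext = os.path.splitext(log_file_path)
    return "{}_{}{}".format(base, rank, ext)


# Paths that four MPI processes (ranks 0 to 3) would try to open.
paths = [rank_log_path("experiments/experiments.log", r) for r in range(4)]
print(paths)
```

Under this assumption, the caller always passes the un-indexed name, which is why experiments.log shows up in the output of every process.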

> if the hostfile is changed to (3xnode01 + 1x node02) … it only complains about experiments_3.log

This is consistent with some sort of coordination problem between nodes rather than MPI processes. Running YANK on multiple nodes with MPI works for me on our cluster so this is likely a configuration problem rather than a bug. I’d try these:

  1. Check that the mpi4py package installed with conda comes from the conda-forge channel.
  2. Check what happens when the p-xylene-implicit-output/experiments/ directory already exists before running YANK.

If 2. works, it may be that something weird happens when mpi.barrier() is called to sync the processes after the output directory is created by the MPI process with rank 0.
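
Suggestion 2 matters because `logging.FileHandler` opens its file as soon as it is constructed, so it raises `FileNotFoundError` when the parent directory is missing, which is the failure shown in the traceback. Below is a minimal sketch of a defensive variant in which each process makes sure the directory exists before opening its log. `make_file_handler` is a hypothetical helper for illustration, not part of YANK, and the paths are placeholders.

```python
import logging
import os
import tempfile


def make_file_handler(log_file_path):
    """Create the parent directory if needed, then open a FileHandler.

    Hypothetical helper for illustration; not part of YANK.
    """
    log_dir = os.path.dirname(os.path.abspath(log_file_path))
    # exist_ok=True keeps this safe when several processes race to create it.
    os.makedirs(log_dir, exist_ok=True)
    return logging.FileHandler(log_file_path)


# Demonstration in a throwaway directory whose "experiments" subfolder
# does not exist yet.
tmp_root = tempfile.mkdtemp()
log_path = os.path.join(tmp_root, "experiments", "experiments_2.log")
handler = make_file_handler(log_path)
handler.close()
print(os.path.isfile(log_path))
```

This only works around the symptom. If pre-creating the directory makes the run succeed, it is still worth finding out why the directory created by rank 0 was not visible to the processes on the other node after the barrier.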

1 reaction
andrrizzi commented, Mar 8, 2018

Glad it’s working! I’d still be careful and double check that everything is running as expected. If this is a problem with MPI configuration, it’s possible there will be more problems later.
