MPI repex trailblazing dies at end of trailblazing

See original GitHub issue

I just noticed that an MPI run using repex and trailblazing for the BRD4 test system using this YAML file died at the end of trailblazing:

2018-11-19 04:58:52,202 - DEBUG - yank.pipeline - trailblazing: state_parameter lambda_sterics, simulated_value 0.08887985583853127, current_parameter_value 0.04839281265346118, std_du 0.5134917528793879
2018-11-19 04:59:25,224 - DEBUG - yank.pipeline - trailblazing: state_parameter lambda_sterics, simulated_value 0.04839281265346118, current_parameter_value 0.0, std_du 0.4132120996374473
2018-11-19 04:59:25,225 - DEBUG - yank.pipeline - Alchemical path found: {'lambda_electrostatics': [1.0, 0.95, 0.8999999999999999, 0.8499999999999999, 0.7999999999999998, 0.7499999999999998, 0.6999999999999997, 0.6387209700669697, 0.5790253242758696, 0.5204060356681246, 0.4609442605375813, 0.40433981917479267, 0.33567070513317865, 0.27416285442810556, 0.21103242826939206, 0.14972564502170452, 0.07998556932087207, 0.0009736900429375486, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'lambda_sterics': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9575795865483813, 0.9121943363134406, 0.8686284868410266, 0.8281742054378803, 0.7835707001058548, 0.7395478276369869, 0.6977469237017255, 0.6567304355151427, 0.6159670050468764, 0.5850659760831278, 0.5438792978602357, 0.5114994213498266, 0.48223554046780187, 0.4522580699381295, 0.4260537632558338, 0.397933628369692, 0.3732724389208485, 0.35396829064262253, 0.3340455062541764, 0.31613594671256723, 0.2986968556315033, 0.28266408417845745, 0.26755113898802224, 0.2544825877884863, 0.24134717331761885, 0.22876015799568095, 0.21898392917678333, 0.2079355647081965, 0.19668728551527873, 0.18387991041683313, 0.16512157756140372, 0.14415015504378637, 0.11860319976227818, 0.08887985583853127, 0.04839281265346118, 0.0]}
2018-11-19 04:59:25,226 - DEBUG - yank.mpi - Node 1/1: executing <function ExperimentBuilder._generate_yaml at 0x2ab72708ed08>
2018-11-19 04:59:25,261 - DEBUG - yank.mpi - Node 1/1: waiting for barrier after <function ExperimentBuilder._generate_yaml at 0x2ab72708ed08>
2018-11-19 08:55:18,458 - CRITICAL - yank.mpi - MPI node 12/16 raised an exception and called Abort()! The exception traceback follows
Traceback (most recent call last):
  File "/home/chodera/miniconda/bin/yank", line 11, in <module>
    load_entry_point('yank==0.23.8', 'console_scripts', 'yank')()
  File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/cli.py", line 73, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/commands/script.py", line 138, in dispatch
    yaml_builder.run_experiments()
  File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/experiment.py", line 778, in run_experiments
    self._generate_experiments_protocols()
  File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/experiment.py", line 2452, in _generate_experiments_protocols
    send_results_to=None, group_size=1, sync_nodes=True)
  File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/mpi.py", line 531, in distribute
    *other_args, **kwargs)
  File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/mpi.py", line 386, in exec_tasks
    raise error
TypeError: Group 12/16 Node 1/1 received an exception from another MPI process. Original stack trace follow:
Traceback (most recent call last):
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/mpi.py", line 357, in exec_tasks
        results.append(task(distributed_arg, *other_args, **kwargs))
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/experiment.py", line 2548, in _generate_experiment_protocol
        alchemical_phase = phase.initialize_alchemical_phase()
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/experiment.py", line 359, in initialize_alchemical_phase
        alchemical_phase.equilibrate(n_iterations, mcmc_moves=mcmc_move)
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/yank.py", line 1197, in equilibrate
        self._sampler.equilibrate(n_iterations=n_iterations, mcmc_moves=mcmc_moves)
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/multistate/multistatesampler.py", line 616, in equilibrate
        self._propagate_replicas()
      File "/home/chodera/miniconda/lib/python3.6/site-packages/openmmtools-0.17.0-py3.6.egg/openmmtools/utils.py", line 87, in _wrapper
        return func(*args, **kwargs)
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/multistate/multistatesampler.py", line 1195, in _propagate_replicas
        send_results_to=0)
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/mpi.py", line 531, in distribute
        *other_args, **kwargs)
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/mpi.py", line 376, in exec_tasks
        raise error
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/mpi.py", line 357, in exec_tasks
        results.append(task(distributed_arg, *other_args, **kwargs))
      File "/home/chodera/miniconda/lib/python3.6/site-packages/yank-0.23.8-py3.6-linux-x86_64.egg/yank/multistate/multistatesampler.py", line 1226, in _propagate_replica
        output_dir = os.path.join(os.path.dirname(self._reporter.filepath), 'nan-error-logs')
      File "/home/chodera/miniconda/lib/python3.6/posixpath.py", line 156, in dirname
        p = os.fspath(p)
    TypeError: Node 1/1: expected str, bytes or os.PathLike object, not method

This simulation is on lilac in

/data/chodera/chodera/gsk/yank-benchmark/BRD4/repex-rmsd-2

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

jchodera commented, Dec 16, 2018

I’ll add an explicit filepath attribute to the DummyReporter, but I don’t think that will solve the fundamental issue. That line is executed because a NaN was detected during the equilibration before starting the trailblaze protocol. So if this fails reliably, I’d expect the trailblaze algorithm to fail with the NaN error instead of a type error.

I was just hoping to get more information about what the actual error was so I could debug it!
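The TypeError at the bottom of the traceback can be reproduced in a few lines. This is a minimal sketch, not YANK's actual DummyReporter: it assumes the dummy reporter answers every unknown attribute lookup with a do-nothing callable, which would explain why os.path.dirname received a callable instead of a string. The names NoOpReporter, NoOpReporterWithPath, and nan_log_dir are hypothetical and exist only for illustration.

```python
import os


class NoOpReporter:
    """Hypothetical stand-in for a dummy reporter (an assumption, not YANK code).

    Any attribute that is not found normally resolves to a callable that
    does nothing, so `reporter.filepath` is a function rather than a str.
    """

    def __getattr__(self, name):
        def _no_op(*args, **kwargs):
            return None
        return _no_op


class NoOpReporterWithPath(NoOpReporter):
    """Same stand-in, with an explicit filepath attribute.

    __getattr__ is only consulted when normal lookup fails, so the
    instance attribute set here takes precedence over the no-op fallback.
    """

    def __init__(self, filepath):
        self.filepath = filepath


def nan_log_dir(reporter):
    # Same expression as the failing line in the traceback.
    return os.path.join(os.path.dirname(reporter.filepath), 'nan-error-logs')


if __name__ == '__main__':
    try:
        nan_log_dir(NoOpReporter())
    except TypeError as error:
        # The real NaN is hidden behind this unrelated error.
        print('TypeError:', error)

    # With an explicit path (made up for this example) the directory resolves.
    print(nan_log_dir(NoOpReporterWithPath('/tmp/run/experiment.nc')))
```

As the comment above notes, an explicit attribute only removes the misleading TypeError. The NaN detected during equilibration would still need to be diagnosed separately.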

andrrizzi commented, Dec 16, 2018

Of course! Opening the PR soon.
