mpi4py error while getting results (paired with SLURM)

ERROR:

Traceback (most recent call last):
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/futures/__main__.py", line 72, in <module>
    main()
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/futures/__main__.py", line 60, in main
    run_command_line()
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "cali_send_2.py", line 137, in <module>
    globals()[sys.argv[1]](sys.argv[2], sys.argv[3])
  File "cali_send_2.py", line 94, in solve_on_cali
    sols = list(executor.map(solve_matrix, repeat(inputs), range(len(wls)), wls))
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/futures/pool.py", line 207, in result_iterator
    yield futures.pop().result()
  File "/opt/software/anaconda/3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/opt/software/anaconda/3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 5: invalid start byte

ENV:

  • CentOS release 6.5 (Final)
  • Python 3.6 anaconda
  • mpiexec (OpenRTE) 1.8.2
  • mpi4py 3.0.3

Piece of Code:

import os
from itertools import repeat
from zipfile import ZipFile

from mpi4py.futures import MPIPoolExecutor

inputs = [der_mats, ref_ind_yee_grid, n_xy_sq, param_sweep_on, i_m, inv_eps, sol_params]
with MPIPoolExecutor(max_workers=int(nodes)) as executor:
    # list() consumes the result iterator, so every task has finished once it
    # returns; leaving the with block then shuts the executor down.
    sols = list(executor.map(solve_matrix, repeat(inputs), range(len(wls)), wls))

with ZipFile(zp_fl_nm, 'w') as zipobj:  # the archive is closed on exit
    for w, v, solnum, vq in sols:
        print(w[0], solnum)  # shows whether the data contains duplicates
        for name, arr in (("w", w), ("v", v), ("vq", vq)):
            fname = f"{name}_sol_{solnum}.npy"
            arr.tofile(fname)  # tofile writes raw bytes without the .npy header
            zipobj.write(fname)
            os.remove(fname)

I call the method by submitting a command like this: f'srun --mpi=pmi2 -n ${{SLURM_NTASKS}} python -m mpi4py.futures cali_send_2.py solve_on_cali \"\"{name}\"\" {num_nodes}'

The error does not always appear; it depends on the range used for wls. With wls = np.arange(0.4e-6, 1.8e-6, 0.01e-6) the run crashes with this error, and with a step of 0.1e-6 it returns duplicates of some solutions. With wls = np.arange(0.55e-6, 1.55e-6, 0.01e-6), or the same range with a step of 0.1e-6 or 0.001e-6, it does NOT crash with the error above and returns good results without duplicates.

Could someone please explain the origin of this error? I suspect floating-point values such as 1.699999999999999999999e-6.
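The suspicion about the float values in wls can be checked directly. The sketch below is a minimal, standard-library-only illustration, not the code from the issue: float_step_grid and integer_grid are hypothetical helpers, and float_step_grid only imitates the start + i*step pattern of a float-step range such as np.arange rather than reproducing NumPy's exact arithmetic. It compares that grid with one built from integer multiples of 1e-8 and counts repeated entries.

```python
import math


def float_step_grid(start, stop, step):
    """Imitate a float-step range: start + i*step for i = 0..n-1."""
    n = math.ceil((stop - start) / step)
    return [start + i * step for i in range(n)]


def integer_grid(first, last, unit):
    """Build the grid from integer multiples of `unit` (end excluded)."""
    return [k * unit for k in range(first, last)]


# Range taken from the question; 1e-8 is the 0.01e-6 step.
float_grid = float_step_grid(0.4e-6, 1.8e-6, 0.01e-6)
int_grid = integer_grid(40, 180, 1e-8)

# Entries that differ in their last bits from the integer-built grid.
mismatches = sum(a != b for a, b in zip(float_grid, int_grid))
# Entries that repeat inside the float-step grid.
duplicates = len(float_grid) - len(set(float_grid))

print("lengths:", len(float_grid), len(int_grid))
print("entries differing in the last bits:", mismatches)
print("duplicate wavelengths:", duplicates)
```

If the float-step grid contains no repeated wavelengths, rounding in wls alone would not account for duplicated solutions, and the cause is more likely elsewhere, for example in the environment, as the maintainer comments suggest. Building the grid from integer counts, or using np.linspace with an explicit number of points, keeps its length predictable.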

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

1 reaction
dalcinl commented, Apr 7, 2021

@byquip You are using Python from a miniconda environment, but mpi4py is installed in $HOME/.local. That’s suspicious; conda users should just pip install in the environment. Or perhaps the problem is what @leofang pointed out: the environment is not active on all the compute nodes.
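One way to test both possibilities is to have every rank report which interpreter it runs and where it would load mpi4py from. The following is a minimal sketch using only the standard library; describe_environment and the file name env_check.py are hypothetical, introduced here for illustration.

```python
import importlib.util
import socket
import sys


def describe_environment(package="mpi4py"):
    """Report the interpreter and the location a package would load from."""
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package)
    return {
        "host": socket.gethostname(),
        "python": sys.executable,
        "prefix": sys.prefix,
        "package_path": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    info = describe_environment()
    # One line per process, so the output of different nodes can be compared.
    print(" | ".join(f"{key}={value}" for key, value in info.items()))
```

Saved as env_check.py and launched with the same srun options as the real job (for example srun --mpi=pmi2 -n ${SLURM_NTASKS} python env_check.py), every process prints one line. Lines that show different interpreters, or a package_path under $HOME/.local while python points into the conda environment, would support the explanation given in this comment.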

1 reaction
dalcinl commented, Apr 7, 2021

This kind of question is better suited to mpi4py’s mailing list in Google Groups. I understand that filing an issue on GitHub is very convenient for users, but it increases the load on core developers, and the community watching the mailing list is usually larger. The chances of getting a good tip or advice are higher on the mailing list.
