Incomplete node list

On LUMI, the command sacct -S <start_time> -P -j <jobid> -o jobid,state,exitcode,end,nodelist doesn't always report the complete node list from the start of the job. For instance, it may print

JobID|State|ExitCode|End|NodeList
1305969|PENDING|0:0|Unknown|nid005828
1305969.batch|RUNNING|0:0|Unknown|nid005828

and a bit later it starts reporting the complete information:

JobID|State|ExitCode|End|NodeList
1305969|RUNNING|0:0|Unknown|nid[005828-005831]
1305969.batch|RUNNING|0:0|Unknown|nid005828
1305969.0|RUNNING|0:0|Unknown|nid[005829-005831]

This is a problem for ReFrame because it updates the node list only once: https://github.com/reframe-hpc/reframe/blob/a5b66c7c41d7cc884893642fd4d9331b146a3c16/reframe/core/schedulers/slurm.py#L383-L392
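
To make the discrepancy concrete, here is a minimal sketch that parses the two sacct outputs quoted above and compares the job-level NodeList field. The helper name parse_sacct is hypothetical and is not part of ReFrame; it only assumes the pipe-separated layout produced by the -P option.

```python
def parse_sacct(output):
    """Map each JobID row of pipe-separated `sacct -P` output to its NodeList."""
    lines = [ln for ln in output.strip().splitlines() if ln]
    header = lines[0].split('|')
    jobid_col = header.index('JobID')
    nodes_col = header.index('NodeList')
    result = {}
    for ln in lines[1:]:
        fields = ln.split('|')
        result[fields[jobid_col]] = fields[nodes_col]
    return result


# The two outputs quoted in the issue, copied verbatim.
EARLY = """JobID|State|ExitCode|End|NodeList
1305969|PENDING|0:0|Unknown|nid005828
1305969.batch|RUNNING|0:0|Unknown|nid005828
"""

LATE = """JobID|State|ExitCode|End|NodeList
1305969|RUNNING|0:0|Unknown|nid[005828-005831]
1305969.batch|RUNNING|0:0|Unknown|nid005828
1305969.0|RUNNING|0:0|Unknown|nid[005829-005831]
"""

early = parse_sacct(EARLY)
late = parse_sacct(LATE)

# A caller that stores the first non-empty value keeps the early,
# single-node answer and never sees the later, complete one.
print(early['1305969'])
print(late['1305969'])
```

The first poll already returns a non-empty node list, so a "record it once" strategy has no signal that the value is still incomplete.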

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

ireed commented, Oct 18, 2022

Getting the nodelist through job.nodelist does not work consistently, regardless of the stage at which I try it. I do not expect it to work before or during the compile stage. But for the performance or sanity stage (or any time after the run stage), it is incredibly useful to know which nodes I have and how many. At the moment, I do not know how to reliably get this information through the framework without hacks.

vkarak commented, Oct 19, 2022

The problem is that we peek into the nodelist only once, the first time we get back a non-empty node list from sacct or squeue, as @rsarm has pointed out. Apparently, Slurm does not fill it in all at once, which is why we sometimes miss nodes and sometimes don't. I think the best solution is to retrieve the value every time, but not to issue scontrol to unwrap the node list until the job has finished, so that we issue scontrol only once.
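
The approach described in this comment could look roughly like the sketch below. It is not ReFrame's actual implementation: the class NodeListTracker and the function expand_nodespec are hypothetical names, and expand_nodespec is a simplified pure-Python stand-in for scontrol show hostnames that only handles a plain hostname or a single bracketed group.

```python
import re


def expand_nodespec(spec):
    """Simplified stand-in for `scontrol show hostnames` (assumption):
    handles only a plain hostname or one `prefix[a-b,c]` group."""
    m = re.fullmatch(r'(\w+?)\[([\d,\-]+)\]', spec)
    if m is None:
        return [spec]
    prefix, body = m.groups()
    hosts = []
    for part in body.split(','):
        lo, _, hi = part.partition('-')
        hi = hi or lo
        width = len(lo)  # preserve zero padding, e.g. 005828
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f'{prefix}{n:0{width}d}')
    return hosts


class NodeListTracker:
    """Refresh the raw node spec on every poll; expand it once at the end."""

    def __init__(self, expand=expand_nodespec):
        self._expand = expand
        self.nodespec = None   # latest compact spec reported by the scheduler
        self.nodelist = None   # expanded hostnames, set once the job finishes
        self.expand_calls = 0

    def update(self, nodespec, finished):
        # Always overwrite with the latest non-empty value from the poll.
        if nodespec:
            self.nodespec = nodespec
        # Expand only once, after the job has finished.
        if finished and self.nodelist is None and self.nodespec:
            self.nodelist = self._expand(self.nodespec)
            self.expand_calls += 1


tracker = NodeListTracker()
tracker.update('nid005828', finished=False)            # early, partial value
tracker.update('nid[005828-005831]', finished=False)   # later, complete value
tracker.update('nid[005828-005831]', finished=True)    # job done: expand now
tracker.update('nid[005828-005831]', finished=True)    # no second expansion
print(tracker.nodelist)
```

Keeping the compact spec around and deferring expansion means each poll stays cheap, while the single expansion at the end sees the final, complete value.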
