Unexpected exit status 1 while running "logging"


I’m running a task array of 22 jobs (one per autosome) and I’ve noticed a message I have not seen before. Seemingly, the logging component of the job failed(?). (That may be inaccurate, it’s just my interpretation of the message.)

The message that I get when I run dstat on the job using -f is:

  script-name: dsub-command.sh
  start-time: '2019-05-22 16:35:13.783803'
  status: RUNNING
  status-detail: |-
    logging:
    Unexpected exit status 1 while running "logging"
  status-message: Unexpected exit status 1 while running "logging"
  task-attempt: 1
  task-id: '19'
  user-id: jamesp

The other tasks seem to be continuing without issue. I am not yet sure whether this task will complete. (It says it is RUNNING, as you can see above.)

This is dsub version: 0.3.1

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

mbookman commented, May 23, 2019

Thanks for following up with those details. It is good that the logging failure did ultimately fail the workflow. That is expected. It did take longer to fail the workflow than I would have expected. I’ll follow up with Cloud Health to see if a failure in a background action should trigger failure more quickly.

I want to correct one thing I indicated yesterday: we actually do retry gsutil cp 3 times because of occasional arbitrary failures (see https://github.com/DataBiosphere/dsub/blob/master/dsub/providers/google_v2.py#L103). However, we put no delay between attempts, which prevents us from recovering from this particular auth issue. We will add a modest delay here. The refresh of credentials should not take long, and we don’t want to overly delay reporting a genuine failure.

Specific follow-ups:

  • Add a delay in the gsutil cp retry logic
  • Surface more action failure details in dstat --full (“ServiceException: 401…” was there in the underlying operation)
  • Determine whether the logging failure can be made to fail the operation more quickly
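
As a rough illustration of the first follow-up, here is a minimal Python sketch of retrying a command with a pause between attempts. This is not dsub's actual retry code (the linked file is the authoritative source); the function name `run_with_retries`, its parameters, and the 5-second default delay are assumptions made for illustration only.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Run `command` up to `attempts` times, pausing between failed attempts.

    Hypothetical helper, not part of dsub. Returns the CompletedProcess of
    the first successful attempt, or of the last attempt if all of them fail.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Pause only if another attempt follows, so that a genuine failure
        # is not reported later than necessary.
        if attempt < attempts:
            sleep(delay_seconds)
    return result
```

A call such as `run_with_retries(["gsutil", "cp", "task.log", "gs://my-bucket/logs/"])` (file and bucket names are placeholders) would then give stale credentials a few seconds to refresh before the next attempt.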

mbookman commented, May 22, 2019

Various environments, including GCE and Cloud Shell, episodically have problems where the credentials used become unavailable. So in this particular case, the GCE service account token has become unavailable or stale and gsutil is unable to successfully make necessary calls.

In various places, dsub has explicit retries for such conditions. See https://github.com/DataBiosphere/dsub/blob/master/dsub/providers/google_base.py#L69:

# Auth errors should be permanent errors that a user needs to go fix
# (enable a service, grant access on a resource).
# However we have seen them occur transiently, so let's retry them when we
# see them, but not as patiently.
HTTP_AUTH_ERROR_CODES = set([401, 403])

The case observed here (logging) does not explicitly include any retries. Unfortunately, catching gsutil errors is a little clunky since you have to capture STDERR and do string parsing.
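
To make the "capture STDERR and do string parsing" point concrete, here is a small Python sketch. It assumes the error text has the shape "SomethingException: <HTTP status> ...", which is inferred only from the "ServiceException: 401…" message seen in this issue; the function names and the regular expression are hypothetical and are not taken from dsub.

```python
import re
import subprocess

# Same codes as the HTTP_AUTH_ERROR_CODES set quoted above.
AUTH_ERROR_CODES = {401, 403}

# Assumed message shape: "<Name>Exception: <3-digit HTTP status> ...".
_STATUS_PATTERN = re.compile(r"\w+Exception:\s*(\d{3})")


def auth_error_code(stderr_text):
    """Return 401 or 403 if the captured stderr mentions one, else None."""
    for match in _STATUS_PATTERN.finditer(stderr_text):
        code = int(match.group(1))
        if code in AUTH_ERROR_CODES:
            return code
    return None


def run_and_classify(command):
    """Run `command`; return (exit status, auth error code or None)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, auth_error_code(result.stderr)
```

The clunkiness is visible here: the classification depends entirely on the wording of the tool's error output, so any change in that wording silently breaks the check.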

It will be interesting to see if this problem transiently caused the logging action to fail and if the credentials will have been refreshed by the time delocalization and final_logging occur.
