Unexpected exit status 1 while running "logging"

I’m running a task array of 22 jobs (one per autosome) and I’ve noticed a message I have not seen before. Seemingly, the logging component of the job failed(?). (That may be inaccurate, it’s just my interpretation of the message.)

The message that I get when I run dstat on the job using -f is:

  script-name: dsub-command.sh
  start-time: '2019-05-22 16:35:13.783803'
  status: RUNNING
  status-detail: |-
    logging:
    Unexpected exit status 1 while running "logging"
  status-message: Unexpected exit status 1 while running "logging"
  task-attempt: 1
  task-id: '19'
  user-id: jamesp

The other tasks seem to be continuing without issue. I am not yet sure whether this task will complete. (Its status is RUNNING, as you can see above.)

This is dsub version: 0.3.1

mbookman commented, May 23, 2019

Thanks for following up with those details. It is good that the logging failure did ultimately fail the workflow. That is expected. It did take longer to fail the workflow than I would have expected. I’ll follow up with Cloud Health to see if a failure in a background action should trigger failure more quickly.

I want to correct one thing I indicated yesterday: we actually do retry gsutil cp 3 times due to the occasional arbitrary failure (see https://github.com/DataBiosphere/dsub/blob/master/dsub/providers/google_v2.py#L103). However, we put no delay between attempts, which prevents us from recovering from this particular auth issue. We will add a modest delay here. The refresh of credentials should not take long, and we don’t want to overly delay reporting a genuine failure.

Specific follow-ups:

- Add a delay in the gsutil cp retry logic
- Surface more action failure details in dstat --full (“ServiceException: 401…” was there in the underlying operation)
- Determine whether the logging failure can be made to fail the operation more quickly
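
For readers who want to picture the first follow-up, here is a minimal sketch of a retry loop with a pause between attempts. It is not the dsub implementation (see the google_v2.py link above for that). The function name, the 5-second delay, and the injectable `sleep` parameter are assumptions made for illustration. The demo uses a stand-in command that always fails, so it runs without gsutil installed.

```python
import subprocess
import sys
import time


def run_with_retries(command, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Run `command` up to `attempts` times, pausing between failed attempts.

    Hypothetical helper: the defaults are illustrative, not dsub's settings.
    Returns the exit status of the last attempt.
    """
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.call(command)
        if returncode == 0:
            return 0
        # Pause only if another attempt follows, so a genuine failure
        # is not reported later than necessary.
        if attempt < attempts:
            sleep(delay_seconds)
    return returncode


# Demo: a stand-in for a failing copy. Pauses are recorded rather than slept.
pauses = []
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
status = run_with_retries(always_fails, sleep=pauses.append)
print(status, pauses)  # prints: 1 [5.0, 5.0]
```

With no delay, all three attempts can land inside the same window in which the credentials are stale. Even a short pause gives a token refresh a chance to complete before the next attempt.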

mbookman commented, May 22, 2019

Various environments, including GCE and Cloud Shell, episodically have problems where the credentials used become unavailable. So in this particular case, the GCE service account token has become unavailable or stale and gsutil is unable to successfully make necessary calls.

In various places, dsub has explicit retries for such conditions. See https://github.com/DataBiosphere/dsub/blob/master/dsub/providers/google_base.py#L69:

# Auth errors should be permanent errors that a user needs to go fix
# (enable a service, grant access on a resource).
# However we have seen them occur transiently, so let's retry them when we
# see them, but not as patiently.
HTTP_AUTH_ERROR_CODES = set([401, 403])
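
As a rough illustration of what "retry them, but not as patiently" could look like, the sketch below gives auth errors a smaller attempt budget than other transient errors. Only `HTTP_AUTH_ERROR_CODES` is taken from the snippet above. The other names and every numeric value are hypothetical and should not be read as dsub's actual configuration.

```python
# Copied from the snippet quoted above.
HTTP_AUTH_ERROR_CODES = set([401, 403])

# Hypothetical values, chosen only for illustration.
EXAMPLE_TRANSIENT_CODES = set([500, 503, 504])
EXAMPLE_MAX_TRANSIENT_ATTEMPTS = 8
EXAMPLE_MAX_AUTH_ATTEMPTS = 3


def should_retry(status_code, attempt):
    """Return True if a call that failed with `status_code` deserves another try.

    `attempt` is the 1-based number of the attempt that just failed.
    """
    if status_code in HTTP_AUTH_ERROR_CODES:
        # Usually a permanent error, so give up sooner.
        return attempt < EXAMPLE_MAX_AUTH_ATTEMPTS
    if status_code in EXAMPLE_TRANSIENT_CODES:
        return attempt < EXAMPLE_MAX_TRANSIENT_ATTEMPTS
    # Anything else is treated as permanent in this sketch.
    return False
```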

The case observed here (logging) does not explicitly include any retries. Unfortunately, catching gsutil errors is a little clunky since you have to capture STDERR and do string parsing.
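
To show why this is clunky, here is a minimal sketch of the capture-STDERR-and-parse approach. It is an assumption about how one might do it, not code from dsub. The marker string is taken from the error text quoted elsewhere in this thread, and the helper name is hypothetical. The demo uses a stand-in command that mimics a failing gsutil call, so it runs without gsutil installed.

```python
import subprocess
import sys

# Marker text taken from the error quoted in this thread. Matching on it is
# an assumption; real gsutil error text may vary between versions.
AUTH_ERROR_MARKERS = ("ServiceException: 401",)


def run_and_classify(command):
    """Run `command`, capture STDERR, and guess whether it hit an auth error.

    Returns a (returncode, looks_like_auth_error) tuple.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    looks_like_auth_error = result.returncode != 0 and any(
        marker in result.stderr for marker in AUTH_ERROR_MARKERS
    )
    return result.returncode, looks_like_auth_error


# Demo: a stand-in for a gsutil call that fails with a 401 on STDERR.
fake_gsutil = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('ServiceException: 401\\n'); sys.exit(1)",
]
print(run_and_classify(fake_gsutil))  # prints: (1, True)
```

The weak point is the string match: the exit status alone cannot distinguish an auth failure from any other failure, so the classification depends on error text that is not a stable interface.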

It will be interesting to see if this problem transiently caused the logging action to fail and if the credentials will have been refreshed by the time delocalization and final_logging occur.