
Jobs results can be very slow / unable to display due to thousands of `ensure_git_repository` calls

See original GitHub issue

Environment

  • Python version: 3.10
  • Nautobot version: 1.2.8

Steps to Reproduce

  1. Configure Git-backed Jobs
  2. Run them thousands of times
  3. (probably unrelated, but for completeness: disable the Jobs provider for that Git source, then add the same Git source again with another slug and enable the Jobs provider)
  4. Navigate to the Extras > Job Results page

Expected Behavior

Job results are displayed within a few seconds at most.

Observed Behavior

Navigation fails: the reverse proxy returns 502 Bad Gateway after 60 seconds or more, or, if there is too much traffic, the container itself is marked unhealthy and a reverse proxy like Traefik can remove it as a backend, resulting in a 404.

Additional information

(maybe this heading could be added to the bug template by default?)

In the container logs (of the nautobot container, not the celery_worker one), I’ve noticed thousands of these lines:

Attaching to nautobot_nautobot_1
nautobot_1       |   Repository successfully refreshed
nautobot_1       | 10:56:00.816 INFO    nautobot.jobs :
nautobot_1       |   Repository successfully refreshed
nautobot_1       | 10:56:00.889 INFO    nautobot.jobs :
nautobot_1       |   Repository successfully refreshed
[..]

I’ve noticed this in the past (https://github.com/nautobot/nautobot/issues/744) but it seemed harmless at the time; today, with far more JobResults, the page no longer displays at all.

The ensure_git_repository call seems to be made for each JobResult related to a Git job, perhaps limited to the MAX_PAGE_SIZE count or the user’s page size, but I’m not sure since there are really a lot of logs.
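To make the per-row cost concrete, here’s a rough way to time the suspected hot path from nautobot-server shell (a sketch only; the 50-record slice just approximates one page of the table):

# Rough timing sketch, run from `nautobot-server shell`
import time

from nautobot.extras.models import JobResult

start = time.monotonic()
for job_result in JobResult.objects.all()[:50]:  # roughly one page of results
    job_result.related_object  # for Git-backed Jobs this ends up calling ensure_git_repository
print(f"{time.monotonic() - start:.1f}s for 50 rows")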

This can be critical: when triggered repeatedly it can cause the healthcheck to time out and fail, and it may be possible to DoS the instance this way.

In Docker’s own logs, it’s also possible to see:

# journalctl -fu docker
Mar 17 10:47:13 my_hostname dockerd[12283]: time="2022-03-17T10:47:13.700931230+01:00" level=warning msg="Health check for container <my_container_id> error: context deadline exceeded
[..]

I tried to find where this call originates from and noted the following call tree (a quick tracing sketch follows it):

ensure_git_repository
>pull_git_repository_and_refresh_data
>>enqueue_pull_git_repository_and_refresh_data
>Jobs._get_job_source_paths
>>Jobs.get_jobs
>>>get_job_classpaths
>>>get_job
>>>>JobResult.related_object()
>>>JobListView.get()
>>>JobView.get()
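
To double-check a call path like the one above, one lightweight option is to temporarily wrap the function so that every call prints its caller stack. This is a generic sketch for local debugging only; the Nautobot module path in the usage comment is an assumption:

import functools
import traceback


def log_call_sites(func):
    """Wrap a function so each call prints a short stack trace of its callers."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stack = "".join(traceback.format_stack(limit=10)[:-1])  # drop the wrapper frame itself
        print(f"{func.__name__} called from:\n{stack}")
        return func(*args, **kwargs)
    return wrapper


# Hypothetical usage in a temporary local patch (module path assumed):
# from nautobot.extras.datasources import git
# git.ensure_git_repository = log_call_sites(git.ensure_git_repository)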

The path I think is problematic is JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths > ensure_git_repository.

Maybe Jobs.get_jobs could be cached (per request?) to deduplicate the ensure_git_repository calls? Or maybe ensure_git_repository should not be called here at all (update the list only on Git source creation/sync, with a beat task or watcher monitoring changes for local jobs)? And shouldn’t ensure_git_repository run only on the Celery worker?
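
For illustration, the deduplication idea could look like a simple time-based guard around the call. This is only a sketch of the suggestion above, not Nautobot’s actual fix; the import path, the pass-through call signature, and keying on the repository slug are all assumptions:

import time

from nautobot.extras.datasources.git import ensure_git_repository  # import path assumed

# Hypothetical per-process guard: skip refreshing a repository that was already
# refreshed within the last `ttl` seconds.
_last_refresh = {}


def ensure_git_repository_deduplicated(repository_record, ttl=60, **kwargs):
    now = time.monotonic()
    last = _last_refresh.get(repository_record.slug)
    if last is not None and now - last < ttl:
        return
    ensure_git_repository(repository_record, **kwargs)  # extra kwargs passed through unchanged
    _last_refresh[repository_record.slug] = now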

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Top GitHub Comments

2 reactions
glennmatthews commented, Apr 4, 2022

Thinking about this from a different angle: as you pointed out, the main critical path here is JobResultTable > JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths > ensure_git_repository, which is a lot of unnecessary work. In the case where related_object is a Job, all we actually need to render the table column is its class_path; actually loading the Jobs into memory is completely unnecessary processing.

I wonder therefore if we should revisit how this table column is calculated/rendered so that it only calls record.related_object in the case where we can determine in advance that related_object is not a Job, something like:

# Imports needed to run this snippet on its own:
from django.contrib.contenttypes.models import ContentType
from django.urls import reverse


def job_creator_link(value, record):
    """
    Get a link to the related object, if any, associated with the given JobResult record.
    """
    # record.related_object is potentially slow if the related object is a Job class,
    # as it needs to actually (re)load the Job class into memory. That's unnecessary
    # computation as we don't actually need the class itself, just its class_path, which is
    # already available as record.name on the JobResult itself. So save some trouble:
    if record.obj_type == ContentType.objects.get(app_label="extras", model="job"):
        return reverse("extras:job", kwargs={"class_path": record.name})

    # If it's not a Job class, maybe it's something like a GitRepository, which we can look up cheaply:
    related_object = record.related_object
    if related_object:
        return related_object.get_absolute_url()
    return None
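
For context, a helper like this is typically hooked into the table through a django-tables2 linkify callable, roughly as below (a sketch reusing job_creator_link from the snippet above; the column definition is illustrative, not Nautobot’s exact table code):

import django_tables2 as tables


class JobResultTable(tables.Table):
    # django-tables2 calls the linkify callable with the arguments matching its
    # signature (here `value` and `record`) to build the cell's href.
    related_object = tables.Column(
        verbose_name="Related object",
        accessor="related_object",
        linkify=job_creator_link,
        orderable=False,
    )

Note that resolving the cell value through the accessor still touches record.related_object once per row, which is why the follow-up below also limits that lookup to once per row.
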
1 reaction
jathanism commented, Apr 8, 2022

@u1735067 We revised this again to hybridize the approach: the caching is now in place for Git-based jobs, and the related_object is only retrieved once per row if it’s not a Job class. Could you please give the latest version of this a try?

One thing, however: we updated the base branch for this fix from develop (v1.2.x) to next (v1.3.0-beta.x), as we are about to release v1.3.0. So you will need to run database migrations as well, or just hot-fix your v1.2.x install with the updated code. Thanks in advance!
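
For anyone hot-fixing a v1.2.x install by hand, the “retrieved once per row” part boils down to ordinary per-instance memoization, for example with functools.cached_property. The class below is purely illustrative of the technique, not the actual patch:

from functools import cached_property


class RelatedObjectCache:
    """Illustration only: memoize an expensive lookup so that repeated access
    while rendering a single table row does the work at most once."""

    def __init__(self, job_result):
        self.job_result = job_result

    @cached_property
    def related_object(self):
        # The first access performs the (potentially slow) lookup; later accesses
        # on this same instance return the cached value.
        return self.job_result.related_object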

Read more comments on GitHub >
