Jobs results can be very slow / unable to display due to thousands of `ensure_git_repository` calls
Environment
- Python version: 3.10
- Nautobot version: 1.2.8
Steps to Reproduce
- Configure Git backed Jobs
- Call them thousands of times
- (probably not related, but to be complete on the history: disable the Jobs provider for that Git source, add again the same Git source with another slug and enable Jobs provider)
- Navigate to the Extra > Job results page
Expected Behavior
The Job Results page displays within a few seconds at most
Observed Behavior
Navigation fails: the reverse proxy returns 502 Bad Gateway after 60 seconds or more, or, if there's too much traffic, the container itself is considered unhealthy and a reverse proxy like Traefik can remove it as a backend, resulting in 404s.
Additional information
(maybe this heading could be added to the bug template by default?)
In the container logs (of the nautobot container, not the celery_worker one), I've noticed thousands of these lines:
Attaching to nautobot_nautobot_1
nautobot_1 | Repository successfully refreshed
nautobot_1 | 10:56:00.816 INFO nautobot.jobs :
nautobot_1 | Repository successfully refreshed
nautobot_1 | 10:56:00.889 INFO nautobot.jobs :
nautobot_1 | Repository successfully refreshed
[..]
I noticed this in the past (https://github.com/nautobot/nautobot/issues/744) but it seemed harmless at the time; today, with far more JobResults, the page cannot display at all.
The ensure_git_repository call seems to be made for each JobResult related to a Git job, maybe limited to the MAX_PAGE_SIZE count or the user's page count, but I'm not sure since there are really a lot of logs.
This can be critical: when called many times, it can cause the healthcheck to fail or time out, and it may be possible to DoS the instance this way.
In the logs of Docker itself, it's also possible to notice:
# journalctl -fu docker
Mar 17 10:47:13 my_hostname dockerd[12283]: time="2022-03-17T10:47:13.700931230+01:00" level=warning msg="Health check for container <my_container_id> error: context deadline exceeded
[..]
I tried to find where this call originates from and noted:
ensure_git_repository
> pull_git_repository_and_refresh_data
>> enqueue_pull_git_repository_and_refresh_data
> Jobs._get_job_source_paths
>> Jobs.get_jobs
>>> get_job_classpaths
>>> get_job
>>>> JobResult.related_object()
>>> JobListView.get()
>>> JobView.get()
The path I think is problematic is JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths > ensure_git_repository.
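To confirm where such calls originate, one option is to temporarily wrap the noisy function and print a short stack trace on each call. This is a hedged sketch only: the ensure_git_repository below is a stand-in for the real Nautobot helper, not its actual implementation.

```python
# Sketch: wrap a function to log who is calling it. The decorated
# ensure_git_repository here is a stand-in, not Nautobot's real helper.
import traceback

def trace_calls(func):
    """Print the last few stack frames every time func is invoked."""
    def wrapper(*args, **kwargs):
        frames = traceback.format_stack(limit=4)[:-1]  # drop wrapper's own frame
        print(f"--- {func.__name__} called from:")
        print("".join(frames))
        return func(*args, **kwargs)
    return wrapper

@trace_calls
def ensure_git_repository(slug):  # stand-in for the real, slow function
    return f"refreshed {slug}"

def render_job_result_row():
    # Simulates a view/table code path that ends up hitting the helper.
    return ensure_git_repository("my-repo")

result = render_job_result_row()
```

Each call then prints the frames that led to it (here, render_job_result_row), which makes repeated call sites easy to spot in the container logs.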
Maybe Jobs.get_jobs could be cached (per request?) to deduplicate ensure_git_repository calls? Or maybe ensure_git_repository should not be called at all (updating the list only on Git repository creation/sync? A beat task or watcher could monitor changes for local jobs). And shouldn't ensure_git_repository run only on the Celery worker?
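The per-request caching idea suggested above could be sketched roughly as follows. This is purely illustrative, assuming a request-scoped cache set up by middleware; ensure_git_repository here is a cheap stand-in for the real helper, and the wrapper is not Nautobot's actual code.

```python
# Sketch: deduplicate an expensive call within one request's lifetime.
# A contextvar models the per-request scope that middleware would manage.
import contextvars
from functools import wraps

_request_cache = contextvars.ContextVar("request_cache")

def per_request_cache(func):
    """Memoize func for the lifetime of the current request context."""
    @wraps(func)
    def wrapper(*args):
        try:
            cache = _request_cache.get()
        except LookupError:
            return func(*args)  # no active request scope: call through
        key = (func.__name__, args)
        if key not in cache:
            cache[key] = func(*args)
        return cache[key]
    return wrapper

call_count = 0

@per_request_cache
def ensure_git_repository(slug):  # stand-in for the real, slow helper
    global call_count
    call_count += 1
    return f"refreshed {slug}"

def handle_request():
    # Middleware would install a fresh cache at the start of each request.
    token = _request_cache.set({})
    try:
        for _ in range(1000):  # e.g. one call per JobResult table row
            ensure_git_repository("my-repo")
    finally:
        _request_cache.reset(token)

handle_request()
print(call_count)  # the slow helper ran only once for 1000 lookups
```

With something like this in place, a page rendering thousands of JobResult rows would hit the expensive refresh at most once per repository per request.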
Issue Analytics
- State:
- Created 2 years ago
- Comments: 21 (21 by maintainers)
Top GitHub Comments
Thinking about this from a different angle - as you pointed out, the main critical path here is JobResultTable > JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths > ensure_git_repository
, which is a lot of unnecessary work when all we're using to actually render the table column, in the case where related_object is a Job, is its class_path (actually loading the Jobs into memory, etc. is a completely unnecessary bit of processing). I wonder therefore if we should revisit how this table column is calculated/rendered so that it only calls record.related_object in the case where we can determine in advance that related_object is not a Job, something like:

@u1735067 We revised this again to hybridize the approach here to assert that the caching is in place for Git-based jobs, and that the
related_object is only retrieved once per row if it's not a Job class. If you could please give the latest for this a chance? One thing, however, is that we updated the base for this fix from develop (v1.2.x) to next (v1.3.0-beta.x), as we are about to release v1.3.0. So you will need to run database migrations as well, OR just hot fix your v1.2.x install with the updated code. Thanks in advance!
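The code snippet the maintainer attached after "something like:" above is not preserved in this export. As a purely illustrative sketch of the idea (all names hypothetical, not Nautobot's actual models or table code), short-circuiting the column render for Job results might look like:

```python
# Hypothetical sketch: render the "related object" table cell without
# loading the Job class when the JobResult points at a Job. FakeJobResult
# and these field names are illustrative, not Nautobot's real schema.
from dataclasses import dataclass

@dataclass
class FakeJobResult:
    obj_type: str  # e.g. "extras.job" or "dcim.device"
    name: str      # for Job results, this already holds the class_path

def load_related_object(record):
    """Stand-in for the expensive JobResult.related_object lookup,
    which would walk get_job > get_jobs > ensure_git_repository."""
    raise RuntimeError("expensive path hit")

def render_related_object_cell(record):
    # For Job results the class_path is already stored on the record,
    # so render it directly and skip loading the job source entirely.
    if record.obj_type == "extras.job":
        return record.name
    return str(load_related_object(record))

cell = render_related_object_cell(FakeJobResult("extras.job", "my_repo/jobs/MyJob"))
print(cell)
```

The design point is the same one the maintainer describes: the expensive lookup only runs for rows where the related object is not a Job, so Git repositories are never touched just to paint the table.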