(Likely) left-overs from previous build are breaking the build in self-hosted runners

I started to experience strange errors when building documentation, and I have reason to believe they are caused by left-overs from previous builds on the self-hosted machines. For example, it looks like somewhere in the generated classes a get_info class is used, which is normally only generated (in a different directory) during provider packages preparation.

I think this is a manifestation of not “cleaning” everything during the self-hosted build, as discussed before. I think the only way to avoid such errors is to always start from a completely “clean” state when the runner starts building; otherwise it opens up a Pandora’s box of similar problems (imagine, for example, .pyc files from one Python version being used in another).
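To make the .pyc example concrete, here is a minimal sketch of removing stale bytecode from a checkout before a job starts. The function name `remove_stale_bytecode` and the choice of root directory are hypothetical, not something that exists in the Airflow CI scripts; it only illustrates one narrow slice of the "clean state" idea.

```python
import shutil
from pathlib import Path


def remove_stale_bytecode(root: Path) -> int:
    """Delete __pycache__ directories and stray .pyc files under root.

    Hypothetical helper: returns the number of filesystem entries removed.
    """
    removed = 0
    # Materialize each listing first so the tree is not mutated while walking it.
    for cache_dir in list(root.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed += 1
    # Legacy-style .pyc files that sit next to their sources.
    for pyc in list(root.rglob("*.pyc")):
        if pyc.is_file():
            pyc.unlink()
            removed += 1
    return removed
```

This only covers bytecode; generated sources, logs and Docker state would still need the full clean-up described here.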

I think the only reasonable approach is to do what every other CI system does: hard-clean everything (Docker cache, sources, log directories) just before a new job starts. It might look like a waste of time (and cache), as we always have to start from scratch, but it avoids a multitude of problems and developer time lost on investigating issues that should not be on our radar (and that have no easy fix either). Our builds are prepared to restore the cache quickly as needed: they can pull the CI images from the registry (which we use as a cache), and in most cases bringing in the necessary images takes as little as 1 minute, so caching them locally is not needed.
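For the Docker part of such a hard clean, a rough sketch could look like the following. It assumes the runner may simply wipe all local Docker state with `docker system prune`; the wrapper names `docker_prune_command` and `hard_clean_docker` are made up for illustration, and the sketch defaults to a dry run so nothing is deleted unless asked.

```python
import subprocess
from typing import List


def docker_prune_command() -> List[str]:
    """Build the prune command (hypothetical helper).

    --all removes all unused images, not just dangling ones;
    --force skips the confirmation prompt;
    --volumes also prunes unused volumes (exact volume behaviour
    varies by Docker version).
    """
    return ["docker", "system", "prune", "--all", "--force", "--volumes"]


def hard_clean_docker(dry_run: bool = True) -> List[str]:
    """Return the command, and run it only when dry_run is False."""
    cmd = docker_prune_command()
    if not dry_run:
        # check=True makes a failed clean-up fail the pre-job step loudly.
        subprocess.run(cmd, check=True)
    return cmd
```

Failing loudly on a broken clean-up seems preferable here, since a job that starts on a half-cleaned machine is exactly the situation this issue is about.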

Cleaning also helps keep the environment sane: if we clean up everything (True Clean State ™) before the build, there will be no “growing” logs and other artifacts that accumulate as the machine is re-used for several jobs.

We need to clean up everything before the run because there are many reasons why a job might not be stopped cleanly (a cancelled job, temporary network failures and the like). And since we have everything in tmpfs, it should be as easy as simply removing and recreating the tmpfs volume.
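Actually unmounting and re-mounting a tmpfs volume needs root and depends on how the runner host is set up, so the sketch below shows a directory-level stand-in instead: empty everything under a path while keeping the path itself, so that a mount point would stay in place. The name `reset_directory` is hypothetical.

```python
import shutil
from pathlib import Path


def reset_directory(path: Path) -> None:
    """Empty `path` without removing it (hypothetical pre-job step)."""
    # Create the directory if a previous job never got that far.
    path.mkdir(parents=True, exist_ok=True)
    for child in path.iterdir():
        if child.is_dir() and not child.is_symlink():
            # Real directories are removed recursively.
            shutil.rmtree(child)
        else:
            # Files and symlinks (including symlinks to directories).
            child.unlink()
```

Because this runs before the job rather than after it, it does not matter whether the previous job was cancelled or died halfway through.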

Below is an example of a failed job where I suspect the problem is the non-clean state. I ran the doc build locally and it completed without problems, so I suspect the “non-clean” state of the CI machine is the root cause.

https://github.com/apache/airflow/pull/14125/checks?check_run_id=1897287904#step:4:16551

  Module "airflow.provider_info.schema.j" could not be loaded. Full source will not be available. "error importing 'airflow.provider_info.schema.j' (exception was: ModuleNotFoundError("No module named 'airflow.provider_info'",))"
  reading sources... [ 20%] operators-and-hooks-ref/apache
  
  Traceback (most recent call last):
    File "/usr/local/lib/python3.6/site-packages/sphinx/events.py", line 111, in emit
      results.append(listener.handler(self.app, *args))
    File "/usr/local/lib/python3.6/site-packages/sphinx/ext/viewcode.py", line 155, in env_purge_doc
      for modname, (code, tags, used, refname) in list(modules.items()):
  TypeError: 'bool' object is not iterable
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/usr/local/lib/python3.6/site-packages/sphinx/cmd/build.py", line 280, in build_main
      app.build(args.force_all, filenames)
    File "/usr/local/lib/python3.6/site-packages/sphinx/application.py", line 352, in build
      self.builder.build_update()
    File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 298, in build_update
      len(to_build))
    File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 310, in build
      updated_docnames = set(self.read())
    File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 417, in read
      self._read_serial(docnames)
    File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 436, in _read_serial
      self.events.emit('env-purge-doc', self.env, docname)
    File "/usr/local/lib/python3.6/site-packages/sphinx/events.py", line 120, in emit
      (listener.handler, name), exc, modname=modname) from exc
  sphinx.errors.ExtensionError: Handler <function env_purge_doc at 0x7fa233d74378> for event 'env-purge-doc' threw an exception (exception: 'bool' object is not iterable)
  
  Extension error (sphinx.ext.viewcode):
  Handler <function env_purge_doc at 0x7fa233d74378> for event 'env-purge-doc' threw an exception (exception: 'bool' object is not iterable)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

ashb commented, Feb 15, 2021 (2 reactions)

Looking.

ashb commented, Feb 15, 2021 (1 reaction)

(I’m glad I took the time to upload all logs from the runner hosts to Cloudwatch now. https://vector.dev/ is much more powerful than Amazon’s Cloudwatch agent.)
