Flaky builds when using cache actions with Ruby gems

I used the caching action for caching Ruby gems used in building this Middleman website. This has greatly sped up the build process, down to under a minute from 4-5 minutes without caching.

However, it also makes the build too flaky, as the commit trail in PR 23 demonstrates.

Bug or no Bug?

After some research and experiments, this is what I think is going on. Maintainers, please decide what to do with this issue. I’m not sure whether this is an actions/cache issue.

My project had a dependency on ruby-sassc. On installation, this gem does some compilation. Before version 2.2.0, it compiled into a cross-platform gem. With version 2.2.0, this “cross-platformness” (what’s the right word?) was dropped in favour of a speed improvement; see this insightful comment (and if you’re into it, read the entire comment thread of sassc-ruby issue 146).

I suspect that every once in a while, this job gets executed by a worker that runs on an architecture incompatible with the version of the compiled gem that is stored in the cache. Then the build fails. If I re-run the build, it gets picked up by a worker running on a compatible architecture and voila, it passes again.
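One way to check this suspicion would be to log the CPU of each runner and compare passing runs with failing ones. The step below is a minimal sketch, not part of the original workflow; it assumes a Linux runner where /proc/cpuinfo is readable, and the step name is made up for illustration.

```yaml
    # Hypothetical diagnostic step: record which CPU this runner has,
    # so that passing and failing runs can be compared afterwards.
    - name: Show runner CPU
      run: |
        uname -m
        # "|| true" keeps the step green if a field is absent on this platform.
        grep -m1 'model name' /proc/cpuinfo || true
        grep -m1 'flags' /proc/cpuinfo || true
```

If failing runs consistently report a different CPU model or a smaller flag set than the run that populated the cache, that would support the theory.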

My workaround

Pin the ruby-sassc gem to version 2.1.0.

Observations

See the commit trail of PR 23, “Intermittent build failures”.

A good example:

Commit 3ab744a passes - run 461949888

Commit cc4b3ad (same codebase) fails - run 461965463

Failed builds can often be resolved by re-running them once or twice using the “re-run jobs” button.

The build always fails in the Build step, always with a message similar to:

[...]
/home/runner/work/XSCALE-Alliance.github.io/XSCALE-Alliance.github.io/vendor/bundle/ruby/2.5.0/gems/ffi-1.12.2/lib/ffi/library.rb:112: [BUG] Illegal instruction at 0x00007efffd9a5780
ruby 2.5.7p206 (2019-10-01 revision 67816) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0042 p:---- s:0222 e:000221 CFUNC  :open
[...]

I have not observed this build error in builds without the caching step.

How to reproduce

It occurs intermittently, generally once in about ten builds and sometimes more frequently.

I reproduce it by pushing minor changes to this branch, as demonstrated by the commit log of PR 23.

Build configuration

The build configuration is per the Ruby example in this repo.

See verify_pull_request.yml on this PR branch:

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - name: Check Ruby Versions
      run: |
        echo "$RUNNER_TOOL_CACHE"
        ls $RUNNER_TOOL_CACHE/Ruby
    - uses: actions/checkout@v1
    - uses: actions/setup-ruby@v1
      with:
        ruby-version: '2.5'
    - name: Cache Ruby Gems
      uses: actions/cache@v1
      with:
        path: vendor/bundle
        key: ${{ runner.os }}-gems2-${{ hashFiles('**/Gemfile.lock') }}
        restore-keys: |
          ${{ runner.os }}-gems2-
    - name: Bootstrap
      run: |
        bundle config path vendor/bundle
        make bootstrap
    - name: Build
      run: make test

Further experiments

I can try the following:

- Verify the sassc version used. 2.2.0 has this problem; it was reported fixed in 2.2.1, but others have still reported similar issues with 2.2.1. 2.1.x apparently does not have this problem.
- Run 10+ builds with sassc locked to version 2.1.0.

Top GitHub Comments

serra commented, Feb 26, 2020

Thanks for your answers, @joshmgross. Your pointers are clear and I believe I’m following them already. As I don’t have a clear suggestion on how to better handle this in actions/cache, I’ll close this issue now. Thank you for your help.

joshmgross commented, Feb 25, 2020

> I think the platform differences can be more subtle than only bitness. But this stuff is a bit out of my league

All hosted runners for a given label will be the same VM image and architecture. We do a monthly update of that image, but it’s primarily software updates.

> As a user of actions/cache, how can I implement caching that results in a reliable build?

It’s important to choose a key that uniquely identifies a cache, for example one that includes the runner OS and a hash of any dependency files (such as Gemfile.lock). Additionally, you should be careful with restore keys, as they allow pulling an older version of the cache that doesn’t match your primary key.

You can find more info at https://help.github.com/en/actions/configuring-and-managing-workflows/caching-dependencies-to-speed-up-workflows
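As a sketch of the stricter end of this advice: leaving out restore-keys means the cache is only restored on an exact key match, so a bundle built for an older Gemfile.lock is never reused. The trade-off is a full install whenever Gemfile.lock changes. The `gems-v1` prefix is an arbitrary name chosen here for illustration; changing it (say, to `gems-v2`) starts a fresh cache.

```yaml
    - name: Cache Ruby Gems
      uses: actions/cache@v1
      with:
        path: vendor/bundle
        # Exact-match key only: without restore-keys there is no fallback
        # to an older cache that does not match Gemfile.lock.
        key: ${{ runner.os }}-gems-v1-${{ hashFiles('**/Gemfile.lock') }}
```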

> As a user of actions/cache, what assumptions do I have to assert myself to guarantee a reliable build?

Depending on your workflow and ecosystem, it’s recommended to still run the dependency install step after caching. This allows the tooling to pull any missing dependencies while benefiting from the cached dependencies already available locally.
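For a Bundler project this could look roughly like the steps below. This is a sketch that assumes gems are installed with Bundler directly rather than through the `make bootstrap` target used in the workflow above; the `Install gems` step name is illustrative.

```yaml
    - name: Cache Ruby Gems
      uses: actions/cache@v1
      with:
        path: vendor/bundle
        key: ${{ runner.os }}-gems2-${{ hashFiles('**/Gemfile.lock') }}
        restore-keys: |
          ${{ runner.os }}-gems2-
    - name: Install gems
      run: |
        bundle config path vendor/bundle
        # Runs on every build; only gems missing from the restored cache are installed.
        bundle install --jobs 4 --retry 3
```

Note that this fills in missing gems. It would probably not, by itself, rebuild a native extension that is already present in the restored cache.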