Flaky builds when using cache actions with Ruby gems
I used the caching action to cache the Ruby gems used in building this Middleman website. This has greatly sped up the build, from 4-5 minutes without caching to under a minute.
However, it also makes the build too flaky, as the commit trail in PR 23 shows.
Bug or no Bug?
After some research and experiments, this is what I think is going on. Maintainers, please decide what to do with this issue; I’m undecided whether this is an actions/cache issue.
My project depends on ruby-sassc. On installation, this gem does some compilation. Before version 2.2.0, it compiled into a cross-platform gem. With version 2.2.0, this portability was dropped in favour of a speed improvement; see this insightful comment (and, if you’re into it, read the entire comment thread of sassc-ruby issue 146).
I suspect that every once in a while, this job gets executed by a worker whose architecture is incompatible with the compiled gem stored in the cache. Then the build fails. If I re-run the build, it gets picked up by a worker running on a compatible architecture and voilà, it passes again.
My workaround
Pin the ruby-sassc gem to version 2.1.0.
Observations
See the commit trail of PR 23, “Intermittent build failures”.
A good example:
Commit 3ab744a passes - run 461949888
Commit cc4b3ad (same codebase) fails - run 461965463
Failed builds can often be resolved by re-running them once or twice with the “re-run jobs” button.
The build always fails in the Build step, always with a message similar to:
[...]
/home/runner/work/XSCALE-Alliance.github.io/XSCALE-Alliance.github.io/vendor/bundle/ruby/2.5.0/gems/ffi-1.12.2/lib/ffi/library.rb:112: [BUG] Illegal instruction at 0x00007efffd9a5780
ruby 2.5.7p206 (2019-10-01 revision 67816) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0042 p:---- s:0222 e:000221 CFUNC :open
[...]
I have not observed this build error in builds without the caching step.
How to reproduce
It occurs intermittently, generally about once in ten builds, sometimes more frequently.
I reproduce it by pushing minor changes to this branch, as demonstrated by the commit log of PR 23.
Build configuration
The build configuration is per the Ruby example in this repo.
See verify_pull_request.yml on this PR branch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check Ruby Versions
        run: |
          echo "$RUNNER_TOOL_CACHE"
          ls $RUNNER_TOOL_CACHE/Ruby
      - uses: actions/checkout@v1
      - uses: actions/setup-ruby@v1
        with:
          ruby-version: '2.5'
      - name: Cache Ruby Gems
        uses: actions/cache@v1
        with:
          path: vendor/bundle
          key: ${{ runner.os }}-gems2-${{ hashFiles('**/Gemfile.lock') }}
          restore-keys: |
            ${{ runner.os }}-gems2-
      - name: Bootstrap
        run: |
          bundle config path vendor/bundle
          make bootstrap
      - name: Build
        run: make test
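If the hypothesis above holds and workers really do differ at the CPU level, one way to keep an incompatible compiled gem from being restored might be to fold a CPU fingerprint into the cache key. This is a minimal sketch, not a confirmed fix. The step id `cpu`, the output name `model`, and the use of the `flags` line in `/proc/cpuinfo` are assumptions made for illustration. The action version is kept as in the configuration above, and `restore-keys` is dropped so that only an exact match is restored.

```yaml
steps:
  # Hypothetical step: hash the CPU feature flags so that runners with
  # different instruction sets end up with different cache entries.
  - name: Fingerprint CPU
    id: cpu
    run: |
      flags="$(grep -m1 '^flags' /proc/cpuinfo | sha256sum | cut -c1-12)"
      echo "model=$flags" >> "$GITHUB_OUTPUT"
  - name: Cache Ruby Gems
    uses: actions/cache@v1
    with:
      path: vendor/bundle
      # Assumption: the fingerprint is added next to the OS and lockfile hash.
      key: ${{ runner.os }}-${{ steps.cpu.outputs.model }}-gems2-${{ hashFiles('**/Gemfile.lock') }}
```

The trade-off is a lower cache hit rate: each distinct fingerprint needs its own first, uncached build.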
Further experiments
I can try the following:
- verify the sassc version used. 2.2.0 has this problem; it was reported fixed in 2.2.1, but others have still reported similar issues with 2.2.1. 2.1.x apparently does not have this problem.
- run 10+ builds with sassc locked to version 2.1.0
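The second experiment could be run in a single push rather than ten, by fanning the job out with a build matrix. This is a rough sketch under the assumption that the repeated jobs are otherwise identical to the workflow above. The matrix dimension `attempt` is a hypothetical name and is not used by any step.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      # Let every attempt finish so the failure rate can be counted.
      fail-fast: false
      matrix:
        # Hypothetical dimension: its only purpose is to fan out
        # ten identical jobs.
        attempt: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v1
      # ... the remaining steps from the workflow above, unchanged ...
```

Each attempt may land on a different worker, so the share of failing attempts gives a rough estimate of how often the cached gems are incompatible.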
Top GitHub Comments
Thanks for your answers @joshmgross. Your pointers are clear and I believe I’m following them already. As I don’t have a clear suggestion on how to better handle this in actions/cache, I’ll close this issue now. Thank you for your help.

All hosted runners for a given label will be the same VM image and architecture. We do a monthly update of that image, but it’s primarily software updates.
It’s important to correctly choose a key that uniquely identifies a cache, such as including the runner OS and a hash of any dependency files (such as Gemfile.lock). Additionally, you should be careful with restore keys, as they allow pulling an older version of the cache that doesn’t match your primary key.
You can find more info at https://help.github.com/en/actions/configuring-and-managing-workflows/caching-dependencies-to-speed-up-workflows
Depending on your workflow and ecosystem, it’s recommended to still run the dependency install step after caching. This allows the tooling to pull any missing dependencies while benefiting from the cached dependencies already available locally.
I’m open to suggestions for how we can better handle this.
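The advice in this comment about keys, restore keys, and installing after the cache step might translate into something like the following. It is a sketch only: it assumes Bundler is invoked directly, whereas the workflow in the issue calls `make bootstrap`, which may already run the install. The step name `Install gems` is hypothetical.

```yaml
- name: Cache Ruby Gems
  uses: actions/cache@v1
  with:
    path: vendor/bundle
    # No restore-keys: only an exact match on OS and Gemfile.lock is restored,
    # so an older cache that does not match the primary key is never pulled.
    key: ${{ runner.os }}-gems2-${{ hashFiles('**/Gemfile.lock') }}
- name: Install gems
  run: |
    bundle config path vendor/bundle
    # Runs after the cache step and installs only what the restored cache lacks.
    bundle install --jobs 4 --retry 3
```

Note that Bundler treats a restored gem as already installed, so this step on its own would presumably not rebuild a native extension that was compiled on a different machine.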