Race condition while moving temporary .part files
Parallel builds on our Jenkins CI box frequently fail to resolve dependencies with the following error:
[error] (update) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar: download error: Caught java.nio.file.NoSuchFileException: /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar (/Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar) while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar
Coursier is trying to move .sbt-git-javadoc.jar.part to sbt-git-javadoc.jar, but .sbt-git-javadoc.jar.part no longer exists because another process has already moved it. This only happens when multiple SBT processes are simultaneously trying to resolve the same dependency into a shared cache.
The error seems to originate here. From my reading of the code, it looks like the structure lock is acquired here to download the .part file, then released to download auxiliary files, then acquired again to move the .part file to its final location. Another process could easily grab the lock and move the .part file in between those two locked sections.
Could this be resolved by wrapping all of these lines in the structure lock? Or are there other considerations that require the lock to be released in there? I’m happy to take a stab at this if someone can confirm that this would be a viable solution.
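To make the proposal concrete, here is a minimal sketch of doing the existence check and the move inside one locked section. It is not coursier's actual `FileCache` code: the class `PartFileFinalizer`, the method `finalizeDownload`, and the in-process `ReentrantLock` standing in for the structure lock are all hypothetical names chosen for illustration. The `demo` method calls the finalizer twice on the same .part file, which is the interleaving that currently ends in `NoSuchFileException`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; not taken from coursier. */
class PartFileFinalizer {
    // Stand-in for the cache's structure lock (assumption: in-process only here,
    // whereas the real lock has to work across separate SBT processes).
    private static final ReentrantLock STRUCTURE_LOCK = new ReentrantLock();

    /**
     * Moves part to dest while holding the lock.
     * Returns true if this caller performed the move, false if another
     * caller had already put dest in place.
     */
    static boolean finalizeDownload(Path part, Path dest) {
        STRUCTURE_LOCK.lock();
        try {
            if (!Files.exists(part)) {
                if (Files.exists(dest)) {
                    return false; // someone else already finished the move
                }
                throw new IllegalStateException("neither " + part + " nor " + dest + " exists");
            }
            Files.move(part, dest, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            STRUCTURE_LOCK.unlock();
        }
    }

    /** Finalizes the same .part file twice and returns how many calls actually moved it. */
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("part-race");
            Path part = dir.resolve(".artifact.jar.part");
            Path dest = dir.resolve("artifact.jar");
            Files.write(part, new byte[] {1, 2, 3});
            int moved = 0;
            if (finalizeDownload(part, dest)) moved++;
            // A bare Files.move here would throw NoSuchFileException.
            if (finalizeDownload(part, dest)) moved++;
            return moved;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("moves performed: " + demo()); // prints "moves performed: 1"
    }
}
```

The point of the sketch is only that the check and the move share one critical section, so the second caller sees a finished artifact instead of a missing .part file. Whether the real lock can be held across the auxiliary downloads is exactly the open question above.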
Versions
- Scala: 2.12.11
- SBT: 1.3.1
- Platforms: macOS and Linux
To Reproduce
Create four separate projects named coursier-race-condition-1, coursier-race-condition-2, etc. with the following project structure:
build.sbt
scalaVersion := "2.12.11"
project/build.properties
sbt.version = 1.3.13
project/plugins.sbt
addSbtPlugin("com.typesafe.sbt" % "sbt-git" % "1.0.0")
Run the following to start parallel SBT instances in each of these projects:
# Clear cached versions of `sbt-git`
rm -rv ~/$CACHE_LOCATION/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/
here=$(pwd)
for i in $(seq 1 4); do
cd "$here/coursier-race-condition-$i" && sbt compile &
done
Full Stacktrace
[error] lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar: download error: Caught java.nio.file.NoSuchFileException: /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar (/Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar) while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar
[error]
[error] at lmcoursier.internal.shaded.coursier.Artifacts$.$anonfun$fetchArtifacts$14(Artifacts.scala:302)
[error] at lmcoursier.internal.shaded.coursier.util.Task$.$anonfun$flatMap$2(Task.scala:14)
[error] at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
[error] at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
[error] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:748)
[error] Caused by: lmcoursier.internal.shaded.coursier.cache.ArtifactError$DownloadError: download error: Caught java.nio.file.NoSuchFileException: /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar (/Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar) while downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache$.helper$2(FileCache.scala:1018)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache$.coursier$cache$FileCache$$downloading(FileCache.scala:1032)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.doDownload$1(FileCache.scala:320)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.$anonfun$download$54(FileCache.scala:507)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$.loop$1(CacheLocks.scala:59)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$.withLockOr(CacheLocks.scala:84)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.$anonfun$download$32(FileCache.scala:508)
[error] at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
[error] at scala.util.Success.$anonfun$map$1(Try.scala:255)
[error] at scala.util.Success.map(Try.scala:213)
[error] at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
[error] at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
[error] at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
[error] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:748)
[error] Caused by: java.nio.file.NoSuchFileException: /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/.sbt-git-javadoc.jar.part -> /Users/bomarchman/Library/Caches/Coursier/v1/https/repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.typesafe.sbt/sbt-git/scala_2.12/sbt_1.0/1.0.0/docs/sbt-git-javadoc.jar
[error] at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
[error] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
[error] at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
[error] at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
[error] at java.nio.file.Files.move(Files.java:1395)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.$anonfun$download$46(FileCache.scala:412)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$$anon$1.call(CacheLocks.scala:22)
[error] at lmcoursier.internal.shaded.coursier.paths.CachePath.withStructureLock(CachePath.java:139)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$.withStructureLock(CacheLocks.scala:22)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.$anonfun$download$33(FileCache.scala:410)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache$.$anonfun$downloading$1(FileCache.scala:998)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$.withUrlLock(CacheLocks.scala:102)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache$.helper$2(FileCache.scala:998)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache$.coursier$cache$FileCache$$downloading(FileCache.scala:1032)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.doDownload$1(FileCache.scala:320)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.$anonfun$download$54(FileCache.scala:507)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$.loop$1(CacheLocks.scala:59)
[error] at lmcoursier.internal.shaded.coursier.cache.CacheLocks$.withLockOr(CacheLocks.scala:84)
[error] at lmcoursier.internal.shaded.coursier.cache.FileCache.$anonfun$download$32(FileCache.scala:508)
[error] at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
[error] at scala.util.Success.$anonfun$map$1(Try.scala:255)
[error] at scala.util.Success.map(Try.scala:213)
[error] at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
[error] at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
[error] at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
[error] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:748)
Issue Analytics
- Created 3 years ago
- Comments:30 (8 by maintainers)
Top GitHub Comments
@alexarchambault I constructed an example that triggers an error consistently on my laptop. The error is not always exactly the same, as it’s caused by a race condition. The actual dependency I’m downloading is not important; I chose Cats because it’s a somewhat bigger library that requires a lot of downloads.
Save the following as test.sc:
Then run:
This is still an issue, so I did a little black-box investigation. I ran the Ammonite script above and traced system calls to watch the actual actions on the file system.
Here is an example output. The file simulacrum-scalafix-annotations_2.13-0.5.1.jar.part is the one causing the error. My interpretation is that the two processes (the PID is the first column) both lock the structure lock file, then open simulacrum-scalafix-annotations_2.13-0.5.1.jar.part, then BOTH write to it, then both take the structure lock and rename it, then release the lock. Perhaps the check to see whether the file exists should be done within the structure lock?
In order to recreate this, you can use the following scripts on Linux: create run.sh:
then test.sc as described by scheleaap
then run using strace:
Then to get just the bits related to the file in question:
If there is any other investigation I can do to help fix this, please do ask. This bug is slowing our CI pipelines down considerably, because the shared cache in Jenkins is effectively broken when using parallelism within Mill.
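The trace described above shows two processes writing to the same .part file. One general way to rule that out, independent of where the existence check sits, is to give each writer its own temporary file and publish it with an atomic rename. The sketch below is an assumption-laden illustration of that technique, not coursier's implementation: `UniqueTempPublish`, `publish`, and `demo` are hypothetical names, and the two threads merely stand in for two separate build processes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.stream.Stream;

/** Hypothetical illustration only; not taken from coursier. */
class UniqueTempPublish {
    /** Writes content to a writer-private temp file next to dest, then renames it onto dest. */
    static void publish(Path dest, byte[] content) {
        try {
            // Each caller gets a distinct name, so no two writers share one .part file.
            Path tmp = Files.createTempFile(dest.getParent(), "." + dest.getFileName(), ".part");
            try {
                Files.write(tmp, content);
                // On POSIX systems (the macOS and Linux platforms reported here) an atomic
                // rename replaces an existing target; elsewhere this is implementation specific.
                Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
            } finally {
                Files.deleteIfExists(tmp); // no-op after a successful move
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Publishes the same artifact from two threads; true if dest is intact and no .part file is left. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("unique-part");
            Path dest = dir.resolve("artifact.jar");
            byte[] content = {1, 2, 3, 4};
            Thread a = new Thread(() -> publish(dest, content));
            Thread b = new Thread(() -> publish(dest, content));
            a.start();
            b.start();
            a.join();
            b.join();
            long leftovers;
            try (Stream<Path> files = Files.list(dir)) {
                leftovers = files.filter(p -> p.toString().endsWith(".part")).count();
            }
            return leftovers == 0 && Arrays.equals(content, Files.readAllBytes(dest));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("published cleanly: " + demo());
    }
}
```

The trade-off is that both writers still download the artifact, so this avoids the corrupted or missing .part file but not the duplicated work; keeping the check inside the structure lock, as suggested above, addresses both.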