Investigate directory enumeration performance
See original GitHub issueI noticed we’re using standard Directory.EnumerateFiles()
to enumerate files for globs. It’s not very efficient, and also runs the risks of throwing when it hits directories or files it can’t access.
Sample first-chance exception:
System.UnauthorizedAccessException: Access to the path 'C:\Documents and Settings' is denied.
at void System.IO.__Error.WinIOError(int errorCode, string maybeFullPath)
at void System.IO.FileSystemEnumerableIterator<TSource>.CommonInit()
at new System.IO.FileSystemEnumerableIterator<TSource>(string path, string originalUserPath, string searchPattern, SearchOption searchOption, SearchResultHandler<TSource> resultHandler, bool checkHost)
at IEnumerable<string> System.IO.Directory.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
at IEnumerable<string> Microsoft.Build.Shared.FileSystem.ManagedFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
at IEnumerable<string> Microsoft.Build.Shared.FileSystem.MSBuildOnWindowsFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
at IEnumerable<string> Microsoft.Build.Shared.FileSystem.CachingFileSystemWrapper.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFiles(IFileSystem fileSystem, string path, string filespec, string projectDirectory, bool stripProjectDirectory)
at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFileSystemEntries(IFileSystem fileSystem, FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory)
at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory) => { } x 2
at TValue System.Collections.Concurrent.ConcurrentDictionary<TKey, TValue>.GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity type, string path, string pattern, string directory, bool stripProjectDirectory) => { }
at IEnumerable<string> Microsoft.Build.Shared.FileMatcher.GetFilesForStep(RecursiveStepResult stepResult, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory)
at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)
at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)+(string subdir) => { }
at ParallelLoopResult System.Threading.Tasks.Parallel.ForEachWorker<TSource, TLocal>(IEnumerable<TSource> source, ParallelOptions parallelOptions, Action<TSource> body, Action<TSource, ParallelLoopState> bodyWithState, Action<TSource, ParallelLoopState, long> bodyWithStateAndIndex, Func<TSource, ParallelLoopState, TLocal, TLocal> bodyWithStateAndLocal, Func<TSource, ParallelLoopState, long, TLocal, TLocal> bodyWithEverything, Func<TLocal> localInit, Action<TLocal> localFinally)+(int i) => { }
at ParallelLoopResult System.Threading.Tasks.Parallel.ForWorker<TLocal>(int fromInclusive, int toExclusive, ParallelOptions parallelOptions, Action<int> body, Action<int, ParallelLoopState> bodyWithState, Func<int, ParallelLoopState, TLocal, TLocal> bodyWithLocal, Func<TLocal> localInit, Action<TLocal> localFinally)+() => { }
at void System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at void System.Threading.Tasks.Task.ExecuteSelfReplicating(Task root)+() => { }
at void System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at bool System.Threading.ThreadPoolWorkQueue.Dispatch()
at bool System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
Also seeing the same for C:\Config.Msi
when we accidentally enumerate the whole drive due to some property being empty and the glob ends up starting with a \
.
I’ve had success with directly calling the Win32 API in parallel to reduce allocations, achieving up to 2x speed and 0.5x allocations: https://github.com/KirillOsenkov/Benchmarks/blob/8556f92c07b9a3d211a7e72b776c324aff7e24b7/src/Tests/DirectoryEnumeration.cs#L12-L15
Also it seems that this approach doesn’t run into exceptions when trying to access inaccessible directories, unlike the BCL one.
Feel free to experiment with this benchmark, steal the source, try on real-world builds, see if you can tune it further, submit PRs if you can make it even faster 😉
The first place I would try this is in FileMatcher (see the stack above). Also, looking at the stack, I’d measure getting rid of the ConcurrentDictionary and try a simple collection with a lock around it. I often get much better results with a simple lock around simple collections.
I’m noticing we do have a ManagedFileSystem abstraction, so I guess we can try replacing the implementation in a single place and see if it can make our builds faster wholesale.
One potential concern is that the parallelism in the new method does a lot of thrashing, so not sure how this performs on an HDD. But then again, do we care about HDDs anymore?
Issue Analytics
- State:
- Created 8 months ago
- Comments:10 (10 by maintainers)
I did add Microsoft.IO.Redist to my benchmark and it is indeed even faster than my handcrafted approach! Kudos to Jeremy!
https://github.com/KirillOsenkov/Benchmarks/blob/f2c45821c2cf7243b040d2c1db5904bab8134cf8/src/Tests/DirectoryEnumeration.cs#L12-L16
I did follow up with the original stack that I’d pasted here, and it’s from 2021 😱 My apologies. Most of this issue is now invalid as we have transitioned to Microsoft.IO.Redist!
Remaining issues:
I won’t be offended if we close this issue outright or mark it as low priority 😉
Apologies I should have checked the MSBuild version before filing the issue.
@JeremyKuhne I told him the same thing internally 😃