Investigate directory enumeration performance


I noticed we’re using the standard Directory.EnumerateFiles() to enumerate files for globs. It’s not very efficient, and it also runs the risk of throwing when it hits directories or files it can’t access.

Sample first-chance exception:

System.UnauthorizedAccessException: Access to the path 'C:\Documents and Settings' is denied.
   at void System.IO.__Error.WinIOError(int errorCode, string maybeFullPath)
   at void System.IO.FileSystemEnumerableIterator<TSource>.CommonInit()
   at new System.IO.FileSystemEnumerableIterator<TSource>(string path, string originalUserPath, string searchPattern, SearchOption searchOption, SearchResultHandler<TSource> resultHandler, bool checkHost)
   at IEnumerable<string> System.IO.Directory.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.ManagedFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.MSBuildOnWindowsFileSystem.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IEnumerable<string> Microsoft.Build.Shared.FileSystem.CachingFileSystemWrapper.EnumerateFiles(string path, string searchPattern, SearchOption searchOption)
   at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFiles(IFileSystem fileSystem, string path, string filespec, string projectDirectory, bool stripProjectDirectory)
   at IReadOnlyList<string> Microsoft.Build.Shared.FileMatcher.GetAccessibleFileSystemEntries(IFileSystem fileSystem, FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory)
   at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity entityType, string path, string pattern, string projectDirectory, bool stripProjectDirectory) => { } x 2
   at TValue System.Collections.Concurrent.ConcurrentDictionary<TKey, TValue>.GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
   at Microsoft.Build.Shared.FileMatcher(IFileSystem fileSystem, ConcurrentDictionary<string, IReadOnlyList<string>> fileEntryExpansionCache)+(FileSystemEntity type, string path, string pattern, string directory, bool stripProjectDirectory) => { }
   at IEnumerable<string> Microsoft.Build.Shared.FileMatcher.GetFilesForStep(RecursiveStepResult stepResult, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory)
   at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)
   at void Microsoft.Build.Shared.FileMatcher.GetFilesRecursive(ConcurrentStack<List<string>> listOfFiles, RecursionState recursionState, string projectDirectory, bool stripProjectDirectory, IList<RecursionState> searchesToExclude, Dictionary<string, List<RecursionState>> searchesToExcludeInSubdirs, TaskOptions taskOptions)+(string subdir) => { }
   at ParallelLoopResult System.Threading.Tasks.Parallel.ForEachWorker<TSource, TLocal>(IEnumerable<TSource> source, ParallelOptions parallelOptions, Action<TSource> body, Action<TSource, ParallelLoopState> bodyWithState, Action<TSource, ParallelLoopState, long> bodyWithStateAndIndex, Func<TSource, ParallelLoopState, TLocal, TLocal> bodyWithStateAndLocal, Func<TSource, ParallelLoopState, long, TLocal, TLocal> bodyWithEverything, Func<TLocal> localInit, Action<TLocal> localFinally)+(int i) => { }
   at ParallelLoopResult System.Threading.Tasks.Parallel.ForWorker<TLocal>(int fromInclusive, int toExclusive, ParallelOptions parallelOptions, Action<int> body, Action<int, ParallelLoopState> bodyWithState, Func<int, ParallelLoopState, TLocal, TLocal> bodyWithLocal, Func<TLocal> localInit, Action<TLocal> localFinally)+() => { }
   at void System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at void System.Threading.Tasks.Task.ExecuteSelfReplicating(Task root)+() => { }
   at void System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at bool System.Threading.ThreadPoolWorkQueue.Dispatch()
   at bool System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

I’m also seeing the same exception for C:\Config.Msi when we accidentally enumerate the whole drive: some property is empty, so the glob ends up starting with a \.

I’ve had success with directly calling the Win32 API in parallel to reduce allocations, achieving up to 2x speed and 0.5x allocations: https://github.com/KirillOsenkov/Benchmarks/blob/8556f92c07b9a3d211a7e72b776c324aff7e24b7/src/Tests/DirectoryEnumeration.cs#L12-L15

Also it seems that this approach doesn’t run into exceptions when trying to access inaccessible directories, unlike the BCL one.

Feel free to experiment with this benchmark, steal the source, try on real-world builds, see if you can tune it further, submit PRs if you can make it even faster 😉

The first place I would try this is in FileMatcher (see the stack above). Also, looking at the stack, I’d measure getting rid of the ConcurrentDictionary and try a simple collection with a lock around it. I often get much better results with a simple lock around simple collections.
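
As a rough illustration of the "simple collection with a lock around it" idea, here is a minimal sketch of a cache with a GetOrAdd method backed by a plain Dictionary. The LockedCache type and its member names are hypothetical and are not MSBuild's actual code; whether this beats ConcurrentDictionary for the file entry expansion cache would still need to be measured.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the file-entry expansion cache; not an MSBuild type.
public sealed class LockedCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _entries = new Dictionary<TKey, TValue>();
    private readonly object _gate = new object();

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
    {
        lock (_gate)
        {
            if (_entries.TryGetValue(key, out TValue existing))
            {
                return existing;
            }
        }

        // Run the factory outside the lock so slow disk I/O does not block other threads.
        // Like ConcurrentDictionary.GetOrAdd, this may invoke the factory more than once
        // for the same key when several threads race; only the first stored value is kept.
        TValue created = valueFactory(key);

        lock (_gate)
        {
            if (_entries.TryGetValue(key, out TValue winner))
            {
                return winner;
            }

            _entries[key] = created;
            return created;
        }
    }
}
```

Keeping the factory call outside the lock is a design choice made for this sketch: it mirrors the ConcurrentDictionary behavior visible in the stack above, at the cost of occasionally doing duplicate work.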

I’m noticing we do have a ManagedFileSystem abstraction, so I guess we can try replacing the implementation in a single place and see if it can make our builds faster wholesale.

One potential concern is that the parallelism in the new method causes a lot of disk thrashing, so I’m not sure how it performs on an HDD. But then again, do we care about HDDs anymore?

Issue Analytics

  • State: closed
  • Created 8 months ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

2 reactions
KirillOsenkov commented, Feb 3, 2023

I did add Microsoft.IO.Redist to my benchmark and it is indeed even faster than my handcrafted approach! Kudos to Jeremy!

https://github.com/KirillOsenkov/Benchmarks/blob/f2c45821c2cf7243b040d2c1db5904bab8134cf8/src/Tests/DirectoryEnumeration.cs#L12-L16

I did follow up with the original stack that I’d pasted here, and it’s from 2021 😱 My apologies. Most of this issue is now invalid as we have transitioned to Microsoft.IO.Redist!

Remaining issues:

  • Pass IgnoreInaccessible
  • Investigate smaller perf issues as indicated by Dan in the previous reply
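
For the first remaining item, a minimal sketch of what passing IgnoreInaccessible could look like with the EnumerationOptions overload of Directory.EnumerateFiles. The TolerantEnumeration class and its method name are hypothetical, and this is not the actual MSBuild change. The sketch uses System.IO as on modern .NET; on .NET Framework with Microsoft.IO.Redist, the equivalent types are assumed to live in the Microsoft.IO namespace.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not an MSBuild type.
public static class TolerantEnumeration
{
    public static IEnumerable<string> EnumerateFilesSkippingInaccessible(string root, string searchPattern)
    {
        var options = new EnumerationOptions
        {
            // Skip directories and files that deny access instead of throwing
            // UnauthorizedAccessException in the middle of the enumeration.
            IgnoreInaccessible = true,
            RecurseSubdirectories = true,
            // EnumerationOptions skips Hidden and System entries by default;
            // clearing the mask keeps them in the results, which is an assumption
            // about the behavior a glob expansion would want.
            AttributesToSkip = 0,
        };

        return Directory.EnumerateFiles(root, searchPattern, options);
    }
}
```

Usage would be along the lines of TolerantEnumeration.EnumerateFilesSkippingInaccessible(@"C:\", "*.cs"), which should pass over protected folders such as the ones in the stack above rather than raising a first-chance exception.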

I won’t be offended if we close this issue outright or mark it as low priority 😉

Apologies, I should have checked the MSBuild version before filing the issue.

2 reactions
davkean commented, Feb 3, 2023

@JeremyKuhne I told him the same thing internally 😃
