Striding loop performance
System.Numerics.Vectors exposes SIMD-enhanced Vector classes. Using VS2015 Update 1 and the latest versions of the .NET Framework, F#, and System.Numerics.Vectors, code that uses System.Numerics performs worse than code that does not use it at all. For instance:
let sumVectorLoop =
    let mutable total = Vector<int>.Zero
    for i in 0 .. COUNT/8-1 do
        total <- total + vecArray.[i]
    total
Is slower than the same operation on an array of integers:
let sumsLoop =
    let mutable total = 0
    for i in 0 .. COUNT - 1 do
        total <- total + numsArray.[i]
    total
I have confirmed that Vector.IsHardwareAccelerated reports true. I have also confirmed that equivalent code in C# runs ~2x faster with the Vector approach. Interestingly, using Array.reduce on the vector array is faster than the imperative loop, which is the opposite of what happens with an array of ints, suggesting something may be amiss:
let sumVectorReduce =
    Array.reduce (fun a e -> a + e) vecArray
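For anyone trying to reproduce this, here is a minimal self-contained sketch of the setup. The issue does not show how COUNT, numsArray, or vecArray are defined, so the definitions below are assumptions, as is the sumStridingLoop variant that loads vectors straight from the int array with a stepped loop. It uses Vector<int>.Count instead of a hard-coded 8, since the lane count depends on the hardware. It only checks that the approaches agree on the result; timing is left to whatever benchmark harness you prefer.

```fsharp
open System.Numerics

// Assumed setup (not shown in the issue): COUNT is a multiple of the vector width.
let lanes = Vector<int>.Count
let COUNT = 1024 * lanes
let numsArray = Array.init COUNT (fun i -> i % 7)
// Pack consecutive runs of `lanes` ints into vectors.
let vecArray =
    Array.init (COUNT / lanes) (fun i -> Vector<int>(numsArray, i * lanes))

// Vector loop from the issue, followed by a horizontal add of the lanes.
let sumVectorLoop () =
    let mutable total = Vector<int>.Zero
    for i in 0 .. vecArray.Length - 1 do
        total <- total + vecArray.[i]
    Vector.Dot(total, Vector<int>.One)

// Scalar loop from the issue.
let sumsLoop () =
    let mutable total = 0
    for i in 0 .. COUNT - 1 do
        total <- total + numsArray.[i]
    total

// Hypothetical striding variant: step through the int array one vector at a time.
let sumStridingLoop () =
    let mutable total = Vector<int>.Zero
    for i in 0 .. lanes .. COUNT - 1 do
        total <- total + Vector<int>(numsArray, i)
    Vector.Dot(total, Vector<int>.One)
```

All three functions should return the same sum, which makes them a reasonable starting point for comparing timings.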
Issue Analytics
- Created 8 years ago
- Reactions:1
- Comments:13 (12 by maintainers)
Top GitHub Comments
I started to take a look at this, and it’s not easy.
One problem is that the F# “FastIntegerLoop” TAST construct can’t represent striding loops. It could be extended, but this has to be done with care since the construct can (and does) occur in optimization information and the representations of inlined functions. Ideally care should be taken that DLLs that generate this new construct be consumable by down-level F# compilers, but that’s hard to arrange.
Another problem is that “F#-style loops”
for x in n .. step .. m
are currently generated using a “bne” (branch-not-equals) instruction for the end condition. This is done because m might be MaxInt. But this won’t work for striding loops - a less-than operation is needed. And a less-than operation doesn’t work when m is above MaxInt - step, since a wrap-around occurs.
Perhaps we could just sacrifice semantics for striding loops near the MaxInt condition - though whatever we do, parity with C# is really needed. Perhaps I need to look more closely at C# code generation for these cases.
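To make the wrap-around concrete, here is a small sketch. It is not the compiler's code generation: naiveCount and safeCount are hypothetical helpers that simulate the two end-condition strategies with ordinary F# code, assuming a positive step. The naive version uses a less-than-or-equal test and relies on the default unchecked Int32 addition, so it keeps running once the counter wraps; the limit parameter is only there to stop the demonstration. The safe version computes the trip count up front in 64-bit arithmetic, which is one possible way to avoid the problem.

```fsharp
/// Count iterations of a naive "i <= m" striding loop,
/// giving up after `limit` iterations so the demo terminates.
let naiveCount (n: int) (step: int) (m: int) (limit: int) =
    let mutable i = n
    let mutable count = 0
    while i <= m && count < limit do
        count <- count + 1
        i <- i + step   // unchecked Int32 addition wraps around on overflow
    count

/// Overflow-safe alternative: compute the trip count in 64-bit arithmetic.
/// This sketch handles positive steps only.
let safeCount (n: int) (step: int) (m: int) =
    if step <= 0 then invalidArg "step" "this sketch handles positive steps only"
    if n > m then 0L
    else (int64 m - int64 n) / int64 step + 1L

let maxInt = System.Int32.MaxValue
// Intended iterations: maxInt - 10 and maxInt - 2.
let naive = naiveCount (maxInt - 10) 8 (maxInt - 1) 1000
let safe = safeCount (maxInt - 10) 8 (maxInt - 1)
```

With these inputs the loop should run twice, which is what safe reports. The naive loop overshoots MaxInt on the second increment, wraps to a negative value, and only stops because it hits the limit of 1000.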
Yes, we should fix this, definitely.