[3.3.0] Performance regression
This issue was originally created in fs2, so the version numbers below refer to fs2.
I’ve observed a performance degradation in some scenarios on version 3.2.3.
The throughput on small byte streams remained the same.
The throughput on bigger byte streams decreased by roughly 20%.
The memory allocation rate decreased by 10-15% on bigger byte streams, although per-operation allocation went up slightly (see the GC figures below).
See the benchmarks below for more details.
Stream usage
The project utilizes a TCP socket from the fs2-io module. I cannot share many details due to an NDA, but the generalized usage of the socket is as follows:
    val streamDecoder: StreamDecoder[Structure] =
      StreamDecoder.many(StructureDecoder)

    socket.reads
      .through(streamDecoder.toPipeByte)
      .evalMap(structure => queue.offer(Right(structure)))
      .compile
      .background
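For reference, here is a self-contained sketch of roughly the same pipeline. Everything in it beyond the lines quoted above (the toy Structure type, the toy decoder, the queue wiring and the readLoop helper) is an assumption standing in for the NDA'd code, not the actual implementation:

    // Hypothetical sketch only; Structure, StructureDecoder and readLoop are
    // made-up stand-ins for the real (NDA'd) types.
    import cats.effect.{IO, Outcome, Resource}
    import cats.effect.std.Queue
    import fs2.io.net.Socket
    import fs2.interop.scodec.StreamDecoder // fs2-scodec module (fs2 >= 3.2)
    import scodec.Decoder
    import scodec.codecs.int64

    object ReadLoopSketch {

      final case class Structure(id: Long) // toy message type
      val StructureDecoder: Decoder[Structure] = // toy decoder: a single 64-bit field
        int64.map(Structure.apply)

      val streamDecoder: StreamDecoder[Structure] =
        StreamDecoder.many(StructureDecoder)

      // Reads bytes from the socket, decodes them into Structures and hands them
      // to a consumer via the queue, running the whole loop on a background fiber.
      def readLoop(
          socket: Socket[IO],
          queue: Queue[IO, Either[Throwable, Structure]]
      ): Resource[IO, IO[Outcome[IO, Throwable, Unit]]] =
        socket.reads
          .through(streamDecoder.toPipeByte)
          .evalMap(structure => queue.offer(Right(structure)))
          .compile
          .drain      // compile the stream down to IO[Unit] first...
          .background // ...then run it as a background fiber (a Resource)
    }

The decoded structures are handed off through a Queue so the caller can consume them independently of the socket read loop, which is presumably why the reader is compiled and run in the background.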
Consumed bytes per single invocation:
- createOne: 167 bytes
- returnRandomUUID: 135 bytes
- return100Record: 2683 bytes
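The tables below are standard JMH output: average-time mode with 25 samples per benchmark (Cnt = 25), and the allocation rows come from JMH's GC profiler. A minimal harness producing that kind of output could look like the sketch below; only the class and method names come from the results, while the bodies are placeholders for the NDA'd driver calls:

    // Hypothetical JMH skeleton (e.g. via sbt-jmh); method bodies are placeholders.
    import java.util.concurrent.TimeUnit
    import org.openjdk.jmh.annotations._

    @State(Scope.Benchmark)
    @BenchmarkMode(Array(Mode.AverageTime))
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    class DriverBenchmark {

      @Setup
      def setup(): Unit = {
        // open the TCP connection / start the driver here
      }

      @Benchmark
      def createOne(): Unit = {
        // request consuming ~167 bytes per invocation
      }

      @Benchmark
      def returnRandomUUID(): Unit = {
        // request consuming ~135 bytes per invocation
      }

      @Benchmark
      def return100Records(): Unit = {
        // request consuming ~2683 bytes per invocation
      }
    }

The gc.* rows are JMH "secondary results", produced by running with the GC profiler enabled, e.g. `jmh:run -prof gc` under sbt-jmh.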
3.2.2
Operation average time:
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| DriverBenchmark.createOne | avgt | 25 | 4.725 | ± 0.396 | ms/op |
| DriverBenchmark.return100Records | avgt | 25 | 8.613 | ± 0.488 | ms/op |
| DriverBenchmark.returnRandomUUID | avgt | 25 | 3.542 | ± 1.156 | ms/op |
Memory allocation:
| Benchmark | Score | Error | Units |
|---|---|---|---|
| DriverBenchmark.return100Records:·gc.alloc.rate | 201.323 | ± 11.576 | MB/sec |
| DriverBenchmark.return100Records:·gc.alloc.rate.norm | 3117144.635 | ± 13036.746 | B/op |
| DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space | 199.083 | ± 23.093 | MB/sec |
| DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm | 3076695.652 | ± 283692.848 | B/op |
| DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen | 0.074 | ± 0.113 | MB/sec |
| DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm | 1266.657 | ± 1981.461 | B/op |
| DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space | 0.484 | ± 0.444 | MB/sec |
| DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm | 7614.493 | ± 6998.312 | B/op |
| DriverBenchmark.return100Records:·gc.count | 72.000 | | counts |
| DriverBenchmark.return100Records:·gc.time | 300.000 | | ms |
3.2.3
Operation average time:
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| DriverBenchmark.createOne | avgt | 25 | 4.862 | ± 0.414 | ms/op |
| DriverBenchmark.return100Records | avgt | 25 | 11.008 | ± 0.356 | ms/op |
| DriverBenchmark.returnRandomUUID | avgt | 25 | 3.068 | ± 0.299 | ms/op |
Memory allocation:
| Benchmark | Score | Error | Units |
|---|---|---|---|
| DriverBenchmark.return100Records:·gc.alloc.rate | 169.961 | ± 6.376 | MB/sec |
| DriverBenchmark.return100Records:·gc.alloc.rate.norm | 3383113.516 | ± 12432.017 | B/op |
| DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space | 175.288 | ± 24.073 | MB/sec |
| DriverBenchmark.return100Records:·gc.churn.G1_Eden_Space.norm | 3484911.125 | ± 432781.648 | B/op |
| DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen | 0.061 | ± 0.075 | MB/sec |
| DriverBenchmark.return100Records:·gc.churn.G1_Old_Gen.norm | 1224.015 | ± 1529.533 | B/op |
| DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space | 0.484 | ± 0.566 | MB/sec |
| DriverBenchmark.return100Records:·gc.churn.G1_Survivor_Space.norm | 9489.291 | ± 11071.370 | B/op |
| DriverBenchmark.return100Records:·gc.count | 56.000 | | counts |
| DriverBenchmark.return100Records:·gc.time | 433.000 | | ms |
Comments: 45 (45 by maintainers)
Looking at the time delta, this seems about in line with what I would expect for tracing on code which is mostly compute bound: about a 25% difference. Fully compute bound would more than likely be about 30%.
The really interesting thing here, though, is the GC time. The tracing benchmark hit fewer GC iterations, but GC still took almost twice as long. I wonder if that means we could optimize tracing a bit further in practice by streamlining GC costs?
Edit: actually that shift only happened in 3.3.0, so it’s almost certainly being caused by the weak bag shenanigans. We’re forcing the GC to work harder in order to avoid bogging down the critical path.
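For reference, a quick back-of-the-envelope check against the return100Records numbers above (assuming those are the figures the time delta refers to):

    // Rough arithmetic on the return100Records figures quoted above
    val before = 8.613                       // ms/op on 3.2.2
    val after  = 11.008                      // ms/op on 3.2.3
    val slowdown = (after - before) / before // ≈ 0.278, i.e. roughly 25-30% slower

    // GC: fewer collections on 3.2.3, but each one costs noticeably more
    val gcPerCollection322 = 300.0 / 72      // ≈ 4.2 ms per collection
    val gcPerCollection323 = 433.0 / 56      // ≈ 7.7 ms per collection

So even though there are fewer collections on 3.2.3, each one costs nearly twice as much, which would fit the weak-bag explanation above: more work pushed onto the collector in exchange for keeping it off the hot path.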