ParquetSerializer 100x slower than ParquetConvert to serialise IEnumerable<T>
Issue description
ParquetSerialisationPerformance.zip
Attached is a repro using BenchmarkDotNet, which shows that ParquetSerializer.SerializeAsync can be more than 100x slower than ParquetConvert.SerializeAsync.
I have a Dictionary<TestKey, TestValue> of records that I want to serialise. The repro benchmarks 10, 100, 200 and 300 records, which is enough to show the performance difference, although my real use case has around 2000 records. TestKey has 4 properties and TestValue has 1000 properties; again, this is enough to show the difference, but my real use case has 4000 properties. I convert the Dictionary<TestKey, TestValue> to a collection of TestRecord objects, either an IEnumerable<TestRecord> or a List<TestRecord>, and serialise this collection.
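For readers without the attachment, here is a minimal sketch of the two conversions described above. It is not the code from the repro: the property names (Id, Region, Value0, Value1) and the helper names ToEnumerable and ToList are hypothetical, and the types are cut down to two properties each rather than 4 and 1000.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, cut-down shapes. In the issue, TestKey has 4 properties
// and TestValue has 1000; the names below are illustrative only.
public record TestKey(int Id, string Region);
public record TestValue(double Value0, double Value1);
public record TestRecord(int Id, string Region, double Value0, double Value1);

public static class Program
{
    // Lazy projection: nothing is built until the sequence is enumerated,
    // and each enumeration runs the projection again.
    public static IEnumerable<TestRecord> ToEnumerable(Dictionary<TestKey, TestValue> source) =>
        source.Select(kv => new TestRecord(kv.Key.Id, kv.Key.Region, kv.Value.Value0, kv.Value.Value1));

    // Materialised projection: the records are built once and stored in a list.
    public static List<TestRecord> ToList(Dictionary<TestKey, TestValue> source) =>
        ToEnumerable(source).ToList();

    public static void Main()
    {
        var data = new Dictionary<TestKey, TestValue>
        {
            [new TestKey(1, "north")] = new TestValue(1.0, 2.0),
            [new TestKey(2, "south")] = new TestValue(3.0, 4.0),
        };

        IEnumerable<TestRecord> lazy = ToEnumerable(data);
        List<TestRecord> materialised = ToList(data);

        // Both collections describe the same records; only the
        // point at which the records are constructed differs.
        Console.WriteLine(lazy.Count());
        Console.WriteLine(materialised.Count);
    }
}
```

Either collection can then be handed to ParquetConvert.SerializeAsync or ParquetSerializer.SerializeAsync, which is what the Enumerable and List variants in the benchmark compare.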
On my PC, the output from BenchmarkDotNet is below.
| Method | RecordCount | Mean | Error | StdDev |
|------------------------------------- |------------ |-------------:|----------:|----------:|
| ParquetConvertSerializeEnumerable | 10 | 32.24 ms | 0.213 ms | 0.199 ms |
| ParquetConvertSerializeList | 10 | 31.79 ms | 0.298 ms | 0.279 ms |
| ParquetSerializerSerializeEnumerable | 10 | 1,343.28 ms | 10.501 ms | 9.823 ms |
| ParquetSerializerSerializeList | 10 | 1,005.20 ms | 1.576 ms | 1.397 ms |
| ParquetConvertSerializeEnumerable | 100 | 84.89 ms | 0.420 ms | 0.373 ms |
| ParquetConvertSerializeList | 100 | 83.32 ms | 0.281 ms | 0.249 ms |
| ParquetSerializerSerializeEnumerable | 100 | 4,088.74 ms | 9.388 ms | 8.782 ms |
| ParquetSerializerSerializeList | 100 | 1,085.49 ms | 8.067 ms | 7.546 ms |
| ParquetConvertSerializeEnumerable | 200 | 117.84 ms | 0.745 ms | 0.697 ms |
| ParquetConvertSerializeList | 200 | 114.20 ms | 0.466 ms | 0.390 ms |
| ParquetSerializerSerializeEnumerable | 200 | 6,951.59 ms | 20.213 ms | 18.908 ms |
| ParquetSerializerSerializeList | 200 | 1,133.79 ms | 8.000 ms | 7.484 ms |
| ParquetConvertSerializeEnumerable | 300 | 141.11 ms | 1.781 ms | 1.488 ms |
| ParquetConvertSerializeList | 300 | 136.93 ms | 2.421 ms | 2.378 ms |
| ParquetSerializerSerializeEnumerable | 300 | 10,084.32 ms | 60.075 ms | 56.194 ms |
| ParquetSerializerSerializeList | 300 | 1,191.82 ms | 7.001 ms | 6.549 ms |
In my real use case, ParquetConvert.SerializeAsync(IEnumerable<T> …) runs in under 1 second whereas ParquetSerializer.SerializeAsync(IEnumerable<T> …) runs for over 30 minutes.
Issue Analytics
- Created 6 months ago
- Comments:9 (6 by maintainers)
I reran my benchmarks with 4.6.2, increasing the record count to 20000.
ParquetSerializer is still slightly slower than ParquetConvert but the difference is small enough now to not matter for my real use case. Thank you for your quick response.
Hey, thanks. Please continue using ParquetConvert for now, it’s still really good. ParquetSerializer is feature complete but there is some performance penalty. I’ve already brought it down from 9x to 1.38x slower, and it should be faster than ParquetConvert in the next minor update.