ParquetSerializer 100x slower than ParquetConvert to serialise IEnumerable<T>

Issue description

ParquetSerialisationPerformance.zip

Attached is a repro with Benchmark.Net which shows ParquetSerializer.SerializeAsync can be more than 100x slower than ParquetConvert.SerializeAsync.

I have a Dictionary<TestKey, TestValue> of records that I want to serialise. The repro benchmarks 10, 100, 200 and 300 records, which is enough to show the performance difference, although my real use case has around 2000 records. TestKey has 4 properties and TestValue has 1000 properties; again, this is enough to show the difference, although my real use case has 4000 properties. I convert the Dictionary<TestKey, TestValue> to a collection of TestRecord objects, either IEnumerable<TestRecord> or List<TestRecord>, and serialise that collection.

On my PC, the output from Benchmark.Net is below.

|                               Method | RecordCount |         Mean |     Error |    StdDev |
|------------------------------------- |------------ |-------------:|----------:|----------:|
|    ParquetConvertSerializeEnumerable |          10 |     32.24 ms |  0.213 ms |  0.199 ms |
|          ParquetConvertSerializeList |          10 |     31.79 ms |  0.298 ms |  0.279 ms |
| ParquetSerializerSerializeEnumerable |          10 |  1,343.28 ms | 10.501 ms |  9.823 ms |
|       ParquetSerializerSerializeList |          10 |  1,005.20 ms |  1.576 ms |  1.397 ms |
|    ParquetConvertSerializeEnumerable |         100 |     84.89 ms |  0.420 ms |  0.373 ms |
|          ParquetConvertSerializeList |         100 |     83.32 ms |  0.281 ms |  0.249 ms |
| ParquetSerializerSerializeEnumerable |         100 |  4,088.74 ms |  9.388 ms |  8.782 ms |
|       ParquetSerializerSerializeList |         100 |  1,085.49 ms |  8.067 ms |  7.546 ms |
|    ParquetConvertSerializeEnumerable |         200 |    117.84 ms |  0.745 ms |  0.697 ms |
|          ParquetConvertSerializeList |         200 |    114.20 ms |  0.466 ms |  0.390 ms |
| ParquetSerializerSerializeEnumerable |         200 |  6,951.59 ms | 20.213 ms | 18.908 ms |
|       ParquetSerializerSerializeList |         200 |  1,133.79 ms |  8.000 ms |  7.484 ms |
|    ParquetConvertSerializeEnumerable |         300 |    141.11 ms |  1.781 ms |  1.488 ms |
|          ParquetConvertSerializeList |         300 |    136.93 ms |  2.421 ms |  2.378 ms |
| ParquetSerializerSerializeEnumerable |         300 | 10,084.32 ms | 60.075 ms | 56.194 ms |
|       ParquetSerializerSerializeList |         300 |  1,191.82 ms |  7.001 ms |  6.549 ms |

In my real use case, ParquetConvert.SerializeAsync(IEnumerable<T> …) runs in under 1 second whereas ParquetSerializer.SerializeAsync(IEnumerable<T> …) runs for over 30 minutes.
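
To put the benchmark table above in perspective, the slowdown factor for each record count can be derived by dividing the ParquetSerializer mean by the matching ParquetConvert mean. The following is a minimal sketch of that arithmetic, not part of the original repro: Python is used purely as a calculator, the numbers are copied from the Mean column, and the names `MEANS_MS`, `slowdown` and `slowdown_table` are made up for illustration.

```python
# Mean times in milliseconds, copied from the Mean column of the table above.
# Outer key: RecordCount. Inner key: benchmark method name.
MEANS_MS = {
    10: {
        "ParquetConvertSerializeEnumerable": 32.24,
        "ParquetConvertSerializeList": 31.79,
        "ParquetSerializerSerializeEnumerable": 1343.28,
        "ParquetSerializerSerializeList": 1005.20,
    },
    100: {
        "ParquetConvertSerializeEnumerable": 84.89,
        "ParquetConvertSerializeList": 83.32,
        "ParquetSerializerSerializeEnumerable": 4088.74,
        "ParquetSerializerSerializeList": 1085.49,
    },
    200: {
        "ParquetConvertSerializeEnumerable": 117.84,
        "ParquetConvertSerializeList": 114.20,
        "ParquetSerializerSerializeEnumerable": 6951.59,
        "ParquetSerializerSerializeList": 1133.79,
    },
    300: {
        "ParquetConvertSerializeEnumerable": 141.11,
        "ParquetConvertSerializeList": 136.93,
        "ParquetSerializerSerializeEnumerable": 10084.32,
        "ParquetSerializerSerializeList": 1191.82,
    },
}


def slowdown(slow_ms, fast_ms):
    """Return how many times slower slow_ms is than fast_ms."""
    return slow_ms / fast_ms


def slowdown_table(means):
    """Return (record_count, enumerable_factor, list_factor) rows, sorted by count."""
    rows = []
    for count, m in sorted(means.items()):
        enumerable_factor = slowdown(
            m["ParquetSerializerSerializeEnumerable"],
            m["ParquetConvertSerializeEnumerable"],
        )
        list_factor = slowdown(
            m["ParquetSerializerSerializeList"],
            m["ParquetConvertSerializeList"],
        )
        rows.append((count, enumerable_factor, list_factor))
    return rows


if __name__ == "__main__":
    for count, enumerable_factor, list_factor in slowdown_table(MEANS_MS):
        print(f"{count:>4} records: IEnumerable {enumerable_factor:.1f}x, "
              f"List {list_factor:.1f}x")
```

On these figures the IEnumerable<T> slowdown grows with the record count (roughly 42x at 10 records to roughly 71x at 300), while the List<T> slowdown shrinks (roughly 32x down to roughly 9x) because the ParquetSerializer List<T> time stays close to 1 second throughout.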

Issue Analytics

  • State: closed
  • Created: 6 months ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

El-Gor-do commented, Mar 28, 2023 (1 reaction)

I reran my benchmarks with version 4.6.2, increasing the maximum record count to 20000.

|                            Method | RecordCount |        Mean |     Error |    StdDev |
|---------------------------------- |------------ |------------:|----------:|----------:|
| ParquetConvertSerializeEnumerable |          10 |    32.35 ms |  0.279 ms |  0.233 ms |
|    ParquetSerializerSerializeList |          10 |    32.55 ms |  0.481 ms |  0.450 ms |
| ParquetConvertSerializeEnumerable |         100 |    85.20 ms |  0.389 ms |  0.364 ms |
|    ParquetSerializerSerializeList |         100 |    92.99 ms |  0.554 ms |  0.491 ms |
| ParquetConvertSerializeEnumerable |         200 |   120.02 ms |  1.801 ms |  1.684 ms |
|    ParquetSerializerSerializeList |         200 |   135.60 ms |  2.018 ms |  1.685 ms |
| ParquetConvertSerializeEnumerable |         300 |   144.49 ms |  2.438 ms |  2.280 ms |
|    ParquetSerializerSerializeList |         300 |   164.66 ms |  1.732 ms |  1.620 ms |
| ParquetConvertSerializeEnumerable |        1000 |   310.13 ms |  2.344 ms |  1.957 ms |
|    ParquetSerializerSerializeList |        1000 |   373.56 ms |  2.681 ms |  2.508 ms |
| ParquetConvertSerializeEnumerable |        2000 |   562.90 ms |  2.512 ms |  2.098 ms |
|    ParquetSerializerSerializeList |        2000 |   719.13 ms |  5.156 ms |  4.823 ms |
| ParquetConvertSerializeEnumerable |        5000 | 1,685.66 ms |  6.709 ms |  6.275 ms |
|    ParquetSerializerSerializeList |        5000 | 1,995.09 ms |  6.349 ms |  5.939 ms |
| ParquetConvertSerializeEnumerable |       10000 | 3,399.74 ms |  8.466 ms |  7.069 ms |
|    ParquetSerializerSerializeList |       10000 | 4,165.52 ms | 21.434 ms | 20.049 ms |
| ParquetConvertSerializeEnumerable |       20000 | 6,806.18 ms | 21.799 ms | 20.391 ms |
|    ParquetSerializerSerializeList |       20000 | 8,546.03 ms | 14.272 ms | 12.652 ms |

ParquetSerializer is still slightly slower than ParquetConvert but the difference is small enough now to not matter for my real use case. Thank you for your quick response.

aloneguid commented, Mar 27, 2023 (1 reaction)

Hey, thanks. Please continue using ParquetConvert for now; it’s still really good. ParquetSerializer is feature complete, but there is some performance penalty. I’ve already brought it down from 9x to 1.38x slower, and it should be faster than ParquetConvert in the next minor update.
