Does apache/avro perform better on pypy?


I ran some tests to compare the performance of fastavro and apache/avro on pypy and cpython. Here is a summary of the results. I hope the contributors can confirm whether this aligns with their expectations.

AVRO SCHEMA

BIG_SCHEMA_OLD = {
    "type": "record",
    "name": "test_record",
    "fields": [
        {
            "type": ["null", "int"],
            "name": "union_field"
        },
        {
            "type": ["null", "int"],
            "name": "union_field_null",
            "default": None
        },
        {
            "type": ["null", "int"],
            "name": "union_field_101",
            "default": 101
        },
        {
            "type": "boolean",
            "name": "bool_field"
        },
        {
            "type": "boolean",
            "name": "bool_field_F",
            "default": False
        },
        {
            "type": "string",
            "name": "string_field"
        },
        {
            "type": "string",
            "name": "string_field_foo",
            "default": "foo❤"
        },
        {
            "type": "bytes",
            "name": "bytes_field"
        },
        {
            "type": "bytes",
            "name": "bytes_field_bar",
            "default": "bar"
        },
        {
            "type": "int",
            "name": "int_field"
        },
        {
            "type": "int",
            "name": "int_field_1",
            "default": 1
        },
        {
            "type": "long",
            "name": "long_field"
        },
        {
            "type": "long",
            "name": "long_field_42",
            "default": 42
        },
        {
            "type": "float",
            "name": "float_field"
        },
        {
            "type": "float",
            "name": "float_field_p75",
            "default": 0.75
        },
        {
            "type": "double",
            "name": "double_field"
        },
        {
            "type": "double",
            "name": "double_field_pi",
            "default": 3.14
        }
    ]
}

Number of repetitions: 100000

reader/writer type used

  • I am using schemaless_reader and schemaless_writer for these tests.
  • Providing a reader_schema to schemaless_reader gives significantly poorer performance.
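A minimal sketch of how such a measurement could be set up, not the benchmark code actually used for this issue. It assumes Python 3 and an installed fastavro, and it uses a shortened two-field schema in place of BIG_SCHEMA_OLD. The names SCHEMA, READER_SCHEMA, RECORD, time_total and run_fastavro are hypothetical, and the reader schema (the writer schema plus one defaulted field) is an assumption, since the issue does not say which reader_schema was passed.

```python
import io
import time

try:
    import fastavro  # third-party; assumed to be installed for a real run
except ImportError:
    fastavro = None

# Hypothetical shortened stand-in for BIG_SCHEMA_OLD.
SCHEMA = {
    "type": "record",
    "name": "test_record",
    "fields": [
        {"type": ["null", "int"], "name": "union_field"},
        {"type": "string", "name": "string_field"},
    ],
}

# Assumed reader schema: the writer schema plus one field that has a default.
READER_SCHEMA = {
    "type": "record",
    "name": "test_record",
    "fields": SCHEMA["fields"] + [
        {"type": ["null", "int"], "name": "union_field_null", "default": None},
    ],
}

RECORD = {"union_field": 7, "string_field": "foo"}


def time_total(fn, repetitions):
    """Return total wall-clock seconds spent calling fn() `repetitions` times."""
    start = time.perf_counter()
    for _ in range(repetitions):
        fn()
    return time.perf_counter() - start


def run_fastavro(repetitions=100000):
    """Time schemaless writes and reads; returns None if fastavro is missing."""
    if fastavro is None:
        return None
    writer_schema = fastavro.parse_schema(SCHEMA)
    reader_schema = fastavro.parse_schema(READER_SCHEMA)

    def write():
        buf = io.BytesIO()
        fastavro.schemaless_writer(buf, writer_schema, RECORD)
        return buf.getvalue()

    payload = write()

    def read_plain():
        return fastavro.schemaless_reader(io.BytesIO(payload), writer_schema)

    def read_resolved():
        # The third positional argument is reader_schema.
        return fastavro.schemaless_reader(
            io.BytesIO(payload), writer_schema, reader_schema
        )

    return {
        "fastavro_writer": time_total(write, repetitions),
        "fastavro_reader": time_total(read_plain, repetitions),
        "fastavro_reader_with_reader_schema": time_total(read_resolved, repetitions),
    }


if __name__ == "__main__":
    print(run_fastavro(repetitions=1000))
```

Timing the three closures separately keeps buffer creation inside each measurement, so the reader numbers are comparable to each other but include the cost of building a BytesIO per call.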

benchmark on pypy

  • avro_reader and avro_writer correspond to apache/avro
  • fastavro_reader and fastavro_writer correspond to schemaless_reader and schemaless_writer of fastavro
  • the unit is seconds for the total number of repetitions

with reader_schema in schemaless_reader

{
    "avro_reader": 1.5274810791015625,
    "avro_writer": 1.061816930770874,
    "fastavro_reader": 8.852604866027832,
    "fastavro_writer": 2.948662042617798
}

without reader_schema in schemaless_reader

{
    "avro_reader": 1.5450429916381836,
    "avro_writer": 1.0277588367462158,
    "fastavro_reader": 1.7646219730377197,
    "fastavro_writer": 2.8703200817108154
}

benchmark on py27

with reader_schema in schemaless_reader

{
    "avro_reader": 15.267684936523438,
    "avro_writer": 15.52902102470398,
    "fastavro_reader": 13.891213178634644,
    "fastavro_writer": 5.328428030014038
}

without reader_schema in schemaless_reader

{
    "avro_reader": 15.217741966247559,
    "avro_writer": 15.30265998840332,
    "fastavro_reader": 4.296072006225586,
    "fastavro_writer": 5.25685715675354
}

Two things stood out to me:

  1. Using reader_schema causes a performance regression. Can we do something about that, or is it a penalty we have to pay?
  2. I would expect fastavro to perform better than apache/avro on pypy.
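To put numbers on both observations, the totals reported in this issue can be turned into ratios. The sketch below only copies the fastavro and apache/avro timings from the result blocks; the names RESULTS, reader_schema_slowdown and fastavro_vs_avro are hypothetical helpers, not part of either library.

```python
# Totals in seconds for 100000 repetitions, copied from the results in this issue.
RESULTS = {
    "pypy": {
        "with_reader_schema": {
            "avro_reader": 1.5274810791015625,
            "avro_writer": 1.061816930770874,
            "fastavro_reader": 8.852604866027832,
            "fastavro_writer": 2.948662042617798,
        },
        "without_reader_schema": {
            "avro_reader": 1.5450429916381836,
            "avro_writer": 1.0277588367462158,
            "fastavro_reader": 1.7646219730377197,
            "fastavro_writer": 2.8703200817108154,
        },
    },
    "py27": {
        "with_reader_schema": {
            "avro_reader": 15.267684936523438,
            "avro_writer": 15.52902102470398,
            "fastavro_reader": 13.891213178634644,
            "fastavro_writer": 5.328428030014038,
        },
        "without_reader_schema": {
            "avro_reader": 15.217741966247559,
            "avro_writer": 15.30265998840332,
            "fastavro_reader": 4.296072006225586,
            "fastavro_writer": 5.25685715675354,
        },
    },
}


def reader_schema_slowdown(interpreter):
    """How many times slower fastavro_reader is when reader_schema is passed."""
    runs = RESULTS[interpreter]
    return (
        runs["with_reader_schema"]["fastavro_reader"]
        / runs["without_reader_schema"]["fastavro_reader"]
    )


def fastavro_vs_avro(interpreter, operation):
    """fastavro time divided by apache/avro time; above 1 means fastavro is slower."""
    run = RESULTS[interpreter]["without_reader_schema"]
    return run["fastavro_" + operation] / run["avro_" + operation]


if __name__ == "__main__":
    for name in ("pypy", "py27"):
        print(name, "reader_schema slowdown:", round(reader_schema_slowdown(name), 2))
        for op in ("reader", "writer"):
            print(name, op, "fastavro/avro:", round(fastavro_vs_avro(name, op), 2))
```

On these numbers, passing reader_schema makes the fastavro reader roughly 5 times slower on pypy and roughly 3.2 times slower on py27, while the fastavro writer on pypy takes about 2.8 times as long as the apache/avro writer.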

Happy to hear from others.

We should definitely add some regression tests to fastavro to catch performance-related issues between releases.
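One possible shape for such a regression check, sketched with the standard library only and assuming Python 3. The helper names best_of and is_regression, the 25% tolerance, and the idea of comparing against a stored baseline are all assumptions for illustration; fastavro's own test suite may approach this differently.

```python
import time


def best_of(fn, repetitions, rounds=5):
    """Return the fastest total time in seconds over `rounds` runs.

    Each run calls fn() `repetitions` times. Taking the minimum reduces
    noise from other processes on the machine.
    """
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        for _ in range(repetitions):
            fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def is_regression(baseline_seconds, current_seconds, tolerance=0.25):
    """True when the current timing exceeds the baseline by more than `tolerance`.

    `tolerance` is a fraction, so 0.25 allows the current run to be up to
    25% slower than the baseline before it is flagged.
    """
    return current_seconds > baseline_seconds * (1.0 + tolerance)
```

A test could record best_of(...) for the previous release on the same machine, measure the current checkout the same way, and fail when is_regression(...) returns True. Absolute timings differ between machines, so the baseline would need to be produced in the same environment as the comparison run.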

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

1 reaction
scottbelden commented, May 20, 2018

Thanks for all of the results. I don’t run with pypy normally so I’m afraid I can’t really say why the pypy2 performance is worse than the standard avro library (and why in pypy3 this flips). Patches to improve the pypy performance would be welcome, but I’m probably not going to be able to contribute them.

0 reactions
abrarsheikh commented, May 20, 2018

Good point. I updated the benchmark code so that the writer/reader for apache avro are now part of the benchmark. Surprisingly, this didn’t change the benchmarking results.

I also fixed the number of rounds and iterations for pytest-benchmark, which now gives more reproducible runs.

[Two screenshots, taken 2018-05-19]
