does apache/avro perform better on pypy?

I ran some tests to compare the performance of fastavro and apache/avro on PyPy and CPython. A summary of the results follows. I hope the contributors can confirm whether this matches their expectations.

AVRO SCHEMA

BIG_SCHEMA_OLD = {
    "type": "record",
    "name": "test_record",
    "fields": [
        {
            "type": ["null", "int"],
            "name": "union_field"
        },
        {
            "type": ["null", "int"],
            "name": "union_field_null",
            "default": None
        },
        {
            "type": ["null", "int"],
            "name": "union_field_101",
            "default": 101
        },
        {
            "type": "boolean",
            "name": "bool_field"
        },
        {
            "type": "boolean",
            "name": "bool_field_F",
            "default": False
        },
        {
            "type": "string",
            "name": "string_field"
        },
        {
            "type": "string",
            "name": "string_field_foo",
            "default": "foo❤"
        },
        {
            "type": "bytes",
            "name": "bytes_field"
        },
        {
            "type": "bytes",
            "name": "bytes_field_bar",
            "default": "bar"
        },
        {
            "type": "int",
            "name": "int_field"
        },
        {
            "type": "int",
            "name": "int_field_1",
            "default": 1
        },
        {
            "type": "long",
            "name": "long_field"
        },
        {
            "type": "long",
            "name": "long_field_42",
            "default": 42
        },
        {
            "type": "float",
            "name": "float_field"
        },
        {
            "type": "float",
            "name": "float_field_p75",
            "default": 0.75
        },
        {
            "type": "double",
            "name": "double_field"
        },
        {
            "type": "double",
            "name": "double_field_pi",
            "default": 3.14
        }
    ]
}

Number of repetitions: 100000

reader/writer type used

I am using schemaless_reader and schemaless_writer for these tests.
Providing a reader_schema to schemaless_reader results in significantly poorer performance.
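
For reference, the two read paths being compared look roughly like the sketch below. It uses a cut-down two-field schema (SMALL_SCHEMA, an assumption for brevity rather than the full BIG_SCHEMA_OLD) and a hypothetical helper named round_trip. fastavro is imported lazily so the snippet still loads where fastavro is not installed.

```python
import io

# Cut-down schema for illustration only (not the full BIG_SCHEMA_OLD).
SMALL_SCHEMA = {
    "type": "record",
    "name": "test_record",
    "fields": [
        {"type": ["null", "int"], "name": "union_field"},
        {"type": "string", "name": "string_field"},
    ],
}


def round_trip(record, writer_schema, reader_schema=None):
    """Write one record with schemaless_writer and read it back.

    round_trip is a hypothetical helper, not part of fastavro.
    """
    # Lazy import: fastavro is a third-party package and may be absent.
    import fastavro

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer_schema, record)
    buf.seek(0)
    # reader_schema=None decodes with the writer schema only; passing a
    # schema makes fastavro resolve writer schema against reader schema.
    return fastavro.schemaless_reader(
        buf, writer_schema, reader_schema=reader_schema
    )
```

Calling round_trip(record, SMALL_SCHEMA) corresponds to the "without reader_schema" runs, and round_trip(record, SMALL_SCHEMA, SMALL_SCHEMA) to the "with reader_schema" runs.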

benchmark on pypy

avro_reader and avro_writer correspond to apache/avro
fastavro_reader and fastavro_writer correspond to schemaless_reader and schemaless_writer of fastavro
the unit is seconds for the total number of repetitions
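
A minimal sketch of how a "total seconds for N repetitions" figure can be measured. The original benchmark code is not shown in this thread, so the helper name time_total is hypothetical and json stands in for the Avro reader/writer callables.

```python
import json
import time


def time_total(func, repetitions):
    """Return total wall-clock seconds spent calling func() `repetitions` times."""
    start = time.perf_counter()
    for _ in range(repetitions):
        func()
    return time.perf_counter() - start


# Stand-in payload and callables (assumption: json replaces Avro here).
record = {"int_field": 1, "string_field": "foo"}
encoded = json.dumps(record)

results = {
    "stand_in_writer": time_total(lambda: json.dumps(record), 1000),
    "stand_in_reader": time_total(lambda: json.loads(encoded), 1000),
}
print(json.dumps(results, indent=4))
```

Swapping the two lambdas for calls into apache/avro or fastavro would produce a result dictionary shaped like the ones reported here.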

with reader_schema in schemaless_reader

{
    "avro_reader": 1.5274810791015625,
    "avro_writer": 1.061816930770874,
    "fastavro_reader": 8.852604866027832,
    "fastavro_writer": 2.948662042617798
}

without reader_schema in schemaless_reader

{
    "avro_reader": 1.5450429916381836,
    "avro_writer": 1.0277588367462158,
    "fastavro_reader": 1.7646219730377197,
    "fastavro_writer": 2.8703200817108154
}

benchmark on py27

with reader_schema in schemaless_reader

{
    "avro_reader": 15.267684936523438,
    "avro_writer": 15.52902102470398,
    "fastavro_reader": 13.891213178634644,
    "fastavro_writer": 5.328428030014038
}

without reader_schema in schemaless_reader

{
    "avro_reader": 15.217741966247559,
    "avro_writer": 15.30265998840332,
    "fastavro_reader": 4.296072006225586,
    "fastavro_writer": 5.25685715675354
}

Two things stood out to me:

Use of reader_schema causes a performance regression. Can we do something about that, or is it a penalty we have to pay?
I would expect fastavro to perform better than apache/avro on PyPy.
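
To put both observations in numbers, the ratios can be computed directly from the timings reported above. The values below are copied from those results and rounded to three decimals; the variable names are only for illustration.

```python
# Timings in seconds for 100000 repetitions, rounded from the results above.
pypy_with = {"avro_reader": 1.527, "avro_writer": 1.062,
             "fastavro_reader": 8.853, "fastavro_writer": 2.949}
pypy_without = {"avro_reader": 1.545, "avro_writer": 1.028,
                "fastavro_reader": 1.765, "fastavro_writer": 2.870}
py27_with = {"avro_reader": 15.268, "avro_writer": 15.529,
             "fastavro_reader": 13.891, "fastavro_writer": 5.328}
py27_without = {"avro_reader": 15.218, "avro_writer": 15.303,
                "fastavro_reader": 4.296, "fastavro_writer": 5.257}

# Observation 1: slowdown of fastavro_reader when a reader_schema is passed.
penalty_pypy = pypy_with["fastavro_reader"] / pypy_without["fastavro_reader"]
penalty_py27 = py27_with["fastavro_reader"] / py27_without["fastavro_reader"]

# Observation 2: fastavro time relative to apache/avro on PyPy
# (no reader_schema); values above 1.0 mean fastavro is slower.
reader_ratio = pypy_without["fastavro_reader"] / pypy_without["avro_reader"]
writer_ratio = pypy_without["fastavro_writer"] / pypy_without["avro_writer"]

print("reader_schema penalty: pypy %.1fx, py27 %.1fx" % (penalty_pypy, penalty_py27))
print("fastavro/avro on pypy: reader %.2fx, writer %.2fx" % (reader_ratio, writer_ratio))
```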

Happy to hear from others.

We should definitely add some regression tests to fastavro to catch performance-related issues between releases.
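
One possible shape for such a check, as a sketch only: compare a measured time against a stored baseline and fail when it is slower by more than a tolerance. The helper name check_regression and the 20% tolerance are assumptions for illustration, not anything fastavro provides.

```python
def check_regression(name, measured_seconds, baseline_seconds, tolerance=0.20):
    """Fail if measured time exceeds the baseline by more than `tolerance`.

    Returns the measured/baseline ratio when the check passes.
    """
    limit = baseline_seconds * (1.0 + tolerance)
    if measured_seconds > limit:
        raise AssertionError(
            "%s regressed: %.3fs measured, %.3fs allowed"
            % (name, measured_seconds, limit)
        )
    return measured_seconds / baseline_seconds


# Example using the PyPy fastavro_writer timings from this issue:
# 2.949s against a 2.870s baseline is within the 20% tolerance.
ratio = check_regression("fastavro_writer", 2.949, 2.870)
```

With the same tolerance, the PyPy fastavro_reader timing with reader_schema (8.853s) checked against the timing without it (1.765s) would raise, which is the kind of jump such a test is meant to catch.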

Issue Analytics

State:
Created 5 years ago
Comments:11 (2 by maintainers)

Top GitHub Comments

scottbelden commented, May 20, 2018

Thanks for all of the results. I don’t run with pypy normally so I’m afraid I can’t really say why the pypy2 performance is worse than the standard avro library (and why in pypy3 this flips). Patches to improve the pypy performance would be welcome, but I’m probably not going to be able to contribute them.

abrarsheikh commented, May 20, 2018

Good point. I updated the benchmark code so that the writer/reader for apache/avro are now part of the benchmark. Surprisingly, this didn’t change the benchmarking results.

I also fixed the number of rounds and iterations for pytest-benchmark, which now gives more reproducible runs.

[Screenshot: benchmark results, 2018-05-19]
