Add low-level read and skip APIs
The official Avro Python library supports low-level APIs for reading and skipping individual fields in an Avro-encoded byte stream. For example, it has functions like read_enum, skip_enum, skip_null, read_null, and so on. I don’t think that fastavro has anything like this, so this issue is a feature request to add those functions.
I have a use case for this. I am scanning through and indexing very large amounts of astronomical data which have a large and complicated schema. Out of the ~180 fields, I actually only care about 5 of them for my purposes. An encoded message might be about 100KB, and I’m interested in about 20 or 30 of those bytes.
Parsing the entire schema document with `avro-python` takes a while - about 10ms. But by using `skip` methods, and only `read`ing the fields I care about, I’m able to get that way down to about 0.3ms. This is a substantial difference for me!
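To illustrate the skip-and-read idea, here is a minimal pure-Python sketch. It assumes a hypothetical four-field record (`id`: long, `payload`: string, `mag`: double, `name`: string) and hand-rolls the Avro binary encoding rules (zigzag varints for longs, length-prefixed UTF-8 for strings, 8-byte little-endian doubles). The helper names are made up for this example and are not the API of either the official Avro library or fastavro.

```python
import io
import struct


def encode_long(n: int) -> bytes:
    """Encode a 64-bit int as an Avro zigzag varint (used here only to build test data)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro strings are a long length followed by that many UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


def read_long(buf: io.BytesIO) -> int:
    """Decode one zigzag varint from the stream."""
    shift = 0
    result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(buf: io.BytesIO) -> str:
    return buf.read(read_long(buf)).decode("utf-8")


def skip_string(buf: io.BytesIO) -> None:
    """Read only the length prefix, then seek past the payload without decoding it."""
    buf.seek(read_long(buf), io.SEEK_CUR)


def skip_double(buf: io.BytesIO) -> None:
    """Doubles are always 8 bytes, so skipping is a fixed-size seek."""
    buf.seek(8, io.SEEK_CUR)


def read_id_and_name(data: bytes):
    """Walk the fields in schema order, decoding only the two we want."""
    buf = io.BytesIO(data)
    obj_id = read_long(buf)   # field 1: id (wanted)
    skip_string(buf)          # field 2: payload (skipped)
    skip_double(buf)          # field 3: mag (skipped)
    name = read_string(buf)   # field 4: name (wanted)
    return obj_id, name


# Hypothetical message: a large payload sandwiched between the fields of interest.
message = (
    encode_long(42)
    + encode_string("x" * 1000)
    + struct.pack("<d", 17.5)
    + encode_string("M31")
)
print(read_id_and_name(message))
```

The saving comes from never decoding or allocating the skipped values: a skipped string costs one varint read and one seek, regardless of its length.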
Now, using `fastavro` to read the entire object takes about 0.8ms, which is quite good - but it’s outperformed by my hand-tuned code which skips fields using ~~slowavro~~ the official Avro library by about 2x.
In alignment with `fastavro`’s goals, then, I think adding support for low-level `skip` and `read` APIs would be helpful. In really performance-critical or extreme circumstances, it makes it possible to write much faster code, which occasionally is needed.
What do you think about adding `skip` methods, and supporting the low-level `read` functions which are already written in a public API?
@spenczar I put together what I think should be an initial working implementation. I haven’t spent any time to do profiling or performance testing yet, but I’m hoping it should be pretty easy for you to re-run your test with the new changes.
Wheels are here:
- Windows - https://github.com/fastavro/fastavro/actions/runs/521802879
- Linux - https://github.com/fastavro/fastavro/actions/runs/521802875
- OSX - https://github.com/fastavro/fastavro/actions/runs/521802874
As you can see here, the test cases that do the manual calls to `skip_*` show that it would be painful for users to have to do that themselves, and the fact that the interfaces are different depending on whether or not the cython modules are being used is less than ideal. So I’m hoping that the simple approach of using the subschema reader will be sufficient.

All that being said, I can take a stab at putting in the `skip_` functions over the weekend and then when the CI builds and wheels are available to install, perhaps you could see if you can add it to the timing analysis.