
ENH: feather or parquet driver?


I’ve had a very quick go at the new feather format by @wesm & @hadley to (de-)serialise DataFrame objects, here’s a working example. Given the substantial gains in read/write times, would it make sense to include an experimental driver for feather in geopandas? It would go something like:

db = gpd.read_file('mygeo.feather', driver='feather')
db.to_file('mynewgeo.feather', driver='feather')

Under the hood it’d serialise the geometry column into wkb_hex (or any other format if faster/more efficient, this was my first go at it) and back into shapely geoms when reading.
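The round trip described above can be sketched with shapely alone (a minimal illustration of the idea, not the proposed driver itself):

```python
from shapely import wkb
from shapely.geometry import Point

geom = Point(1.0, 2.0)

# write path: shapely geometry -> WKB hex string (a plain text column)
hex_str = geom.wkb_hex

# read path: WKB hex string -> shapely geometry
restored = wkb.loads(hex_str, hex=True)

assert restored.equals(geom)
```

In the driver this conversion would be applied column-wise to the geometry column before handing the frame to the feather writer, and reversed on read.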

Tagging @ljwolf as this is the fruit of discussing with him too.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 28 (17 by maintainers)

Top GitHub Comments

7 reactions
knaaptime commented, Mar 22, 2019

Is this still being considered for inclusion? My colleagues and I regularly use parquet for storing geodataframes, so I’d be happy to submit a PR if it’s of interest.

4 reactions
brendan-ward commented, Oct 7, 2019

I’d like to see write benchmarks, especially for geo data, before dropping feather. In my case, writes consume more time than reads, and in my own tiny benchmarks in geofeather I have seen more variation in write times than in read times (compared to shapefile).
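A small timing harness along these lines (a generic sketch, not the geofeather benchmark itself) makes that run-to-run variation visible rather than reporting only a single best case:

```python
import time

def bench_write(write_fn, repeats=5):
    """Call write_fn several times and return (min, max) wall-clock
    seconds, so variation across runs is visible, not just the fastest."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        write_fn()
        times.append(time.perf_counter() - start)
    return min(times), max(times)

# hypothetical usage: compare drivers on the same GeoDataFrame, e.g.
#   fast, slow = bench_write(lambda: gdf.to_file("out.shp"))
```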

For those engaged here, I’d recommend supporting these as optional dependencies, the way pandas treats pyarrow for parquet, even if the implementations move directly into geopandas, so that the dependency on pyarrow etc. stays optional.

@darcy-r not sure that a GDAL driver for feather makes sense, especially given the above post from Wes, but a parquet driver may make sense since it is a storage format.

@snowman2 I think there are really only two parts of a spec we are talking about here, since the substantive specs are at the feather and parquet levels:

  1. Store geometries as WKB.
  2. Store CRS info attached to the file containing the data (preferably in that file’s metadata, instead of in a separate file as I did).

It seems like parquet can give us all of the above and, perhaps by disabling compression, keep I/O competitive with or better than feather. We just need to build out support in a consistent way first, then benchmark.

@darcy-r I’d be happy to take a shot at integrating your parquet work with my feather work and pulling together a PR to introduce both, if that seems reasonable to you?


Top Results From Across the Web

What are the differences between feather and parquet?
Feather seems better for light weight data, as it writes and loads faster. Parquet has better storage ratios. Feather library support and ...

Choosing the Right HDFS File Format for Your Apache Spark ...
Spark's default file format is Parquet. Parquet has a number of advantages that improves the performance of querying and filtering the data.

Feather vs Parquet vs CSV vs Jay - Shabbir Bawaji - Medium
Feather is a file format that sometimes outperforms even parquet but is really not the file format to use while saving boolean file...

Stop persisting pandas data frames in CSVs
Advantages of pickle, parquet, and others: faster, more reliable and efficient. Different methods how to persist pandas dataframe.

Loading data into a Pandas DataFrame - a performance study
... Feather, Parquet or HDF5] or in a database [Microsoft SQL Server]. ... performs slightly better than the two other drivers [pyodbc and...
