ENH: feather or parquet driver?
I’ve had a very quick go at the new feather format by @wesm & @hadley for (de-)serialising DataFrame objects; here’s a working example. Given the substantial gains in read/write times, would it make sense to include an experimental driver for feather in geopandas? It would go something like:
db = gpd.read_file('mygeo.feather', driver='feather')
db.to_file('mynewgeo.feather', driver='feather')
Under the hood it’d serialise the geometry column into WKB hex (or any other format if faster/more efficient; this was just my first go at it) and convert back into shapely geometries when reading.
Tagging @ljwolf, as this is the fruit of discussions with him too.
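For reference, here is a minimal sketch of what such a driver could do under the hood, assuming pandas’ feather support (via pyarrow) and shapely’s WKB helpers; the function names are illustrative, not an actual geopandas API:

import geopandas as gpd
import pandas as pd
from shapely import wkb

def write_feather(gdf, path):
    # Replace the geometry column with its WKB hex representation so that
    # every column has a dtype feather can serialise.
    df = pd.DataFrame(gdf.drop(columns=gdf.geometry.name))
    df['geometry'] = gdf.geometry.apply(lambda geom: geom.wkb_hex)
    df.reset_index(drop=True).to_feather(path)

def read_feather(path, crs=None):
    # Read the plain frame back and rebuild shapely geometries from WKB hex.
    df = pd.read_feather(path)
    geoms = df['geometry'].apply(lambda h: wkb.loads(h, hex=True))
    return gpd.GeoDataFrame(df.drop(columns=['geometry']), geometry=geoms, crs=crs)

A real driver would also need to round-trip the CRS (e.g. in a sidecar file or metadata field), which the sketch above ignores.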

Is this still being considered for inclusion? My colleagues and I regularly use parquet for storing geodataframes, so I’d be happy to submit a PR if it’s of interest.
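As a rough illustration of that kind of workflow (a sketch only, not an actual geopandas API), the parquet case mirrors the feather sketch above, except that parquet columns can hold raw bytes, so plain WKB works and pandas’ pyarrow-backed to_parquet/read_parquet do the I/O:

import geopandas as gpd
import pandas as pd
from shapely import wkb

def write_parquet(gdf, path):
    df = pd.DataFrame(gdf.drop(columns=gdf.geometry.name))
    df['geometry'] = gdf.geometry.apply(lambda geom: geom.wkb)  # raw WKB bytes
    df.to_parquet(path)  # uses pyarrow when installed

def read_parquet(path, crs=None):
    df = pd.read_parquet(path)
    geoms = df['geometry'].apply(wkb.loads)  # shapely accepts WKB bytes directly
    return gpd.GeoDataFrame(df.drop(columns=['geometry']), geometry=geoms, crs=crs)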
I’d like to see write benchmarks - esp. for geo data - before dropping feather. In my case, writes consume more time than reads, and in my own tiny benchmarks in geofeather I have seen more variation in write times than in read times (compared to shapefile).

For those engaged here, I’d recommend a similar approach to supporting these as optional dependencies, like parquet and pyarrow as used in pandas - or, if the implementations move directly into geopandas, keeping the dependencies on pyarrow etc. optional.

@darcy-r not sure that a GDAL driver for feather makes sense, esp. given the above post from Wes, but a parquet driver may make sense since it is a storage format.

@snowman2 I think there are really only two parts of a spec we are talking about here, since the substantive specs are at the feather and parquet levels: wkb

It seems like parquet can give us all of the above, and perhaps by disabling compression, keep I/O competitive with or better than feather (a rough timing sketch follows below). We just need to build out support first in a consistent way, then benchmark.

@darcy-r I’d be happy to run with trying to integrate your work with parquet and mine with feather, and pull together a PR to introduce both, if that seems reasonable to you?
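On the compression point, a crude way to compare would be timing writes with and without compression. This is just a sketch (single-shot timing, illustrative paths, and df is assumed to already hold WKB-encoded geometries as in the sketches above), with pandas passing the compression keyword through to pyarrow:

import time
import pandas as pd

def time_write(df, path, **kwargs):
    # Single-shot timing; a real benchmark would repeat and average.
    start = time.perf_counter()
    df.to_parquet(path, **kwargs)
    return time.perf_counter() - start

# print(time_write(df, 'geo_snappy.parquet', compression='snappy'))  # pandas default
# print(time_write(df, 'geo_plain.parquet', compression=None))       # no compression

Snappy is the pandas default; turning it off trades file size for write speed, which is exactly the trade-off worth measuring against feather before settling on either format.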