
ENH: feather or parquet driver?


I’ve had a very quick go at the new feather format by @wesm & @hadley to (de-)serialise DataFrame objects, here’s a working example. Given the substantial gains in read/write times, would it make sense to include an experimental driver for feather in geopandas? It would go something like:

db = gpd.read_file('mygeo.feather', driver='feather')
db.to_file('mynewgeo.feather', driver='feather')

Under the hood it’d serialise the geometry column into wkb_hex (or any other format if faster/more efficient, this was my first go at it) and back into shapely geoms when reading.
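The round trip described above can be sketched with shapely alone (a minimal illustration of the idea, not the proposed driver itself):

```python
from shapely import wkb
from shapely.geometry import Point

geom = Point(1.0, 2.0)

# write path: shapely geometry -> WKB hex string (a plain text column)
hex_str = geom.wkb_hex

# read path: WKB hex string -> shapely geometry
restored = wkb.loads(hex_str, hex=True)

assert restored.equals(geom)
```

In the driver this conversion would be applied column-wise to the geometry column before handing the frame to the feather writer, and reversed on read.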

Tagging @ljwolf as this is the fruit of discussing with him too.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 28 (17 by maintainers)

Top GitHub Comments

7 reactions
knaaptime commented, Mar 22, 2019

Is this still being considered for inclusion? My colleagues and I regularly use parquet for storing geodataframes, so I’d be happy to submit a PR if it’s of interest.

4 reactions
brendan-ward commented, Oct 7, 2019

I’d like to see write benchmarks, especially for geo data, before dropping feather. In my case, writes consume more time than reads, and in my own tiny benchmarks in geofeather I have seen more variation in write times than in read times (compared to shapefile).
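A small timing harness along these lines (a generic sketch, not the geofeather benchmark itself) makes that run-to-run variation visible rather than reporting only a single best case:

```python
import time

def bench_write(write_fn, repeats=5):
    """Call write_fn several times and return (min, max) wall-clock
    seconds, so variation across runs is visible, not just the fastest."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        write_fn()
        times.append(time.perf_counter() - start)
    return min(times), max(times)

# hypothetical usage: compare drivers on the same GeoDataFrame, e.g.
#   fast, slow = bench_write(lambda: gdf.to_file("out.shp"))
```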

For those engaged here, I’d recommend supporting these as optional dependencies, the way pandas treats pyarrow for parquet, even if the implementations move directly into geopandas, so that the dependency on pyarrow etc. stays optional.

@darcy-r not sure that a GDAL driver for feather makes sense, especially given the above post from Wes, but a parquet driver may make sense since it is a storage format.

@snowman2 I think there are really only two parts of a spec we are talking about here, since the substantive specs are at the feather and parquet levels:

  1. Store geometries as WKB.
  2. Store CRS info attached to the file containing the data (preferably in that file’s metadata, instead of in a separate file as I did).

It seems like parquet can give us all of the above and, perhaps by disabling compression, keep I/O competitive with or better than feather. We just need to build out support in a consistent way first, then benchmark.

@darcy-r I’d be happy to take a shot at integrating your parquet work with my feather work and pulling together a PR to introduce both, if that seems reasonable to you?


Top Results From Across the Web

What are the differences between feather and parquet?
Feather seems better for light weight data, as it writes and loads faster. Parquet has better storage ratios. Feather library support and ...

Choosing the Right HDFS File Format for Your Apache Spark ...
Spark's default file format is Parquet. Parquet has a number of advantages that improves the performance of querying and filtering the data.

Feather vs Parquet vs CSV vs Jay - Shabbir Bawaji - Medium
Feather is a file format that sometimes outperforms even parquet but is really not the file format to use while saving boolean file...

Stop persisting pandas data frames in CSVs
Advantages of pickle, parquet, and others: faster, more reliable and efficient. Different methods how to persist pandas dataframe.

Loading data into a Pandas DataFrame - a performance study
... Feather, Parquet or HDF5] or in a database [Microsoft SQL Server]. ... performs slightly better than the two other drivers [pyodbc and...
