
Improved reading and writing from/to PostGIS (SQL in general?) support


Currently we only have a very basic read_postgis function, and we certainly want a write function as well (https://github.com/geopandas/geopandas/issues/189). However, we also have several open (overlapping) PRs and issues related to improving the IO support for PostGIS, so I thought I would open a new general issue to get an overview.

Open PRs:

https://github.com/geopandas/geopandas/pull/440 PR adding to_postgis
https://github.com/geopandas/geopandas/pull/457 PR with both read/write for postgis
https://github.com/geopandas/geopandas/pull/546 PR to use geoalchemy in from_postgis
one not related to postgis: https://github.com/geopandas/geopandas/pull/101 PR to add support for sqlite

Does anybody have insight into what the main differences are between the PostGIS PRs? How should we proceed with those?

Some questions related to this that we might need to answer:

Do we want to use geoalchemy (https://geoalchemy-2.readthedocs.io/en/latest/)? (and thus add it as an optional requirement) What does it bring?
Can we actually support more than PostGIS? More general SQL support? (https://github.com/geopandas/geopandas/issues/490) E.g. MySQL also has spatial data types (https://dev.mysql.com/doc/refman/5.7/en/spatial-datatypes.html), but geoalchemy does not seem to support those.
Naming of the functions (https://github.com/geopandas/geopandas/issues/161): currently GeoDataFrame.from_postgis and read_postgis. Depending on the question above, we might want to make it more general (read_sql, to_sql). Personally I would retire ‘from_postgis’ in favor of read_postgis (or read_sql) anyhow.

There is some relevant discussion in https://github.com/geopandas/geopandas/issues/161 as well.

Other related issues: https://github.com/geopandas/geopandas/issues/451 on adding SRID support in read_postgis

cc @jdmcbr @dimitri-justeau @showjackyang @adamboche @kuanb @emiliom @perrygeo @carsonfarmer


Top GitHub Comments


HTenkanen commented, Jan 2, 2020

Hi @Sangarshanan and thanks for reviving this and your contributions! 👍

I finally had time to get back to this. I went through your edits @Sangarshanan and incorporated them into this Gist: https://gist.github.com/HTenkanen/3b214be899f0d3885bad48577de48150

I left some of the earlier parts as they were, so that the function can automatically handle a mix of single and multi-geometries (e.g. a mix of Polygon and MultiPolygon).
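One way to picture the single vs. multi-geometry handling described above is a small helper that picks one column type for the whole table. This is only a sketch under assumptions: the function name `resolve_geometry_type` is hypothetical, it works on plain type-name strings rather than real geometry objects, and it is not taken from the Gist.

```python
def resolve_geometry_type(type_names):
    """Pick one PostGIS-style column type for a collection of geometry type names.

    Hypothetical helper for illustration; not the logic used in the Gist.
    """
    unique = {name for name in type_names if name is not None}
    if not unique:
        # Nothing to inspect, fall back to the generic type.
        return "GEOMETRY"

    # Compare base types by stripping a leading "Multi" prefix.
    bases = {name[5:] if name.startswith("Multi") else name for name in unique}
    if len(bases) > 1:
        # Unrelated types (e.g. Point and Polygon) need the generic type.
        return "GEOMETRY"

    base = bases.pop()
    if any(name.startswith("Multi") for name in unique):
        # A mix such as Polygon + MultiPolygon is stored as the multi type;
        # single geometries would have to be promoted before writing.
        return ("Multi" + base).upper()
    return base.upper()


print(resolve_geometry_type(["Polygon", "MultiPolygon", "Polygon"]))
```

With this rule, a Polygon/MultiPolygon mix resolves to the multi type, while unrelated types fall back to the generic GEOMETRY column.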

@jorisvandenbossche: I also updated the CRS reading to use the new pyproj CRS class, so it should now work quite nicely with different types of CRS information. In addition, I tested swapping from shapely.wkb to pygeos.wkb, as it also improved performance somewhat.
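To make the WKB step more concrete, here is a minimal sketch of what an SRID-tagged WKB (the EWKB variant that PostGIS accepts) looks like for a single point. It is hand-rolled with the standard library purely for illustration; the function name `point_to_ewkb_hex` is hypothetical, and the approach in the comment relies on shapely.wkb or pygeos.wkb rather than code like this.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # marks that an SRID follows the type code
WKB_POINT = 1                # WKB geometry type code for a point


def point_to_ewkb_hex(x, y, srid):
    """Encode a 2D point as little-endian EWKB and return it as uppercase hex."""
    # Byte order marker (1 = little-endian), type code with SRID flag, SRID.
    header = struct.pack("<BII", 1, WKB_POINT | EWKB_SRID_FLAG, srid)
    # The two coordinates as 8-byte floats.
    coords = struct.pack("<dd", x, y)
    return (header + coords).hex().upper()


print(point_to_ewkb_hex(1.0, 2.0, 4326))
# → 0101000020E6100000000000000000F03F0000000000000040
```

The SRID in the header is the value that a step like 'get_srid_from_crs' would have to supply.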

I did some time profiling on the different parts (see GIST) and currently the performance is as follows:

With Pygeos WKB:

In [1]: data = gpd.read_file("https://gist.githubusercontent.com/HTenkanen/456ec4611a943955823a65729c9cf2aa/raw/be56f5e1e5c06c33cd51e89f823a7d770d8769b5/ykr_basegrid.geojson")
In [2]: engine = create_engine("postgresql+psycopg2://myuser:mypwd@localhost:5432/mydb")
In [3]: %timeit copy_to_postgis(data, engine, table='ykr_test', if_exists='replace')
717 ms ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And this is how long different parts take:

'get_srid_from_crs'  0.01 seconds
'get_geometry_type'  0.02 seconds
'convert_to_wkb'  0.13 seconds
'write_to_db'  0.55 seconds
'copy_to_postgis'  0.77 seconds   # In total

With Shapely WKB:

In [2]: %timeit copy_to_postgis(data, engine, table='ykr_test', if_exists='replace',  schema=None, dtype=None, index=True)
874 ms ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And this is how long different parts take:

'get_srid_from_crs'  0.01 seconds
'get_geometry_type'  0.02 seconds
'convert_to_wkb'  0.26 seconds
'write_to_db'  0.50 seconds
'copy_to_postgis'  0.86 seconds  # In total

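So as we can see, the WKB conversion is about twice as fast with Pygeos as with Shapely (0.13 s vs. 0.26 s), which brings the whole call from roughly 874 ms down to 717 ms. Most of the time now goes to actually writing the data into the database, which is expected.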
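Per-step timings like the ones listed above could be collected with a small decorator. This is a sketch under assumptions: the `timed` decorator and the `timings` dictionary are hypothetical names, and the decorated function is a placeholder rather than the real conversion step from the Gist.

```python
import time
from functools import wraps

timings = {}  # maps function name -> duration of its last call in seconds


def timed(func):
    """Record how long each call to the wrapped function takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__] = time.perf_counter() - start
    return wrapper


@timed
def convert_to_wkb(values):
    # Placeholder body standing in for the real geometry-to-WKB conversion.
    return [str(value).encode() for value in values]


convert_to_wkb(range(1000))
for name, seconds in timings.items():
    print(f"{name!r}  {seconds:.2f} seconds")
```

Because the decorator stores only the most recent call, it suits one-off profiling runs rather than averaged benchmarks such as %timeit.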

A couple of questions for @jorisvandenbossche:

I understand that the Pygeos functionality will eventually be integrated into Shapely. But I guess that might still take some time, so should we continue with this Pygeos approach, i.e. converting geometries from shapely to pygeos, and then those to WKB? Or should we stick with normal Shapely WKB dumps for now, as Pygeos would naturally bring a new dependency to Geopandas? What are the current thoughts about integrating Pygeos into Geopandas?
I guess the most logical place to add this to_postgis() functionality would be the ..geopandas/io/sql.py file, would you agree?
Any recommendations or ideas about how we should test this functionality? I see that for testing the reading from PostGIS, you have the create_postgis() function that populates a test_geopandas database. I guess we could take a similar approach here and test that populating the nybb data works with the to_postgis() function?


martinfleis commented, Jan 3, 2020

@HTenkanen Great job! A few notes from me.

pyproj.CRS will be used as GeoDataFrame.crs from the next release (#1101), so we will be able to clean up those conditions then.

I am fine with GeoAlchemy2; it is a pure Python package installable from PyPI. Recent GeoPandas is not available on defaults either. @jorisvandenbossche will be able to tell more about the channels support.

The plan was to use pygeos under the hood within geopandas anyway (#1155), but I am not sure what the current situation is after the decision to merge pygeos with shapely. I am not very keen to use the logic you implemented, but once the pygeos/shapely/geopandas relation is clearer we might come up with a simpler way.
