Improved reading and writing from/to PostGIS (SQL in general?) support

Currently we only have a very basic read_postgis function, and we certainly want a write function as well (https://github.com/geopandas/geopandas/issues/189). But, we currently also have some different open (overlapping) PRs and issues related to improving the IO support for PostGIS. Therefore I thought to open a new general issue to get some overview.

Open PRs:

Does anybody have insight into what the main differences are between the PostGIS PRs, and how we should proceed with them?

Some questions related to this that we might need to answer:

There is some relevant discussion in https://github.com/geopandas/geopandas/issues/161 as well.

Other related issues: https://github.com/geopandas/geopandas/issues/451 on adding SRID support in read_postgis

cc @jdmcbr @dimitri-justeau @showjackyang @adamboche @kuanb @emiliom @perrygeo @carsonfarmer

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 19 (18 by maintainers)

Top GitHub Comments

2 reactions
HTenkanen commented, Jan 2, 2020

Hi @Sangarshanan and thanks for reviving this and your contributions! 👍

I finally had time to get back to this. I went through your edits @Sangarshanan and incorporated them into this Gist: https://gist.github.com/HTenkanen/3b214be899f0d3885bad48577de48150

I left some of the earlier parts as they were, so that the function can automatically handle a mix of single and multi-geometries (e.g. a mix of Polygon and MultiPolygon).
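To make the single vs. multi-geometry handling concrete, here is a minimal sketch of how one target column type could be chosen from the geometry type names present in a table. The helper name `resolve_geometry_type` and its rules are assumptions for illustration only; this is not the code from the Gist.

```python
def resolve_geometry_type(type_names):
    """Pick one PostGIS column type for a collection of geometry type names.

    Hypothetical helper for illustration; not the function from the Gist.
    """
    # Ignore missing geometries and normalise the case of the names.
    names = {name.upper() for name in type_names if name}
    if not names:
        return "GEOMETRY"

    # Strip a leading MULTI to find the underlying family (POINT, POLYGON, ...).
    families = {
        name[len("MULTI"):] if name.startswith("MULTI") else name
        for name in names
    }
    if len(families) > 1:
        # Different families (e.g. points and polygons) need the generic type.
        return "GEOMETRY"

    if len(names) == 1:
        # Only one type present: use it unchanged.
        return names.pop()

    # Same family in single and multi form: use the multi variant.
    return "MULTI" + families.pop()


print(resolve_geometry_type(["Polygon", "MultiPolygon", "Polygon"]))
```

With a rule like this, any single-part rows would presumably still have to be promoted to their multi-part form before writing, so that every row matches the column type.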

@jorisvandenbossche: I also updated the CRS reading to use the new pyproj CRS class, so it should now work quite nicely with different types of CRS information. In addition, I tested swapping from shapely.wkb to pygeos.wkb, as it also improved performance.
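As a rough illustration of what the WKB step produces and why the SRID matters, the sketch below builds the hex EWKB for a single 2D point by hand with the standard library. The function name `point_to_ewkb_hex` is made up for this example; in the Gist the conversion is done by shapely or pygeos, not by code like this.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # bit PostGIS sets when an SRID follows the type code
WKB_POINT = 1                # WKB geometry type code for a point


def point_to_ewkb_hex(x, y, srid):
    """Encode a 2D point as little-endian hex EWKB with an embedded SRID.

    Illustrative only; real code would let a geometry library do this.
    """
    payload = struct.pack(
        "<BIIdd",
        1,                           # byte order marker: 1 = little endian
        WKB_POINT | EWKB_SRID_FLAG,  # geometry type with the SRID flag set
        srid,                        # spatial reference identifier
        float(x),
        float(y),
    )
    return payload.hex().upper()


print(point_to_ewkb_hex(1.0, 2.0, 4326))
# 0101000020E6100000000000000000F03F0000000000000040
```

Because the SRID travels inside each encoded geometry, the SRID derived from the CRS has to be known before the geometries are converted.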

I did some time profiling of the different parts (see the Gist), and currently the performance is as follows:

With Pygeos WKB:

In [1]: data = gpd.read_file("https://gist.githubusercontent.com/HTenkanen/456ec4611a943955823a65729c9cf2aa/raw/be56f5e1e5c06c33cd51e89f823a7d770d8769b5/ykr_basegrid.geojson")
In [2]: engine = create_engine("postgresql+psycopg2://myuser:mypwd@localhost:5432/mydb")
In [3]: %timeit copy_to_postgis(data, engine, table='ykr_test', if_exists='replace')
717 ms ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And this is how long different parts take:

'get_srid_from_crs'  0.01 seconds
'get_geometry_type'  0.02 seconds
'convert_to_wkb'  0.13 seconds
'write_to_db'  0.55 seconds
'copy_to_postgis'  0.77 seconds   # In total

With Shapely WKB:

In [2]: %timeit copy_to_postgis(data, engine, table='ykr_test', if_exists='replace',  schema=None, dtype=None, index=True)
874 ms ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And this is how long different parts take:

'get_srid_from_crs'  0.01 seconds
'get_geometry_type'  0.02 seconds
'convert_to_wkb'  0.26 seconds
'write_to_db'  0.50 seconds
'copy_to_postgis'  0.86 seconds  # In total

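So as we can see, using Pygeos instead of Shapely makes the WKB conversion about twice as fast (0.13 s vs 0.26 s), while the whole call drops from roughly 874 ms to 717 ms. Most of the time now goes to actually writing the data into the database, which is expected.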
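Per-step timings like the ones listed above can be collected with a small timing decorator. The sketch below is one possible way to do it using only the standard library; the `timed` decorator and the placeholder `convert_to_wkb` body are assumptions for illustration and are not taken from the Gist.

```python
import time
from functools import wraps

timings = {}  # step name -> seconds spent in the most recent call


def timed(func):
    """Record and print how long each call to the wrapped function takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            timings[func.__name__] = elapsed
            print(f"{func.__name__!r}  {elapsed:.2f} seconds")
    return wrapper


@timed
def convert_to_wkb(values):
    # Placeholder body: stands in for the real geometry-to-WKB conversion.
    return [str(value).encode() for value in values]


result = convert_to_wkb([1, 2])
```

Decorating each step separately makes it easy to see which one dominates the total, which is how the database write stands out in the numbers above.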

A couple of questions for @jorisvandenbossche:

  • As I understand it, Pygeos will eventually be integrated into Shapely, but I guess that might still take some time. Should we continue with this Pygeos approach, i.e. converting geometries from Shapely to Pygeos and then to WKB? Or should we stick with the normal Shapely WKB dumps for now, since Pygeos would bring a new dependency to GeoPandas? What are the current thoughts about integrating Pygeos into GeoPandas?

  • I guess the most logical place to add this to_postgis() functionality would be the geopandas/io/sql.py file. Would you agree?

  • Any recommendations or ideas about how we should test this functionality? I see that for testing reading from PostGIS, you have the create_postgis() function that populates a test_geopandas database. I guess we could take a similar approach here and test that populating the nybb data works with the to_postgis() function?

1 reaction
martinfleis commented, Jan 3, 2020

@HTenkanen Great job! A few notes from me.

pyproj.CRS will be used as GeoDataFrame.crs from the next release (#1101), so we will be able to clean up those conditions then.

I am fine with GeoAlchemy2; it is a pure Python package installable from PyPI. Recent GeoPandas is not available on defaults either. @jorisvandenbossche will be able to tell more about the channels support.

The plan was to use pygeos under the hood within geopandas anyway (#1155), but I am not sure what the current situation is after the decision to merge pygeos with shapely. I am not very keen to use the logic you implemented, but once this pygeos/shapely/geopandas relation is clearer, we might come up with a simpler way.
