Writing performance
There has been a previous issue about the slow writing of GeoPackage files: https://github.com/Toblerity/Fiona/issues/476. Prompted by https://gis.stackexchange.com/questions/302811/how-to-get-fast-writing-with-geopandas-fiona, I looked into it again with the latest versions of GeoPandas and Fiona, and writing still seems relatively slow.
Performance has already improved a lot compared with previous versions of both GeoPandas and Fiona. Of the remaining time, GeoPandas takes the largest share (which I will try to fix, cf. https://github.com/geopandas/geopandas/issues/863). But even then, writing a file with 100k rows and 5 attribute columns takes about 10 s with Fiona.
Sample set-up:
import numpy as np
import pandas as pd
import geopandas
import fiona
import shapely.geometry
import random
import string

N = 100000
# 100k point features with three float, one integer and one string attribute
df = geopandas.GeoDataFrame(
    {'a': np.random.randn(N), 'b': np.random.randn(N),
     'c': np.random.randn(N), 'd': np.random.randint(100, size=N),
     'e': [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5)) for _ in range(N)],
     'geometry': [shapely.geometry.Point(random.random(), random.random()) for _ in range(N)]})
# Convert to GeoJSON-like feature dicts up front, so only the Fiona write is measured later
records = list(df.iterfeatures())
schema = geopandas.io.file.infer_schema(df)
with fiona.Env():
    with fiona.open("test_geopackage.gpkg", 'w', driver="GPKG", schema=schema) as colxn:
        colxn.writerecords(records)
Timing only the fiona-writing part (using Fiona 1.8.1, GDAL 2.3 with Python 3.6 on Ubuntu 16.04, installed with conda-forge):
In [37]: %%timeit
    ...: with fiona.Env():
    ...:     with fiona.open("test_geopackage_profile2.gpkg", 'w', driver="GPKG", schema=schema) as colxn:
    ...:         colxn.writerecords(records)
11.4 s ± 284 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
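For anyone who wants to repeat this kind of measurement outside IPython, here is a minimal sketch using only the standard library. The helper name `time_call` is made up for illustration, and the workload timed here is a stand-in, not the Fiona write itself.

```python
import statistics
import time


def time_call(func, repeat=7):
    """Run func `repeat` times and return (mean, standard deviation) in seconds."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Stand-in workload; replace it with a function that performs the write
# being measured (for example one wrapping the fiona.open(...) block above).
mean, stdev = time_call(lambda: sorted(range(100_000), reverse=True), repeat=7)
print(f"{mean:.4f} s ± {stdev:.4f} s per loop (7 runs)")
```

Wrapping the write in a function and passing it to the helper keeps setup, such as building `records` and `schema`, out of the timed region.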
Full example notebook exploring the performance of writing that dataframe to GPKG: http://nbviewer.jupyter.org/gist/jorisvandenbossche/c8590a3617698befad527e66eefb7f5b
For comparison, writing the same dataset with QGIS takes only a couple of seconds, and reading / writing it with ogr2ogr also takes less time (around 5 s for both reading and writing it).
So my main question is: are there still ways to improve this in Fiona?
I suppose that part of it is inherent to the design, i.e. the fact that we have the data in Python objects and need to convert them to OGR objects. Possibilities that I was thinking of exploring further: would (optionally) turning off some validation steps make a difference? Would using WKB instead of the mapping as the intermediate geometry object make a difference? But maybe you already know the answer to those questions, or see other possibilities.
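To make the WKB idea concrete, here is a small sketch of the two representations of a single point, using only the standard library. The helper names are made up for illustration. It only shows what the data looks like and does not measure whether passing WKB to Fiona would be faster.

```python
import struct


def point_to_wkb(x, y):
    # WKB Point, little-endian: byte order flag (1), geometry type (1 = Point),
    # then x and y as 8-byte doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data):
    byte_order, geom_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are handled in this sketch")
    return x, y


# GeoJSON-like mapping: nested Python objects that have to be walked
# one by one to build the OGR geometry.
mapping = {"type": "Point", "coordinates": (1.0, 2.0)}

# WKB: one flat bytes object holding the same point.
wkb = point_to_wkb(*mapping["coordinates"])
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

The point here is that WKB is a single bytes object that a C library can parse in one call, whereas the mapping is a tree of Python objects.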
Issue Analytics
- Created 5 years ago
- Comments: 10 (8 by maintainers)
Top GitHub Comments
@jorisvandenbossche I took advice from https://julien.danjou.info/guide-to-python-profiling-cprofile-concrete-case-carbonara/ to install pyprof2calltree and kcachegrind. These are quite handy. I’ve used them to profile and make a call graph of the following script.
This opens the test file, reads its features, duplicates them 2000 times, and writes them to a new GeoPackage file. Since our transaction size is 20,000, there are 7 transactions involved. According to my analysis, committing the transactions costs us very little: we spend less than 1% of the time on it.
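The transaction count can be checked with a bit of arithmetic. The comment does not say how many features the test file holds, so the figure of 67 below is an assumption, chosen because it is consistent with the 7 transactions mentioned. The `chunked` helper is likewise illustrative and not Fiona's code.

```python
import math

FEATURES_IN_SOURCE = 67    # assumption: not stated in the comment
COPIES = 2000              # from the comment: features duplicated 2000 times
TRANSACTION_SIZE = 20_000  # from the comment


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


total_features = FEATURES_IN_SOURCE * COPIES
transactions = math.ceil(total_features / TRANSACTION_SIZE)
batch_sizes = [len(batch) for batch in chunked(range(total_features), TRANSACTION_SIZE)]

print(total_features, transactions)
print(batch_sizes)
```

Under that assumption there are six full batches of 20,000 features plus one partial batch, which gives the 7 transactions.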
Apologies for the poorly cropped image.
Here’s a look near writerecords.
We spend 85% of the time writing features out. 32% is spent constructing OGR geometry objects, and some 11% of the time in there is somewhat wasted on debug log messages. The geometry builder is one place to look for improvements.
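For readers who want to produce this kind of breakdown themselves, here is a minimal sketch of the cProfile workflow. The two functions are hypothetical stand-ins for a write loop and a geometry builder; they are not Fiona code, and the numbers they produce say nothing about Fiona.

```python
import cProfile
import io
import os
import pstats
import tempfile


def build_geometry(i):
    # Stand-in for per-feature geometry construction (hypothetical workload).
    return {"type": "Point", "coordinates": (i * 0.5, i * 0.25)}


def write_records(n):
    # Stand-in for a writerecords-style loop; it only builds objects in memory.
    return [build_geometry(i) for i in range(n)]


profiler = cProfile.Profile()
profiler.enable()
write_records(50_000)
profiler.disable()

# Save the raw stats to a file; converters such as pyprof2calltree take
# a file like this as input to produce a call graph for kcachegrind.
profile_path = os.path.join(tempfile.gettempdir(), "write_profile.prof")
profiler.dump_stats(profile_path)

# Text report of the entries with the largest cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The cumulative-time column is what percentages like the ones above are derived from: each function's cumulative time divided by the total.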
If the feature source provided OGR objects and we skipped GeoJSON deserialization (hypothetical!) the geometry builder would not be needed and ~20% could be eliminated.
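As a rough bound on what such a saving would buy overall: removing a fraction p of the runtime entirely gives a speedup of at most 1 / (1 - p). A minimal sketch, with a function name made up for illustration:

```python
def overall_speedup(fraction_removed):
    """Upper bound on the overall speedup when a fraction of the runtime disappears."""
    if not 0 <= fraction_removed < 1:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


# ~20% eliminated by skipping the geometry builder (figure from the comment above)
print(f"{overall_speedup(0.20):.2f}x")

# For comparison: if the full 32% spent constructing OGR geometries vanished
print(f"{overall_speedup(0.32):.2f}x")
```

So even removing the geometry builder completely would give a speedup of about 1.25x, which is why the larger, more opaque share of the time is the more interesting target.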
There’s another 50% of the overall cost in writerecs that is more opaque. This is likely the best code in which to look for improvements. A different tool might be needed because we can’t see into OGR from cProfile. Wrapping OGR_L_CreateFeature (https://github.com/Toblerity/Fiona/blob/master/fiona/ogrext.pyx#L1179) up in a Cython function will give us a little more data, too.
I’m going to close this one. I’m seeing a ~2x speedup between 1.8.0 and 1.9a2. Further speedups should be part of the 2.0.0 work.