Writing performance

A previous issue already covered the slow writing of GeoPackage files: https://github.com/Toblerity/Fiona/issues/476. Prompted by https://gis.stackexchange.com/questions/302811/how-to-get-fast-writing-with-geopandas-fiona, I looked into it again with the latest versions of GeoPandas and Fiona, and writing still seems relatively slow.

Performance has already improved a lot compared with previous versions of both GeoPandas and Fiona. Of the remaining time, GeoPandas takes the largest share (which I will try to fix, see https://github.com/geopandas/geopandas/issues/863). But even then, writing a file with 100k rows and 5 attribute columns takes about 10 s with Fiona.

Sample set-up:

import numpy as np  # needed for the np.random calls below
import pandas as pd
import geopandas
import fiona
import shapely.geometry

import random
import string

N = 100000

df = geopandas.GeoDataFrame(
    {'a': np.random.randn(N), 'b': np.random.randn(N),
     'c': np.random.randn(N), 'd': np.random.randint(100, size=N),
     'e': [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5)) for _ in range(N)],
     'geometry': [shapely.geometry.Point(random.random(), random.random()) for _ in range(N)]})

records = list(df.iterfeatures())
schema = geopandas.io.file.infer_schema(df)

with fiona.Env():
    with fiona.open("test_geopackage.gpkg", 'w', driver="GPKG", schema=schema) as colxn:
        colxn.writerecords(records)

Timing only the fiona-writing part (using Fiona 1.8.1, GDAL 2.3 with Python 3.6 on Ubuntu 16.04, installed with conda-forge):

In [37]: %%timeit
    ...: with fiona.Env():
    ...:     with fiona.open("test_geopackage_profile2.gpkg", 'w', driver="GPKG", schema=schema) as colxn:
    ...:         colxn.writerecords(records)
11.4 s ± 284 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Full example notebook exploring the performance of writing that dataframe to GPKG: http://nbviewer.jupyter.org/gist/jorisvandenbossche/c8590a3617698befad527e66eefb7f5b

For comparison, writing the same dataset with QGIS takes only a couple of seconds, and reading / writing it with ogr2ogr also takes less time (around 5s for both reading and writing it).

So my main question: are there still ways to improve this in Fiona?

I suppose part of it is inherent to the design: the data lives in Python objects and has to be converted to OGR objects. Possibilities I was thinking of exploring further: would (optionally) turning off some validation steps make a difference? Would using WKB instead of the mapping as the intermediate geometry object make a difference? But maybe you already know the answers to those questions, or see other possibilities.
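
To make the WKB question concrete, here is a minimal stdlib-only sketch of the two intermediate representations for a single 2D point. The helper names (`point_to_mapping`, `point_to_wkb`, `wkb_to_point`) are hypothetical and are not part of Fiona, GeoPandas or Shapely; the byte layout is the standard little-endian WKB encoding of a point. Whether passing WKB through would actually be faster in Fiona is exactly the open question, so this only shows what would be handed over.

```python
import struct

def point_to_mapping(x, y):
    """GeoJSON-like mapping: a dict holding a tuple of Python floats."""
    return {"type": "Point", "coordinates": (x, y)}

def point_to_wkb(x, y):
    """Little-endian WKB for a 2D point.

    Layout: 1 byte byte-order flag (1 = little-endian),
    4 bytes geometry type (1 = Point), then x and y as 8-byte doubles.
    """
    return struct.pack("<BIdd", 1, 1, x, y)

def wkb_to_point(data):
    """Decode the 21-byte point WKB produced by point_to_wkb."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian 2D points are handled in this sketch")
    return (x, y)

mapping = point_to_mapping(1.0, 2.0)
wkb = point_to_wkb(1.0, 2.0)
```

The mapping has to be walked key by key and coordinate by coordinate on the receiving side, while the WKB form is one flat bytes object that a C library could parse in a single call.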

sgillies commented, Nov 18, 2018

@jorisvandenbossche I took advice from https://julien.danjou.info/guide-to-python-profiling-cprofile-concrete-case-carbonara/ to install pyprof2calltree and kcachegrind. These are quite handy. I’ve used them to profile and make a call graph of the following script.


from itertools import chain, repeat
import fiona

with fiona.Env():

    with fiona.open("tests/data/coutwildrnp.shp") as collection:
        features = chain.from_iterable(repeat(list(collection), 2000))

        with fiona.open("/tmp/out.gpkg", "w", schema=collection.schema, crs=collection.crs, driver="GPKG") as dst:
            dst.writerecords(features)

This opens the test file, reads its features, repeats them 2,000 times, and writes them to a new GeoPackage file. Since our transaction size is 20,000, there are 7 transactions involved. According to my analysis, committing the transactions costs us very little. We spend less than 1% of the time committing transactions.
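
The transaction count can be checked with a little arithmetic. The sketch below assumes the test shapefile holds 67 features (a figure not stated in this comment) and uses a hypothetical `batches` helper to mimic splitting a record stream into fixed-size transactions. It is not Fiona's implementation.

```python
import math
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` records (hypothetical helper)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

FEATURES_IN_SOURCE = 67    # assumption: feature count of the test shapefile
REPEATS = 2000             # from the script above
TRANSACTION_SIZE = 20000   # transaction size quoted in the comment

total = FEATURES_IN_SOURCE * REPEATS
n_transactions = math.ceil(total / TRANSACTION_SIZE)

# Stand-in records: integers instead of real features.
sizes = [len(b) for b in batches(range(total), TRANSACTION_SIZE)]
```

Under that assumption there are six full transactions plus one partial one, which matches the 7 mentioned.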

[Image: gpkg_commit call graph]

Apologies for the poorly cropped image.

Here’s a look near writerecords.

[Image: gpkg_write call graph]

We spend 85% of the time writing features out. 32% is spent constructing OGR geometry objects and some 11% of the time in there is somewhat wasted on debug log messages. The geometry builder is one place to look for improvements.
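
As a general illustration of how debug logging can cost time even when nothing is printed, here is a minimal sketch. The logger name, `describe`, `build_unguarded` and `build_guarded` are all hypothetical stand-ins, not Fiona's geometry builder. It shows that arguments to `log.debug` are still evaluated when the DEBUG level is disabled, unless the call is guarded.

```python
import logging

log = logging.getLogger("geometry_builder_sketch")  # hypothetical logger name
log.setLevel(logging.INFO)                          # DEBUG messages are disabled

stats = {"describe_calls": 0}

def describe(coords):
    """Stand-in for an expensive, debug-only summary of a geometry."""
    stats["describe_calls"] += 1
    return ", ".join(f"{x:.3f} {y:.3f}" for x, y in coords)

def build_unguarded(coords):
    # describe() runs on every call, even though the message is dropped.
    log.debug("Building geometry: %s", describe(coords))
    return len(coords)

def build_guarded(coords):
    # The summary is only computed when DEBUG output is actually enabled.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("Building geometry: %s", describe(coords))
    return len(coords)

coords = [(0.0, 0.0), (1.0, 1.0)]
build_unguarded(coords)
calls_after_unguarded = stats["describe_calls"]
build_guarded(coords)
calls_after_guarded = stats["describe_calls"]
```

In a loop over 100k features, the unguarded variant pays for the summary 100k times with nothing to show for it.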

If the feature source provided OGR objects and we skipped GeoJSON deserialization (hypothetical!) the geometry builder would not be needed and ~20% could be eliminated.

There’s another 50% of the overall cost in writerecs that is more opaque. This is likely the best code in which to look for improvements. A different tool might be needed because we can’t see into OGR from cProfile. Wrapping OGR_L_CreateFeature (https://github.com/Toblerity/Fiona/blob/master/fiona/ogrext.pyx#L1179) up in a Cython function will give us a little more data, too.
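
For readers who want to reproduce this kind of breakdown, here is a minimal cProfile sketch using only the standard library. `build_geometry` and `write_records` are hypothetical stand-ins for the real work, not Fiona functions. The point is that each Python-level function gets its own row in the report, which is why wrapping a C call in its own function makes its cost visible.

```python
import cProfile
import io
import pstats

def build_geometry(i):
    """Stand-in for converting one record's geometry."""
    return {"type": "Point", "coordinates": (i * 0.5, i * 0.25)}

def write_records(n):
    """Stand-in for the write loop; calls build_geometry once per record."""
    out = []
    for i in range(n):
        out.append(build_geometry(i))
    return len(out)

profiler = cProfile.Profile()
profiler.enable()
write_records(10000)
profiler.disable()

# Render the five most expensive entries by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The profiler's `dump_stats` method can write the same data to a file, which is the kind of input a call-graph converter works from.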

sgillies commented, Jun 2, 2022

I’m going to close this one. I’m seeing a ~2x speedup between 1.8.0 and 1.9a2. Further speedups should be part of the 2.0.0 work.
