ENH: Build line topology in dataframe for simple integration with NetworkX (+ potentially other graph systems)

Hello there

I’ve been using geopandas for some time now, as well as the NetworkX library for doing graph operations. NetworkX has for a while now had its own nx.read_shp function for reading shapefiles and creating graphs out of them, but not much geospatial functionality beyond that. It is also based on raw GDAL, which is apparently a bit of a pain for them to maintain dependencies for.

They also have a function called nx.from_pandas_edgelist, which can build a graph out of a pandas DataFrame so long as it has columns indicating the source and target of each edge. I made a pull request to their repo with some code which can build a graph from a geopandas GeoDataFrame, so long as it only has LineString geometries. In the end, however, we realized that since gpd.GeoDataFrame inherits from pd.DataFrame, it is already technically compatible with nx.from_pandas_edgelist; it just needs to have the source and target columns pre-computed.

Some maintainers of NetworkX have agreed that shedding responsibility for handling geospatial I/O would be a relief to their project, and I think that geopandas is in a good position to fill this void, given that it can already read & write just about every GIS format and is already compatible with their support for pandas DataFrames.

So I propose that I add a function to [geodataframe.py](https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py) called make_topology, which would construct source and target columns with unique ids where the nodes ought to be. Of course, the column names would be customizable and a spatial tolerance would be a parameter. Furthermore, it may also be interesting to have a way to extract a point-based “nodes” GeoDataFrame for visualization of the identified nodes, but I’m not sure what the best way would be to do that. So the signature would look something like:

gdf.make_topology(source_col="source", target_col="target", precision=0.005, inplace=False)
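One way to read the precision parameter is as a grid size: endpoints that fall in the same grid cell are treated as the same node. The sketch below only illustrates that reading in plain Python. The helper name snap_key is hypothetical and is not part of geopandas or NetworkX.

```python
def snap_key(coord, precision=0.005):
    """Map a coordinate tuple to the integer grid cell it falls in.

    Hypothetical helper: two points get the same key when they round to
    the same multiple of `precision` on every axis.
    """
    return tuple(round(value / precision) for value in coord)


a = snap_key((10.0, 20.0))
b = snap_key((10.001, 19.999))  # within tolerance of `a` on both axes
c = snap_key((10.02, 20.0))     # about four grid cells away in x
```

Here a and b end up with the same key while c does not. A known limitation of grid snapping is that two points very close together can still straddle a cell boundary and receive different keys, so a real implementation might need something more careful.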

I didn’t plan to do any topological cleaning or modification of geometry (i.e. assuming that the network is already clean), although maybe that would be an interesting set of new functions for the future.

I think this functionality would be quite useful for topological analysis of geographic networks in general, as the creation of the source & target columns would effectively make the GeoDataFrame ready for integration into NetworkX or indeed any other library or system which wants topological edge representations rather than geometric representations of features.
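As a rough illustration of why the two columns are enough, the plain-Python sketch below builds an adjacency mapping from nothing but source and target ids. The edge records and the function name build_adjacency are made up for this example. A graph library would consume the same two columns in a similar spirit.

```python
from collections import defaultdict

# Hypothetical edge records, as they might look once source/target ids exist.
edges = [
    {"source": 0, "target": 1, "name": "Main St"},
    {"source": 1, "target": 2, "name": "High St"},
    {"source": 1, "target": 3, "name": "Mill Ln"},
]


def build_adjacency(edge_rows, source_col="source", target_col="target"):
    """Return an undirected adjacency mapping of node id -> set of neighbor ids."""
    adjacency = defaultdict(set)
    for row in edge_rows:
        u, v = row[source_col], row[target_col]
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


adjacency = build_adjacency(edges)
```

No geometry is consulted at any point; once the ids are computed, the network is purely topological.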

Here is the main bit of code which I had made for the original proposal for integration into NetworkX before we considered putting it in geopandas instead. It’s not all that complex: basically, we just look at the first & last points of every LineString geometry and check whether they are unique (within the specified precision). Note that this is not the final proposal, just a quickly-extracted and slightly modified version of what I had originally proposed in the NetworkX codebase:


def make_topology(gdf, source_col="source", target_col="target", precision=0.001, geometry="geometry"):

    # Determine number of dimensions of geometry by checking the first row (i.e. 2D or 3D lines?)
    dims = range(len(gdf[geometry].iloc[0].coords[0]))

    # Find all unique start & end points and assign them an id.
    # Coordinates are snapped to a grid of size `precision`, because
    # round(x, ndigits) only accepts an integer number of digits.
    gdf["source_coords"] = gdf[geometry].apply(
        lambda geom: tuple(round(geom.coords[0][i] / precision) for i in dims)
    )
    gdf["target_coords"] = gdf[geometry].apply(
        lambda geom: tuple(round(geom.coords[-1][i] / precision) for i in dims)
    )
    node_ids = {}
    i = 0
    for row in gdf.itertuples(index=False):
        node_1 = row.source_coords
        node_2 = row.target_coords
        if node_1 not in node_ids:
            node_ids[node_1] = i
            i += 1
        if node_2 not in node_ids:
            node_ids[node_2] = i
            i += 1

    # Assign the unique id to each
    gdf[source_col] = gdf["source_coords"].apply(lambda x: node_ids[x])
    gdf[target_col] = gdf["target_coords"].apply(lambda x: node_ids[x])

    gdf.drop(
        ["source_coords", "target_coords"],
        axis="columns",
        inplace=True,
    )
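The proposal also mentions extracting a point-based “nodes” table. One possible approach, sketched here in plain Python with hypothetical names (assign_node_ids, node_table), is to keep the coordinate-to-id mapping built in the loop above and invert it. The endpoint tuples stand in for the values held in the temporary source_coords and target_coords columns. Turning the rows into an actual GeoDataFrame of points is left out.

```python
def assign_node_ids(endpoint_pairs):
    """Give each distinct endpoint a sequential id, in order of first appearance."""
    node_ids = {}
    for start, end in endpoint_pairs:
        for coord in (start, end):
            if coord not in node_ids:
                node_ids[coord] = len(node_ids)
    return node_ids


def node_table(node_ids):
    """Invert the mapping into (id, coordinate) rows sorted by id."""
    return sorted((node_id, coord) for coord, node_id in node_ids.items())


# Three line segments described only by their endpoint keys.
pairs = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 0), (2, 0))]
ids = assign_node_ids(pairs)
edge_list = [(ids[start], ids[end]) for start, end in pairs]
nodes = node_table(ids)
```

Each row of nodes pairs an id with the coordinate key it was assigned to, which would be enough to build point geometries for plotting the identified nodes.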

Top GitHub Comments

2 reactions

ljwolf commented, Sep 8, 2020

Building topologies from geographic data stored in GeoDataFrames is indeed of great interest to many. As @martinfleis mentions above, the pysal library has also focused on providing these kinds of fast algorithms for graph/topology construction with no GEOS dependency. In econometrics, though, these are called “spatial weights” rather than “graphs,” for historical reasons. We cover many kinds of distance-based and contiguity-based weights. I asked about these interfaces to networkx a while ago, and that stalled.

So, we went ahead and did it in libpysal. We have converters to_networkx() and from_networkx(), as well as to_adjlist() and from_adjlist(), and access to the underlying sparse matrix representation.

But… it’d be great if we could have a single library where this kind of graph building is fast and performant… ours usually is fast because “sharing a vertex” is slightly different from “touches at a point,” but building on top of pygeos might open new performance benefits.
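To make the “sharing a vertex” idea concrete, here is a rough plain-Python sketch. It is not libpysal’s implementation, and the function name shared_vertex_neighbors and the toy data are made up. It indexes geometries by their vertices, so neighbors are found by hashing rather than by pairwise geometric tests, and a line that merely passes through another line’s endpoint is not linked to it.

```python
from collections import defaultdict
from itertools import combinations


def shared_vertex_neighbors(geometries):
    """Link geometries (id -> list of vertex tuples) that share any vertex."""
    by_vertex = defaultdict(set)
    for geom_id, vertices in geometries.items():
        for vertex in vertices:
            by_vertex[vertex].add(geom_id)

    neighbors = {geom_id: set() for geom_id in geometries}
    for geom_ids in by_vertex.values():
        for a, b in combinations(sorted(geom_ids), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


lines = {
    "a": [(0, 0), (1, 1)],
    "b": [(1, 1), (2, 0)],  # shares the vertex (1, 1) with "a"
    "c": [(0, 1), (2, 1)],  # passes through (1, 1) but has no vertex there
}
neighbors = shared_vertex_neighbors(lines)
```

In this toy data, "a" and "b" are neighbors, while "c" has none even though it geometrically touches both of them at (1, 1).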

Regardless, coordinating to build a single topology builder that has a generic builder for things with a __geo_interface__, and possibly faster implementations for specific input types, would be really great, and I think it would consolidate a ton of duplicated effort in the ecosystem.

For those not familiar with pysal, an example task might be:

import geopandas, osmnx
from pysal.lib import weights

graph = osmnx.graph_from_place("Bristol, UK")
nodes, edges = osmnx.graph_to_gdfs(graph)

edges is now a GeoDataFrame of line segments, as if read in from a file. In our lexicon, we’d use “queen contiguity” to refer to neighboring geometries that share at least one vertex:

w = weights.Queen.from_dataframe(edges)
w.to_adjlist() # contains head id, tail id, link weight

w.to_networkx() # contains the NetworkX Graph/DiGraph

w.sparse # contains the scipy.sparse adjacency matrix

This works for polygons, too:

boros = geopandas.read_file(geopandas.datasets.get_path("nybb"))
weights.Rook.from_dataframe(boros)

and, for distance-based graphs, we have a similar API:

weights.DistanceBand.from_dataframe(nodes, threshold=.1)
weights.KNN.from_dataframe(nodes, k=10)
weights.Kernel.from_dataframe(nodes, function='gaussian')
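As a rough illustration of what a distance-band graph means (not of how libpysal computes it), the plain-Python sketch below links every pair of points that lie within a threshold distance of each other. The function name distance_band and the sample points are hypothetical, and the brute-force pairwise loop is only meant for small inputs.

```python
from itertools import combinations
from math import dist


def distance_band(points, threshold):
    """Return id -> set of ids for points within `threshold` of each other."""
    neighbors = {point_id: set() for point_id in points}
    for (id_a, xy_a), (id_b, xy_b) in combinations(points.items(), 2):
        if dist(xy_a, xy_b) <= threshold:
            neighbors[id_a].add(id_b)
            neighbors[id_b].add(id_a)
    return neighbors


points = {0: (0.0, 0.0), 1: (0.05, 0.0), 2: (1.0, 1.0)}
band = distance_band(points, threshold=0.1)
```

Points 0 and 1 are 0.05 apart and become neighbors. Point 2 is too far from both and stays isolated.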

1 reaction

gboeing commented, Aug 31, 2020

It may be of interest how OSMnx handles NetworkX MultiDiGraph to/from Geopandas GeoDataFrame conversions with its graph_to_gdfs and graph_from_gdfs functions.

The package has undergone a major renovation over the summer and has shifted somewhat away from its original mission of being a pure OpenStreetMap -> NetworkX graph utility, towards more flexibly working with OSM spatial networks in NetworkX and OSM non-networked geometries in Geopandas (particularly with the new geometries module).
