DataCatalog can't read in shapefile

Description

I can’t read in shapefiles (either the shapefile folder or the .shp file) when I specify them in conf/base/catalog.yml. If I import GeoJSONDataSet in a Python script, it works fine. Is it preferable to update the current implementation of GeoJSONDataSet or to create a new extras dataset class for shapefiles?

Context

I have a shapefile folder containing .shp, .shx, .dbf, etc. files. I can feed the folder path to geopandas or GeoJSONDataSet to read it in as a geojson dataframe, but when I specify the path in the data catalog and run the node, it gives me an error.

Steps to Reproduce

This is what works:

from kedro.extras.datasets.geopandas import GeoJSONDataSet
data = GeoJSONDataSet("sample_shape_file")
data
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323e0dd0>

data_2 = GeoJSONDataSet("sample_shape_file/sample_shape_file.shp")
data_2
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323f0950>

Here’s what I have in my catalog.yml

shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file
  save_args:
    driver: GeoJSON

Here’s the node in my pipeline:

node(
    func=do_something,
    inputs=[
        "shape_file_data",
        "params:location",
    ],
    outputs="filtered_shape_file_data",
    name="do_something",
),

When I run kedro run --node=do_something, I get this error:

2021-02-16 14:24:24,394 - root - INFO - ** Kedro project project_name
2021-02-16 14:24:24,434 - kedro.io.data_catalog - INFO - Loading data from `shape_file_data` (GeoJSONDataSet)...
2021-02-16 14:24:24,434 - kedro.runner.sequential_runner - WARNING - There are 1 nodes that have not run.
You can resume the pipeline run by adding the following argument to your previous command:

2021-02-16 14:24:24,442 - kedro.framework.session.store - INFO - `save()` not implemented for `BaseSessionStore`. Skipping the step.
Traceback (most recent call last):
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 208, in load
    return self._load()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/extras/datasets/geopandas/geojson_dataset.py", line 149, in _load
    with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
    **kwargs
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
    self._open()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
    self.f = open(self.path, mode=self.mode)
IsADirectoryError: [Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/duongvu/.pyenv/versions/env_name/bin/kedro", line 10, in <module>
    sys.exit(main())
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 696, in main
    cli_collection(**cli_context)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/duongvu/project_name/cli.py", line 212, in run
    pipeline_name=pipeline,
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/session/session.py", line 414, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 100, in run
    self._run(pipeline, catalog, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, catalog, self._is_async, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 212, in run_node
    node = _run_node_sequential(node, catalog, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 285, in _run_node_sequential
    inputs[name] = catalog.load(name)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 402, in load
    result = func()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 611, in load
    return super().load()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 217, in load
    raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(filepath=/Users/duongvu/project_name/data/01_raw/shape_file_data, load_args={}, protocol=file, save_args={'driver': GeoJSON}).
[Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'
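
The first traceback ends in Python's built-in `open()` being called on a directory. A minimal sketch that reproduces that failure with only the standard library follows. The temporary directory stands in for the shapefile folder, and `try_open` is a hypothetical helper written for this illustration. On POSIX systems such as macOS and Linux the exception is expected to be `IsADirectoryError`; other platforms may raise a different `OSError` subclass.

```python
import tempfile


def try_open(path: str) -> str:
    """Return the name of the exception raised by open(path), or "ok"."""
    try:
        with open(path, "rb"):
            return "ok"
    except OSError as exc:
        return type(exc).__name__


# A temporary directory plays the role of the shapefile folder.
with tempfile.TemporaryDirectory() as shape_dir:
    result = try_open(shape_dir)

print(result)
```

This suggests the folder path fails because the dataset tries to open it as a single file before geopandas ever sees it.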

If I specify the .shp file inside like this:

shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file/sample_shape_file.shp
  save_args:
    driver: GeoJSON

I get a different error:

kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(...) not recognized as a supported file format.

I tried different save_args values, from “ESRI Shapefile” to None to “GeoJSON”. None of them work.

My Environment

Kedro version used: 0.17.0
Python version used: 3.7.7
Operating system and version: MacOS

Top GitHub Comments

Xabitsuki commented, Nov 19, 2021

According to the Geopandas documentation, there should be an easy solution:

Instead of passing a path to the .shp file:
Zip all the files together (inside a folder): .shp, .shx, .dbf, .prj etc.
Use the zip path instead of the .shp file path.

Their example:

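import fsspec
import geopandas

# "simplecache::" makes fsspec download the remote zip to a local cache first.
path = "simplecache::http://download.geofabrik.de/antarctica-latest-free.shp.zip"
with fsspec.open(path) as file:
    df = geopandas.read_file(file)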

~I have not tested this through the catalog yet, but it should work; using this snippet with one of my files works.~

It works.
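
The zipping step described above can be sketched with the standard library. This is a minimal illustration rather than anything from the Geopandas documentation: `zip_shapefile` is a hypothetical helper, and it assumes all shapefile components sit directly inside one folder.

```python
import zipfile
from pathlib import Path


def zip_shapefile(folder: str, zip_path: str) -> list:
    """Write every file in `folder` into `zip_path` and return the archived names."""
    folder_path = Path(folder)
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for component in sorted(folder_path.iterdir()):
            # Skip sub-directories and the archive itself if it lives in `folder`.
            if component.is_file() and component.resolve() != target:
                # Store files flat so the .shp sits at the archive root.
                archive.write(component, arcname=component.name)
        return archive.namelist()
```

The catalog `filepath` would then point at the resulting `.zip` instead of the folder or the `.shp` file, as the comment above suggests.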

jamespolly commented, Jul 18, 2022

@rabiahmad I’ve shared your issue and had success by following the suggestion offered by @mzjp2. You should get the same results by following these instructions for creating custom datasets and using the below code snippet for your definition file.

import logging
from copy import deepcopy
from io import BytesIO
from pathlib import PurePosixPath
from typing import Any, Dict, Union

import fsspec
import geopandas as gpd

from kedro.extras.datasets.geopandas.geojson_dataset import GeoJSONDataSet
from kedro.io.core import (
    PROTOCOL_DELIMITER,
    DataSetError,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)

logger = logging.getLogger(__name__)


class ShapefileDataSet(GeoJSONDataSet):
    """``ShapefileDataSet`` loads data from a shapefile using an underlying filesystem
    (eg: local, S3, GCS). Inherited from GeoJSONDataSet with only the load function modified;
    saving is inherited unchanged. The underlying functionality is supported by geopandas,
    so it supports all allowed geopandas (pandas) options for loading files.

    Example:
    ::

        >>> import geopandas as gpd
        >>> from kedro.extras.datasets.geopandas import ShapefileDataSet
        >>>
        >>> data = gpd.read_file('test.shp')
        >>>
        >>> data_set = ShapefileDataSet(filepath="test.shp")
        >>> data_set.save(data)
        >>> reloaded = data_set.load()
        >>>
        >>> assert data.equals(reloaded)

    """
    def _load(self) -> Union[gpd.GeoDataFrame, Dict[str, gpd.GeoDataFrame]]:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        # The parent class opens the file and passes the handle to geopandas:
        #     gpd.read_file(fs_file, **self._load_args)
        # Here the path itself is passed instead, so no file handle is opened.
        return gpd.read_file(load_path, **self._load_args)
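
A likely reason why passing the path helps (an interpretation, not something the thread confirms) is that a shapefile is a set of sibling files sharing one stem, and a single open handle on the `.shp` gives the reader no way to reach the others. The sketch below lists those siblings using only the standard library. `shapefile_components` is a hypothetical helper, not part of Kedro or geopandas.

```python
from pathlib import Path


def shapefile_components(shp_path: str) -> list:
    """Return the names of all files that share the stem of `shp_path`."""
    shp = Path(shp_path)
    # Match "<stem>.<anything>" in the same directory, e.g. .shp, .shx, .dbf, .prj.
    return sorted(p.name for p in shp.parent.glob(shp.stem + ".*") if p.is_file())
```

Calling it on `data/01_raw/sample_shape_file/sample_shape_file.shp` would show which companion files a path-based read can find next to the `.shp`.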
