DataCatalog can't read in shapefile
Description
I can’t read in shapefiles (either the shapefile folder or the .shp file) when I specify them in conf/base/catalog.yml. If I import GeoJSONDataSet in a Python script, it works fine. Is it preferable to update the current implementation of GeoJSONDataSet or to create a new extras dataset class for shapefiles?
Context
I have a shapefile folder containing .shp, .shx, .dbf, etc. files. I can pass the folder path to geopandas or GeoJSONDataSet to read it in as a GeoDataFrame, but when I specify the path in the data catalog and run the node, I get an error.
Steps to Reproduce
- This is what works:
from kedro.extras.datasets.geopandas import GeoJSONDataSet
data = GeoJSONDataSet("sample_shape_file")
data
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323e0dd0>
data_2 = GeoJSONDataSet("sample_shape_file/sample_shape_file.shp")
data_2
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323f0950>
- Here’s what I have in my catalog.yml
shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file
  save_args:
    driver: GeoJSON
- Here’s the node in my pipeline:
node(
    func=do_something,
    inputs=[
        "shape_file_data",
        "params:location",
    ],
    outputs="filtered_shape_file_data",
    name="do_something",
),
- When I run kedro run --node=do_something, I get this error:
2021-02-16 14:24:24,394 - root - INFO - ** Kedro project project_name
2021-02-16 14:24:24,434 - kedro.io.data_catalog - INFO - Loading data from `shape_file_data` (GeoJSONDataSet)...
2021-02-16 14:24:24,434 - kedro.runner.sequential_runner - WARNING - There are 1 nodes that have not run.
You can resume the pipeline run by adding the following argument to your previous command:
2021-02-16 14:24:24,442 - kedro.framework.session.store - INFO - `save()` not implemented for `BaseSessionStore`. Skipping the step.
Traceback (most recent call last):
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 208, in load
return self._load()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/extras/datasets/geopandas/geojson_dataset.py", line 149, in _load
with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
**kwargs
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
self._open()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
self.f = open(self.path, mode=self.mode)
IsADirectoryError: [Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/duongvu/.pyenv/versions/env_name/bin/kedro", line 10, in <module>
sys.exit(main())
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 696, in main
cli_collection(**cli_context)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/duongvu/project_name/cli.py", line 212, in run
pipeline_name=pipeline,
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/session/session.py", line 414, in run
run_result = runner.run(filtered_pipeline, catalog, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 100, in run
self._run(pipeline, catalog, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
run_node(node, catalog, self._is_async, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 212, in run_node
node = _run_node_sequential(node, catalog, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 285, in _run_node_sequential
inputs[name] = catalog.load(name)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 402, in load
result = func()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 611, in load
return super().load()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 217, in load
raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(filepath=/Users/duongvu/project_name/data/01_raw/shape_file_data, load_args={}, protocol=file, save_args={'driver': GeoJSON}).
[Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'
- If I specify the .shp file inside like this:
shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file/sample_shape_file.shp
  save_args:
    driver: GeoJSON
I get a different error:
kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(...) not recognized as a supported file format.
- I tried several save_args driver values (“ESRI Shapefile”, None, and “GeoJSON”). None of them works.
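The first traceback ends in fsspec's local opener calling the built-in open() on the catalog path, so the folder is being treated as a single file. A minimal sketch of that failure using only the standard library follows; the temporary folder is a hypothetical stand-in for data/01_raw/sample_shape_file, and try_open_as_file is an illustrative helper, not part of Kedro or fsspec.

```python
import tempfile
from pathlib import Path
from typing import Optional


def try_open_as_file(path: str) -> Optional[OSError]:
    """Try to open `path` as one binary file; return the error, if any."""
    try:
        with open(path, mode="rb"):
            return None
    except OSError as exc:  # IsADirectoryError on POSIX systems
        return exc


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the shapefile folder from the issue.
    shape_dir = Path(tmp) / "sample_shape_file"
    shape_dir.mkdir()
    shp_path = shape_dir / "sample_shape_file.shp"
    shp_path.write_bytes(b"")  # empty placeholder, not a real shapefile

    dir_error = try_open_as_file(str(shape_dir))  # folder path fails
    file_error = try_open_as_file(str(shp_path))  # single file opens

print(type(dir_error).__name__)
print(file_error)
```

This only shows why a folder path cannot go through a plain file handle. It does not reproduce the second error, which appears once the .shp file is opened on its own.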
My Environment
- Kedro version used: 0.17.0
- Python version used: 3.7.7
- Operating system and version: MacOS
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (8 by maintainers)

From Geopandas documentation (link) there should be an easy solution:
- ... the .shp file (.shp, .shx, .dbf, .prj, etc.)
- ... the .shp file path.
Their example:
~I have not tested this through the catalog yet, but it should work; using this snippet with one of my files works.~
It works.
@rabiahmad I ran into the same issue and had success by following the suggestion offered by @mzjp2. You should get the same results by following these instructions for creating custom datasets and using the code snippet below for your definition file.
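The snippet referred to in that comment is not reproduced here. What follows is only a minimal sketch of what such a custom dataset might look like, not the commenter's code. It assumes Kedro 0.17.x, where the base class is exposed as kedro.io.AbstractDataSet, and it assumes that passing the path straight to geopandas.read_file (rather than opening it through a file handle) is the desired behaviour. The class name ShapefileDataSet is hypothetical.

```python
from typing import Any, Dict, Optional

try:
    from kedro.io import AbstractDataSet  # class name as of Kedro 0.17.x
except ImportError:
    AbstractDataSet = object  # fallback so the sketch can be read without Kedro


class ShapefileDataSet(AbstractDataSet):
    """Hypothetical dataset that hands the path directly to geopandas."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ):
        self._filepath = filepath
        self._load_args = load_args or {}
        # Assumption: default to the shapefile driver when saving.
        self._save_args = {"driver": "ESRI Shapefile", **(save_args or {})}

    def _load(self):
        import geopandas as gpd  # imported lazily; only needed when loading

        # geopandas receives the path itself, so a folder or a .shp file
        # is resolved by geopandas rather than by a single open() call.
        return gpd.read_file(self._filepath, **self._load_args)

    def _save(self, data) -> None:
        data.to_file(self._filepath, **self._save_args)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            load_args=self._load_args,
            save_args=self._save_args,
        )
```

In catalog.yml, the type field would then point at the full import path of wherever this class lives in the project (for example a module under the project's own package), with filepath set to the shapefile folder or .shp file.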