SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab
Description
SparkDataSet with a relative local file path doesn’t work on jupyter notebook/lab
Context
We can have a SparkDataSet entry in catalog.yml whose filepath is a relative local file path, which looks like this;
something:
  type: kedro.contrib.io.pyspark.SparkDataSet
  file_format: parquet
  filepath: data/01_intermediate/something.parquet
  save_args:
    mode: overwrite
And when something.parquet is placed under <project_directory>/data/01_intermediate/something.parquet, kedro run can load this parquet successfully, as long as the pipeline uses the 'something' dataset.
But on a jupyter notebook invoked by the kedro jupyter notebook command, the following script doesn't load 'something' as expected.
io.load('something')
Instead, it raises an exception that looks like the following;
Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet
Reading spark_data_set.py (of kedro) and readwriter.py (of pyspark), I think this is caused by the spark.read.load implementation: it mistakenly tries to read the data located at /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet, which means spark.read.load somehow resolves a given relative filepath against the directory of the notebook rather than the project root.
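To illustrate, here is a minimal standalone sketch (not kedro code; it assumes Spark qualifies bare relative paths against the working directory of the process that started the JVM) that shows the same resolution behavior;

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With no scheme, Spark qualifies the path against the JVM's working
# directory; for a kernel launched in <project>/notebooks this resolves to
# file:/.../notebooks/data/01_intermediate/something.parquet
df = spark.read.load("data/01_intermediate/something.parquet", format="parquet")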
Steps to Reproduce
As I showed above,
- put some spark readable data on your local file system
- put a corresponding data entry using a relative file path on catalog.yml
- invoke jupyter notebook by using kedro jupyter notebook command at the root directory of the kedro project
- execute io.load('something') on a jupyter notebook
Expected Result
The 'something' dataframe is loaded.
Actual Result
The following exception is raised;
Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (pip show kedro or kedro -V): kedro, version 0.14.1 (anaconda3-2019.03)
- Python version used (python -V): Python 3.7.3 (anaconda3-2019.03)
- Operating system and version: macOS Mojave version 10.14.5
My personal solution
Convert the filepath to an absolute file path when a relative local file path is given, in SparkDataSet's __init__ function; the code looks like the following;
class SparkDataSet(AbstractDataSet, ExistsMixin):
    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        import re
        from os.path import abspath, curdir

        ### original version
        # self._filepath = filepath

        ### modified version
        def is_relative_path(path):
            def is_url():
                # matches schemes such as hdfs://, s3a://, file://
                return bool(re.match(r'^\S+://', path))

            def is_abspath():
                return bool(re.match(r'^/', path))

            return not (is_url() or is_abspath())

        def file_url(rpath):
            # qualify the relative path against the current working directory
            return 'file://%s/%s' % (abspath(curdir), rpath)

        self._filepath = file_url(filepath) if is_relative_path(filepath) else filepath
        self._file_format = file_format
        self._load_args = load_args if load_args is not None else {}
        self._save_args = save_args if save_args is not None else {}
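For reference, a quick illustration of what this rewriting produces (illustrative values, assuming the working directory is the project root);

# is_relative_path("data/01_intermediate/something.parquet")  -> True
# is_relative_path("hdfs://namenode/some/path")               -> False
# file_url("data/01_intermediate/something.parquet")
#   -> "file:///Users/go_kojima/sample_kedro_project/data/01_intermediate/something.parquet"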
Top GitHub Comments
I just found the cause of this issue.
In a python script (xxx.py) under [project dir]/src/<project name>/nodes/, there was a SparkSession initialization fragment at module level, which looked like the following;
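(Illustrative reconstruction; the original snippet is not preserved in this thread, and the names here are hypothetical.)

from pyspark.sql import SparkSession

# Module level: this runs at import time. When the notebook loads the
# node modules during its initialization, the SparkSession is created
# while the working directory is <project>/notebooks.
spark = SparkSession.builder.getOrCreate()

def some_node(df):  # hypothetical node function
    return df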
This was the cause.
In [project dir]/src/[project name]/run.py, the same SparkSession initialization code was present, and since kedro run executes from the project root, the kedro run command did not suffer from this issue.
But for the notebook, the node-defining python script containing the module-level SparkSession initialization shown above was loaded during the notebook initialization process, so the SparkSession was initialized with the notebook's directory as the working directory.
To avoid this issue, I moved the SparkSession initialization into the function bodies of the node-defining script, like the following;
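(Again an illustrative reconstruction with hypothetical names.)

from pyspark.sql import SparkSession

def some_node(df):  # hypothetical node function
    # The session is now created (or fetched) when the node runs,
    # not when the module is imported by the notebook.
    spark = SparkSession.builder.getOrCreate()
    return df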
This solved the issue I had been facing.
I hope this comment helps someone facing the same issue in the future.
@gotin, thank you for looking into it. Please keep us updated if you manage to reproduce the issue.