
SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab

See original GitHub issue

Description

SparkDataSet with a relative local file path doesn’t work on jupyter notebook/lab

Context

We can have a SparkDataSet entry whose filepath is a relative local file path in catalog.yml, which looks like this:

something:
  type: kedro.contrib.io.pyspark.SparkDataSet
  file_format: parquet
  filepath: data/01_intermediate/something.parquet
  save_args:
    mode: overwrite

And when something.parquet is placed under <project_directory>/data/01_intermediate/something.parquet properly, kedro run can load this parquet successfully, as long as the pipeline uses the ‘something’ dataset.

But in a Jupyter notebook started with the kedro jupyter notebook command, the following script doesn’t load ‘something’ as expected.

io.load('something')

Instead, it raises an exception that looks like the following:

Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet

Reading spark_data_set.py (in kedro) and readwriter.py (in pyspark), I think this is caused by the spark.read.load implementation. Apparently spark.read.load mistakenly tries to read the data located at /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet; somehow it resolves the given relative filepath against the notebook’s directory.
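
A quick way to confirm this is to compare the notebook’s working directory with the path Spark complains about (a minimal sketch, assuming the kernel is still running in the notebooks/ folder; the relative path is the one from the catalog entry above):

# Sanity-check sketch: Spark qualifies a scheme-less relative path against the
# driver's working directory, which for a notebook started under <project>/notebooks
# is the notebooks/ folder.
import os
from pathlib import Path

relative_filepath = "data/01_intermediate/something.parquet"  # as written in catalog.yml

print(os.getcwd())
# e.g. /Users/go_kojima/sample_kedro_project/notebooks

print(Path(relative_filepath).resolve())
# /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet
# -> the exact path reported in the AnalysisException above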

Steps to Reproduce

As I showed above,

  1. put some Spark-readable data on your local file system
  2. add a corresponding catalog entry using a relative file path in catalog.yml
  3. start Jupyter with the kedro jupyter notebook command at the root directory of the kedro project
  4. execute io.load('something') in a Jupyter notebook

Expected Result

Loads the ‘something’ dataframe.

Actual Result

Raises the following exception:

Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.14.1 (anaconda3-2019.03)

  • Python version used (python -V): Python 3.7.3 (anaconda3-2019.03)

  • Operating system and version: macOS Mojave version 10.14.5

My personal solution

Convert the filepath into an absolute file path when a relative local file path is given. The SparkDataSet __init__ with this change looks like the following:

# (excerpt from kedro's spark_data_set.py; other imports as in the original module)
from typing import Any, Dict, Optional


class SparkDataSet(AbstractDataSet, ExistsMixin):

    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        import re
        from os.path import abspath, curdir

        ### original version
        # self._filepath = filepath

        ### modified version
        def is_relative_path(path):
            def is_url():
                # anything with a scheme, e.g. s3://, hdfs://, file://
                return bool(re.match(r'^\S+://', path))

            def is_abspath():
                # POSIX absolute path
                return bool(re.match(r'^/', path))

            return not (is_url() or is_abspath())

        def file_url(rpath):
            # prefix a bare relative path with the current working directory
            return 'file://%s/%s' % (abspath(curdir), rpath)

        self._filepath = file_url(filepath) if is_relative_path(filepath) else filepath

        self._file_format = file_format
        self._load_args = load_args if load_args is not None else {}
        self._save_args = save_args if save_args is not None else {}
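
The same normalization can be sanity-checked outside the class with a standalone helper (a hypothetical sketch, not part of kedro):

# Hypothetical standalone sketch of the same normalization, for a quick check.
import re
from os.path import abspath, curdir

def to_spark_path(path):
    # Leave URLs (anything with a scheme) and absolute paths untouched;
    # prefix bare relative paths with the current working directory.
    if re.match(r'^\S+://', path) or path.startswith('/'):
        return path
    return 'file://%s/%s' % (abspath(curdir), path)

print(to_spark_path('data/01_intermediate/something.parquet'))
# file:///<current working directory>/data/01_intermediate/something.parquet
print(to_spark_path('s3a://bucket/key.parquet'))  # unchanged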


Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

gotin commented, Jul 10, 2019 (2 reactions)

I just found the cause of this issue.

In a Python script (xxx.py) under [project dir]/src/<project name>/nodes/, there was a SparkSession initialization fragment that looks like the following:

# in xxx.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

This was the cause.

In [project dir]/src/[project name]/run.py, the SparkSession initialization code was already in place, so kedro run command execution didn’t suffer from this issue:

# in run.py

from pyspark.sql import SparkSession

def init_spark_session(aws_access_key=None, aws_secret_key=None):
    spark = (SparkSession.builder.master("local[*]")
             .appName("kedro")
             .config("spark.executor.memory", "24G")
             .config("spark.executor.cores", "10")
             .config("spark.driver.memory", "4G")
             .config("spark.sql.execution.arrow.enabled", "true")
             .config("spark.driver.maxResultSize", "3g")
             .getOrCreate())
    return spark

def main(
    tags: Iterable[str] = None,
    env: str = None,
    runner: str = None,
):

    # Load Catalog
    conf = get_config(project_path=str(Path.cwd()), env=env)
    catalog = create_catalog(config=conf)

    spark = init_spark_session()
    # Load the pipeline
    pipeline = create_pipeline()
    pipeline = pipeline.only_nodes_with_tags(*tags) if tags else pipeline

But for the notebook, the node-function-defining Python script containing the SparkSession initialization shown above was loaded during the notebook initialization process, so the SparkSession was initialized with the notebook’s directory as the working directory.
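
The key detail is that SparkSession.builder.getOrCreate() behaves like a singleton accessor: the first call starts the JVM with whatever working directory the Python process has at that moment, and every later call returns that same session. A minimal standalone sketch of that behaviour:

# Minimal sketch of the getOrCreate() singleton behaviour.
from pyspark.sql import SparkSession

first = SparkSession.builder.appName("first").getOrCreate()
second = SparkSession.builder.appName("second").getOrCreate()

print(first is second)
# True: the second call returns the already-created session, so the working
# directory captured at the first call (e.g. the notebooks/ folder) sticks.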

To avoid this issue, I moved the SparkSession initialization fragment inside the function definitions of the node-defining script, like the following:

from .. import run

def func1(df, params):
  spark = run.init_spark_session()
  # some scripts using spark session

This solved the issue I had been facing.

I hope this comment helps someone facing the same issue in the future.

DmitriiDeriabinQB commented, Jul 9, 2019 (0 reactions)

@gotin, thank you for looking into it. Please keep us updated if you manage to reproduce the issue.
