
SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab

See original GitHub issue

Description

SparkDataSet with a relative local file path doesn’t work on jupyter notebook/lab

Context

We can have a SparkDataSet entry whose filepath is a relative local file path in catalog.yml, which looks like this:

something:
  type: kedro.contrib.io.pyspark.SparkDataSet
  file_format: parquet
  filepath: data/01_intermediate/something.parquet
  save_args:
    mode: overwrite

And when something.parquet is placed under <project_directory>/data/01_intermediate/something.parquet properly, kedro run can load this parquet successfully, as long as the pipeline uses the ‘something’ dataset.

But in a Jupyter notebook started with the kedro jupyter notebook command, the following script doesn’t load ‘something’ as expected.

io.load('something')

Instead, it raises an exception that looks like the following:

Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet

Reading spark_data_set.py (in kedro) and readwriter.py (in pyspark), I think this is caused by the spark.read.load implementation. Apparently spark.read.load mistakenly tries to read the data located at /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet; somehow it resolves the given relative filepath against the notebook’s directory.
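
A quick way to confirm this is to compare the notebook’s working directory with the path Spark complains about (a minimal sketch, assuming the kernel is still running in the notebooks/ folder; the relative path is the one from the catalog entry above):

# Sanity-check sketch: Spark qualifies a scheme-less relative path against the
# driver's working directory, which for a notebook started under <project>/notebooks
# is the notebooks/ folder.
import os
from pathlib import Path

relative_filepath = "data/01_intermediate/something.parquet"  # as written in catalog.yml

print(os.getcwd())
# e.g. /Users/go_kojima/sample_kedro_project/notebooks

print(Path(relative_filepath).resolve())
# /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet
# -> the exact path reported in the AnalysisException above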

Steps to Reproduce

As I showed above,

  1. put some Spark-readable data on your local file system
  2. add a corresponding catalog entry using a relative file path in catalog.yml
  3. start Jupyter with the kedro jupyter notebook command at the root directory of the kedro project
  4. execute io.load('something') in a Jupyter notebook

Expected Result

Loads the ‘something’ dataframe.

Actual Result

Raises the following exception:

Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.14.1 (anaconda3-2019.03)

  • Python version used (python -V): Python 3.7.3 (anaconda3-2019.03)

  • Operating system and version: macOS Mojave version 10.14.5

My personal solution

Convert the filepath into an absolute file path when a relative local file path is given. The SparkDataSet __init__ with this change looks like the following:

# (excerpt from kedro's spark_data_set.py; other imports as in the original module)
from typing import Any, Dict, Optional


class SparkDataSet(AbstractDataSet, ExistsMixin):

    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        import re
        from os.path import abspath, curdir

        ### original version
        # self._filepath = filepath

        ### modified version
        def is_relative_path(path):
            def is_url():
                # anything with a scheme, e.g. s3://, hdfs://, file://
                return bool(re.match(r'^\S+://', path))

            def is_abspath():
                # POSIX absolute path
                return bool(re.match(r'^/', path))

            return not (is_url() or is_abspath())

        def file_url(rpath):
            # prefix a bare relative path with the current working directory
            return 'file://%s/%s' % (abspath(curdir), rpath)

        self._filepath = file_url(filepath) if is_relative_path(filepath) else filepath

        self._file_format = file_format
        self._load_args = load_args if load_args is not None else {}
        self._save_args = save_args if save_args is not None else {}
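
The same normalization can be sanity-checked outside the class with a standalone helper (a hypothetical sketch, not part of kedro):

# Hypothetical standalone sketch of the same normalization, for a quick check.
import re
from os.path import abspath, curdir

def to_spark_path(path):
    # Leave URLs (anything with a scheme) and absolute paths untouched;
    # prefix bare relative paths with the current working directory.
    if re.match(r'^\S+://', path) or path.startswith('/'):
        return path
    return 'file://%s/%s' % (abspath(curdir), path)

print(to_spark_path('data/01_intermediate/something.parquet'))
# file:///<current working directory>/data/01_intermediate/something.parquet
print(to_spark_path('s3a://bucket/key.parquet'))  # unchanged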


Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

gotin commented, Jul 10, 2019 (2 reactions)

I just found the cause of this issue.

In a Python script (xxx.py) under [project dir]/src/<project name>/nodes/, there was a SparkSession initialization fragment that looks like the following:

# in xxx.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

This was the cause.

In [project dir]/src/[project name]/run.py, the SparkSession initialization code was already in place, so kedro run command execution didn’t suffer from this issue:

# in run.py

from pyspark.sql import SparkSession

def init_spark_session(aws_access_key=None, aws_secret_key=None):
    spark = (SparkSession.builder.master("local[*]")
             .appName("kedro")
             .config("spark.executor.memory", "24G")
             .config("spark.executor.cores", "10")
             .config("spark.driver.memory", "4G")
             .config("spark.sql.execution.arrow.enabled", "true")
             .config("spark.driver.maxResultSize", "3g")
             .getOrCreate())
    return spark

def main(
    tags: Iterable[str] = None,
    env: str = None,
    runner: str = None,
):

    # Load Catalog
    conf = get_config(project_path=str(Path.cwd()), env=env)
    catalog = create_catalog(config=conf)

    spark = init_spark_session()
    # Load the pipeline
    pipeline = create_pipeline()
    pipeline = pipeline.only_nodes_with_tags(*tags) if tags else pipeline

But for the notebook, the node-function-defining Python script containing the SparkSession initialization shown above was loaded during the notebook initialization process, so the SparkSession was initialized with the notebook’s directory as the working directory.
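
The key detail is that SparkSession.builder.getOrCreate() behaves like a singleton accessor: the first call starts the JVM with whatever working directory the Python process has at that moment, and every later call returns that same session. A minimal standalone sketch of that behaviour:

# Minimal sketch of the getOrCreate() singleton behaviour.
from pyspark.sql import SparkSession

first = SparkSession.builder.appName("first").getOrCreate()
second = SparkSession.builder.appName("second").getOrCreate()

print(first is second)
# True: the second call returns the already-created session, so the working
# directory captured at the first call (e.g. the notebooks/ folder) sticks.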

To avoid this issue, I moved the SparkSession initialization fragment inside the function definitions of the node-defining script, like the following:

from .. import run

def func1(df, params):
  spark = run.init_spark_session()
  # some scripts using spark session

This solved the issue I had been facing.

I hope this comment helps someone facing the same issue in the future.

DmitriiDeriabinQB commented, Jul 9, 2019 (0 reactions)

@gotin, thank you for looking into it. Please keep us updated if you manage to reproduce the issue.
