
Writing a DataFrame to an Excel file doesn't produce Operation in Spark Logical Plan


Hello, I’m trying to add support for spark-excel in our data lineage tracking tool, Spline.

We use Spark’s QueryExecutionListener to capture the logical plan and then build the lineage from it. Unfortunately, there is no operation in the plan for the write to Excel; the read from Excel, on the other hand, is present in the plan.
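
For context, the hook we rely on looks roughly like this. This is only a minimal sketch to illustrate the mechanism, not Spline’s actual listener, and the class name is made up:

  import org.apache.spark.sql.execution.QueryExecution
  import org.apache.spark.sql.util.QueryExecutionListener

  // Minimal sketch: print the analyzed logical plan of every successful action.
  class PlanPrintingListener extends QueryExecutionListener {
    override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
      println(s"action: $funcName")   // e.g. "save" for DataFrameWriter.save
      println(qe.analyzed.treeString) // the plan lineage is derived from
    }
    override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
  }

  // Register before running any actions:
  // spark.listenerManager.register(new PlanPrintingListener)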

We will be adding support for the read operation; if you manage to make Spark include the write operation in the plan, we can add support for that in Spline as well.

// This is the code I use when trying to produce the plan:

  import spark.implicits._ // needed for the $"..." column syntax

  val df = spark.read
    .format("com.crealytics.spark.excel")
    .option("useHeader", "true") // required by spark-excel
    .load("/Users/abac720/test.xlsx")

  val res = df.select($"number" + 1, $"text")

  res.write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'My Sheet'!B3:C35")
    .option("useHeader", "true")
    .save("data/output/batchWithDependencies/result.xlsx")

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5

Top GitHub Comments

2 reactions
cerveada commented, Mar 2, 2020

In the end I was able to extract lineage data for both the read and the write.

I don’t know why it didn’t work from the start, but it does now, so I’m closing this ticket.
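
(A guess as to why it didn’t fire at first: the listener has to be registered before the action runs. Since Spark 2.3 it can also be wired up via configuration at session startup, which rules out ordering problems. A hypothetical sketch reusing the listener class from above; the package name com.example is made up:)

  import org.apache.spark.sql.SparkSession

  // spark.sql.queryExecutionListeners is a static conf (Spark 2.3+) that
  // instantiates listener classes when the session starts, so the listener
  // is in place before any action runs.
  val spark = SparkSession.builder()
    .appName("lineage-test")
    .config("spark.sql.queryExecutionListeners", "com.example.PlanPrintingListener")
    .getOrCreate()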

1 reaction
cerveada commented, Feb 21, 2020

OK, I will add support for reading only for now; once the new version is released, we can add support for writing as well.

Read more comments on GitHub.

Top Results From Across the Web

Spark SQL, DataFrames and Datasets Guide
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. As...

Work with Apache Spark Scala DataFrames - Azure Databricks
Learn how to load and transform data using the Apache Spark Scala DataFrame API in Azure Databricks.

Spark's Logical and Physical plans … When, Why ... - Medium
An execution plan is the set of operations executed to translate a query language statement (SQL, Spark SQL, Dataframe operations etc.) ...

Spark vs Pandas, part 2 - Towards Data Science
In contrast to Pandas, Spark uses a lazy execution model. This means that when you apply some transformation to a DataFrame, the data...

Data Cleaning with Apache Spark - Notes by Louisa - GitBook
Spark will automatically create columns in a DataFrame based on the sep argument: df1 = spark.read.csv('datafile.csv.gz', sep=','). Defaults to using ,. Can still ...
