Correctly Identify AWS Glue as the Metastore when running Spark on Amazon EMR
Describe the bug
When using Apache Spark on Amazon EMR, you have the ability to replace Apache Hive with AWS Glue as the Hive-compatible metastore, with the underlying data stored in S3. When doing so, the lineage metadata sent to DataHub incorrectly identifies the data catalog tables as Apache Hive Datasets rather than AWS Glue Datasets. Additionally, the tables do not contain the schema, nor do they have an upstream link to the underlying data in Amazon S3.
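As an illustration of the misidentification (using placeholder database and table names and DataHub's default PROD environment), the datasets currently appear under the Hive platform, whereas one would expect the Glue platform in the dataset URN:
# Placeholder names for illustration only; PROD is DataHub's default environment.
observed_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,glue_db.target_table,PROD)"
expected_urn = "urn:li:dataset:(urn:li:dataPlatform:glue,glue_db.target_table,PROD)"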
To Reproduce
Steps to reproduce the behavior:
- Create a PySpark job that reads data from one or more AWS Glue Data Catalog tables, transforms the data, and writes the results back to an AWS Glue Data Catalog table (a minimal sketch of the job body follows the configuration below). Create the SparkSession configuration as follows, updating it to your own DataHub REST endpoint:
from pyspark.sql import SparkSession

# Configure Spark to use the AWS Glue Data Catalog as the Hive metastore and to
# emit lineage to DataHub via the datahub-spark-lineage listener.
spark = SparkSession \
    .builder \
    .appName("Spark Glue DataHub Test App") \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.27") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://111.222.333.444:55555") \
    .enableHiveSupport() \
    .getOrCreate()
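For completeness, here is a minimal sketch of the kind of job body used to reproduce the issue, reusing the SparkSession configured above; the database and table names (glue_db.source_table, glue_db.target_table) and the transformation are placeholders, not the exact job from this report:

# Placeholder table names -- substitute your own Glue Data Catalog database and tables.
source_df = spark.table("glue_db.source_table")

# Any transformation will do for reproducing the lineage issue.
transformed_df = source_df.filter(source_df["some_column"].isNotNull())

# Write back to a Glue Data Catalog table; this is the dataset that shows up in
# DataHub as a Hive Dataset instead of a Glue Dataset.
transformed_df.write.mode("overwrite").saveAsTable("glue_db.target_table")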
- Run the PySpark job on Amazon EMR; I am using EMR v6.5.0 (Spark v3.1.2)
Expected behavior
I would expect that when using Apache Spark on Amazon EMR and substituting Apache Hive with AWS Glue as the metastore, with the underlying data stored in S3, the data catalog tables (Datasets) identified as part of the lineage graph would be of type AWS Glue. Additionally, it would be a bonus if they contained the schema and an upstream link to the underlying Amazon S3 data associated with the Glue table.
Screenshots
Desktop (please complete the following information):
- OS: macOS 11.6 (20G165)
- Browser: Google Chrome 99.0.4844.51 (Official Build) (x86_64)
Additional context
None.
Top GitHub Comments
@garystafford Can you please capture debug-level logs with the latest version? To enable debug logging, add the configuration below to the log4j.properties file:
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
I think the Spark execution plan doesn't know whether a table is Glue or Hive. We should add an option to identify Hive tables as Glue if the user wants that.
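For illustration only, a sketch of what such an option could look like in the SparkSession configuration; the key name spark.datahub.metadata.dataset.hivePlatformAlias is hypothetical and does not exist in datahub-spark-lineage 0.8.27:

# Hypothetical sketch only: this configuration key is not an existing
# datahub-spark-lineage setting. It illustrates the kind of user-facing option
# described above, i.e. reporting datasets that Spark resolves as Hive tables
# under the Glue platform in DataHub instead.
spark = SparkSession \
    .builder \
    .appName("Spark Glue DataHub Test App") \
    .config("spark.datahub.metadata.dataset.hivePlatformAlias", "glue") \
    .enableHiveSupport() \
    .getOrCreate()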