Correctly Identify AWS Glue as the Metastore when running Spark on Amazon EMR

Describe the bug
When using Apache Spark on Amazon EMR, you have the ability to replace Apache Hive with AWS Glue as the Hive-compatible metastore, with the underlying data being stored in S3. When doing so, the lineage metadata sent to DataHub incorrectly identifies the data catalog tables as Apache Hive Datasets and not AWS Glue Datasets. Additionally, the tables do not contain the schema, nor do they have an upstream link to the underlying data in Amazon S3.

To Reproduce
Steps to reproduce the behavior:

  1. Create a PySpark job that reads data from one or more AWS Glue Data Catalog tables, transforms the data, and writes back to an AWS Glue Data Catalog table. Create the SparkSession configuration as follows, updating it to your own DataHub REST endpoint:
from pyspark.sql import SparkSession

# Use the AWS Glue Data Catalog as the Hive-compatible metastore and
# register the DataHub lineage listener.
spark = SparkSession \
    .builder \
    .appName("Spark Glue DataHub Test App") \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.27") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://111.222.333.444:55555") \
    .enableHiveSupport() \
    .getOrCreate()
  2. Run the PySpark job on Amazon EMR. I am using EMR v6.5.0 (Spark v3.1.2).

Expected behavior
When using Apache Spark on Amazon EMR with AWS Glue substituted for Apache Hive as the metastore and the underlying data stored in S3, I would expect the data catalog tables (Datasets) identified in the lineage graph to be of type AWS Glue. Additionally, it would be a bonus if they contained the schema and an upstream link to the underlying Amazon S3 data associated with the Glue table.
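
To make the difference concrete, here is a minimal sketch of the two dataset identifiers involved, which differ only in the platform segment. It assumes DataHub's usual dataset URN layout. The helper `dataset_urn` and the table name `sales_db.orders` are hypothetical and are not taken from the issue.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN for the given platform."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical database.table name, for illustration only.
table = "sales_db.orders"

reported = dataset_urn("hive", table)  # platform the issue says is emitted
expected = dataset_urn("glue", table)  # platform the reporter expects

print(reported)
print(expected)
```

Because the platform is part of the identifier, a table reported under `hive` would show up as a different Dataset from the same table ingested under `glue`.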

Desktop (please complete the following information):

  • OS: macOS 11.6 (20G165)
  • Browser: Google Chrome
  • Version 99.0.4844.51 (Official Build) (x86_64)

Additional context None.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

MugdhaHardikar-GSLab commented, Mar 21, 2022

@garystafford Can you please capture debug-level logs with the latest version? To enable debug logging, add the following configuration to the log4j.properties file:
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
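
If the properties file is managed by a bootstrap or deployment script, the two logger lines can be appended idempotently. The following is a rough sketch only: the helper name `enable_datahub_debug` is hypothetical, and the location of log4j.properties on a given cluster is an assumption you would need to confirm. The demo writes to a temporary file instead of a real Spark configuration.

```python
import tempfile
from pathlib import Path
from typing import List

# Logger settings quoted in the comment above.
DEBUG_LINES = [
    "log4j.logger.datahub.spark=DEBUG",
    "log4j.logger.datahub.client.rest=DEBUG",
]


def enable_datahub_debug(properties_path: str) -> List[str]:
    """Append any missing DataHub debug logger lines; return the lines added."""
    path = Path(properties_path)
    text = path.read_text() if path.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [line for line in DEBUG_LINES if line not in present]
    if missing:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        path.write_text(text + prefix + "\n".join(missing) + "\n")
    return missing


# Demo against a throwaway file rather than a real Spark config.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "log4j.properties"
    demo.write_text("log4j.rootCategory=INFO, console")
    first = enable_datahub_debug(str(demo))   # adds both lines
    second = enable_datahub_debug(str(demo))  # nothing left to add
    final_text = demo.read_text()
```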

treff7es commented, Sep 16, 2022

I think the Spark execution plan doesn’t know whether the metastore is Glue or Hive. We should add an option to identify Hive tables as Glue if the user wants that.
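
One way to picture such an option is a post-processing step that rewrites the platform in a dataset URN only when the user opts in. This is a sketch under stated assumptions, not the DataHub Spark listener's implementation: the option name `hive_platform_alias` and the helper `apply_platform_alias` are hypothetical, and the table names are placeholders.

```python
import re
from typing import Optional

# Matches urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,]+)\)$"
)


def apply_platform_alias(urn: str, hive_platform_alias: Optional[str] = None) -> str:
    """Rewrite a hive dataset URN to the aliased platform if an alias is set.

    URNs for other platforms, and unparseable strings, are returned unchanged.
    """
    if not hive_platform_alias:
        return urn
    match = _URN_RE.match(urn)
    if match is None or match.group(1) != "hive":
        return urn
    _, name, env = match.groups()
    return f"urn:li:dataset:(urn:li:dataPlatform:{hive_platform_alias},{name},{env})"


# Hypothetical URNs, for illustration only.
hive_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,sales_db.orders,PROD)"
s3_urn = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/orders,PROD)"

print(apply_platform_alias(hive_urn, "glue"))
print(apply_platform_alias(s3_urn, "glue"))
```

Making the alias opt-in keeps the current behavior for users who really do run a Hive metastore.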
