Correctly Identify AWS Glue as the Metastore when running Spark on Amazon EMR
Describe the bug
When using Apache Spark on Amazon EMR, you have the ability to replace Apache Hive with AWS Glue as the Hive-compatible metastore, with the underlying data stored in S3. When doing so, the lineage metadata sent to DataHub incorrectly identifies the data catalog tables as Apache Hive Datasets rather than AWS Glue Datasets. Additionally, the tables do not contain the schema, nor do they have an upstream link to the underlying data in Amazon S3.
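As an illustration of the misidentification (using placeholder database and table names and DataHub's default PROD environment), the datasets currently appear under the Hive platform, whereas one would expect the Glue platform in the dataset URN:
# Placeholder names for illustration only; PROD is DataHub's default environment.
observed_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,glue_db.target_table,PROD)"
expected_urn = "urn:li:dataset:(urn:li:dataPlatform:glue,glue_db.target_table,PROD)"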
To Reproduce
Steps to reproduce the behavior:
- Create a PySpark job that reads data from one or more AWS Glue Data Catalog tables, transforms the data, and writes the results back to an AWS Glue Data Catalog table (a minimal sketch of the job body follows the configuration below). Create the SparkSession configuration as follows, updating it to your own DataHub REST endpoint:
from pyspark.sql import SparkSession

# Configure Spark to use the AWS Glue Data Catalog as the Hive metastore and to
# emit lineage to DataHub via the datahub-spark-lineage listener.
spark = SparkSession \
    .builder \
    .appName("Spark Glue DataHub Test App") \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.27") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://111.222.333.444:55555") \
    .enableHiveSupport() \
    .getOrCreate()
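For completeness, here is a minimal sketch of the kind of job body used to reproduce the issue, reusing the SparkSession configured above; the database and table names (glue_db.source_table, glue_db.target_table) and the transformation are placeholders, not the exact job from this report:

# Placeholder table names -- substitute your own Glue Data Catalog database and tables.
source_df = spark.table("glue_db.source_table")

# Any transformation will do for reproducing the lineage issue.
transformed_df = source_df.filter(source_df["some_column"].isNotNull())

# Write back to a Glue Data Catalog table; this is the dataset that shows up in
# DataHub as a Hive Dataset instead of a Glue Dataset.
transformed_df.write.mode("overwrite").saveAsTable("glue_db.target_table")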
- Run the PySpark job on Amazon EMR; I am using EMR v6.5.0 (Spark v3.1.2)
Expected behavior
I would expect that when using Apache Spark on Amazon EMR and substituting Apache Hive with AWS Glue as the metastore, with the underlying data stored in S3, the data catalog tables (Datasets) identified as part of the lineage graph would be of type AWS Glue. Additionally, it would be a bonus if they contained the schema and an upstream link to the underlying Amazon S3 data associated with the Glue table.
Screenshots
Desktop (please complete the following information):
- OS: macOS 11.6 (20G165)
- Browser: Google Chrome 99.0.4844.51 (Official Build) (x86_64)
Additional context
None.
Top GitHub Comments
@garystafford Can you please capture debug-level logs with the latest version? To enable debug logging, add the configuration below to the log4j.properties file:
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
I think the Spark execution plan doesn't know whether a table is Glue or Hive. We should add an option to identify Hive tables as Glue if the user wants that.
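For illustration only, a sketch of what such an option could look like in the SparkSession configuration; the key name spark.datahub.metadata.dataset.hivePlatformAlias is hypothetical and does not exist in datahub-spark-lineage 0.8.27:

# Hypothetical sketch only: this configuration key is not an existing
# datahub-spark-lineage setting. It illustrates the kind of user-facing option
# described above, i.e. reporting datasets that Spark resolves as Hive tables
# under the Glue platform in DataHub instead.
spark = SparkSession \
    .builder \
    .appName("Spark Glue DataHub Test App") \
    .config("spark.datahub.metadata.dataset.hivePlatformAlias", "glue") \
    .enableHiveSupport() \
    .getOrCreate()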