[Spark] Cannot append to Glue table - StorageDescriptor#InputFormat cannot be null for table
Apache Iceberg version
0.14.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
Hello,
I’m trying to test Iceberg on AWS Glue (Glue version 3.0, Spark 3.1).
I was able to create a new table; however, when I try to append a dataframe to the table, I receive the following error:
"Exception in User Class: org.apache.spark.sql.AnalysisException : org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable. StorageDescriptor#InputFormat cannot be null for table: mytable (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)"
Do I need to specify InputFormat when I’m creating table?
Here is the code of my Glue job:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.log.GlueLogger
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

import scala.collection.JavaConverters._

object Main {
  def main(sysArgs: Array[String]): Unit = {
    val sparkConf = new SparkConf
    val catalog = "glue_catalog"
    val tableName = "mytable"
    val dbName = "mydb"
    val s3Bucket = "mybucket"
    val nRows = 100

    implicit val sparkSession: SparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .config(s"spark.sql.catalog.$catalog", "org.apache.iceberg.spark.SparkCatalog")
      .config(s"spark.sql.catalog.$catalog.warehouse", s"s3://$s3Bucket/iceberg/")
      .config(s"spark.sql.catalog.$catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
      .config(s"spark.sql.catalog.$catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .appName("iceberg_poc")
      .enableHiveSupport()
      .getOrCreate()

    val glueContext: GlueContext = new GlueContext(sparkSession.sparkContext)
    val logger = new GlueLogger
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val partitionDate = "2022-08-17"
    val dateFormat = new java.text.SimpleDateFormat("yyyy-MM-dd")
    val utilDate = dateFormat.parse(partitionDate)
    val sqlDate = new java.sql.Date(utilDate.getTime())

    import scala.util.Random
    import sparkSession.implicits._

    val df = (1 to nRows)
      .map(_ => (Random.nextLong, Random.nextString(10), Random.nextDouble, Random.nextBoolean, sqlDate))
      .toDF("test_integer", "test_string", "test_double", "test_boolean", "partition_date")

    // This is the line that fails (Main.scala:61 in the stack trace below)
    val tableExists = sparkSession.catalog.tableExists(s"$dbName.$tableName")
    println(s"Table $tableName exists=$tableExists")

    if (tableExists) {
      println("Appending table")
      df.writeTo(s"$catalog.$dbName.$tableName")
        .append()
    } else {
      println("Creating table")
      df.writeTo(s"$catalog.$dbName.$tableName")
        .partitionedBy(col("partition_date"))
        .tableProperty("format-version", "2")
        .create()
    }

    Job.commit()
  }
}
Glue table metadata:
{
  "Table": {
    "Name": "mytable",
    "DatabaseName": "mydb",
    "CreateTime": "2022-08-17T15:40:22+02:00",
    "UpdateTime": "2022-08-17T15:40:22+02:00",
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "test_integer",
          "Type": "bigint",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "1",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "test_string",
          "Type": "string",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "2",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "test_double",
          "Type": "double",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "3",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "test_boolean",
          "Type": "boolean",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "4",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "partition_date",
          "Type": "date",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "5",
            "iceberg.field.optional": "true"
          }
        }
      ],
      "Location": "s3://mybucket/mytable",
      "Compressed": false,
      "NumberOfBuckets": 0,
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "metadata_location": "s3://mybucket/metadata.json",
      "table_type": "ICEBERG"
    },
    "CreatedBy": "arn:aws:sts::00000000:assumed-role/xxx/GlueJobRunnerSession",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "0000000",
    "VersionId": "0"
  }
}
Error stacktrace:
2022-08-17 13:49:28,852 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable. StorageDescriptor#InputFormat cannot be null for table: mytable (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:135)
at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:879)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:462)
at org.apache.spark.sql.internal.CatalogImpl.tableExists(CatalogImpl.scala:260)
at org.apache.spark.sql.internal.CatalogImpl.tableExists(CatalogImpl.scala:252)
at Main$.main(Main.scala:61)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Top GitHub Comments
Hello again @singhpk234 ,
I was able to fix the issue by rewriting all the code based on the official AWS docs, and the append started working again. I was not able to fix
sparkSession.catalog.tableExists(s"$catalog.$dbName.$tableName")
so I implemented a workaround that works as expected. Thanks for your help!
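The commenter's actual workaround snippet was not captured above. One sketch of such a workaround (a hypothetical helper using this thread's example names, not the commenter's code) is to resolve the table through the Iceberg catalog directly, rather than through sparkSession.catalog.tableExists, which on Spark 3.1 goes through the session (Hive) catalog and trips over the missing InputFormat:

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object IcebergTableCheck {
  // Hypothetical helper: probe the table via the configured Iceberg
  // catalog ("glue_catalog" in this thread). spark.table() resolves the
  // multi-part identifier through the v2 catalog plugin, so it never
  // consults the Hive-backed session catalog.
  def tableExists(spark: SparkSession, catalog: String,
                  db: String, table: String): Boolean =
    Try(spark.table(s"$catalog.$db.$table")).isSuccess

  // An alternative probe via SQL, also routed through the v2 catalog:
  //   spark.sql(s"SHOW TABLES IN $catalog.$db")
  //        .where(s"tableName = '$table'")
  //        .count() > 0
}
```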
Thanks for your input. That line is not causing the issue; it is caused by the DataFrameWriterV2 append() method:
df.writeTo(s"$catalog.$dbName.$tableName").append()
Anyway, I changed the line you asked about to include the catalog name in the tableExists call, but I received another error. The error goes away when I downgrade to Iceberg 0.13.0.
Error trace when using Iceberg 0.14.0 with the catalog name included in
sparkSession.catalog.tableExists(s"$catalog.$dbName.$tableName")