
[Spark] Cannot append to Glue table - StorageDescriptor#InputFormat cannot be null for table

See original GitHub issue

Apache Iceberg version

0.14.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

Hello,

I’m trying to test Iceberg on AWS Glue (Glue version 3.0, Spark 3.1).

I was able to create a new table; however, when I try to append a dataframe to it, I receive the following error: "Exception in User Class: org.apache.spark.sql.AnalysisException : org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable. StorageDescriptor#InputFormat cannot be null for table: mytable (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)"

Do I need to specify InputFormat when creating the table?

Here is the code of my Glue job:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.log.GlueLogger
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

import scala.collection.JavaConverters._

object Main {

  def main(sysArgs: Array[String]): Unit = {
    val sparkConf = new SparkConf

    val catalog = "glue_catalog"
    val tableName = "mytable"
    val dbName = "mydb"
    val s3Bucket = "mybucket"
    val nRows = 100

    implicit val sparkSession: SparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .config(s"spark.sql.catalog.$catalog", "org.apache.iceberg.spark.SparkCatalog")
      .config(s"spark.sql.catalog.$catalog.warehouse", s"s3:/$s3Bucket/iceberg/")
      .config(s"spark.sql.catalog.$catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
      .config(s"spark.sql.catalog.$catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .appName(s"iceberg_poc")
      .enableHiveSupport()
      .getOrCreate()

    val glueContext: GlueContext = new GlueContext(sparkSession.sparkContext)
    val logger                   = new GlueLogger
    
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val partitionDate = "2022-08-17"
    val dateFormat = new java.text.SimpleDateFormat("yyyy-MM-dd")
    val utilDate = dateFormat.parse(partitionDate)
    val sqlDate = new java.sql.Date(utilDate.getTime())
    
    import scala.util.Random
    import sparkSession.implicits._
    
    val df = (1 to nRows)
        .map(_ => (Random.nextLong, Random.nextString(10), Random.nextDouble, Random.nextBoolean, sqlDate))
        .toDF("test_integer", "test_string", "test_double", "test_boolean", "partition_date")
    
    val tableExists = sparkSession.catalog.tableExists(s"$dbName.$tableName")
    
    println(s"Table $tableName exists=$tableExists")
    
    if (tableExists) {
      println("Appending to table")

      df.writeTo(s"$catalog.$dbName.$tableName")
        .append()

    } else {
      println("Creating table")

      df.writeTo(s"$catalog.$dbName.$tableName")
        .partitionedBy(col("partition_date"))
        .tableProperty("format-version", "2")
        .create()
    }
    
    Job.commit()
  }
}
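
For reference, the create branch above is roughly equivalent to the following SQL DDL (a sketch assuming the same catalog, database, and table names, with column types taken from the Glue metadata below):

sparkSession.sql(
  s"""CREATE TABLE $catalog.$dbName.$tableName (
     |  test_integer bigint,
     |  test_string string,
     |  test_double double,
     |  test_boolean boolean,
     |  partition_date date)
     |USING iceberg
     |PARTITIONED BY (partition_date)
     |TBLPROPERTIES ('format-version' = '2')
     |""".stripMargin)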

Glue table metadata:

{
    "Table": {
        "Name": "mytable",
        "DatabaseName": "mydb",
        "CreateTime": "2022-08-17T15:40:22+02:00",
        "UpdateTime": "2022-08-17T15:40:22+02:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "test_integer",
                    "Type": "bigint",
                    "Parameters": {
                        "iceberg.field.current": "true",
                        "iceberg.field.id": "1",
                        "iceberg.field.optional": "true"
                    }
                },
                {
                    "Name": "test_string",
                    "Type": "string",
                    "Parameters": {
                        "iceberg.field.current": "true",
                        "iceberg.field.id": "2",
                        "iceberg.field.optional": "true"
                    }
                },
                {
                    "Name": "test_double",
                    "Type": "double",
                    "Parameters": {
                        "iceberg.field.current": "true",
                        "iceberg.field.id": "3",
                        "iceberg.field.optional": "true"
                    }
                },
                {
                    "Name": "test_boolean",
                    "Type": "boolean",
                    "Parameters": {
                        "iceberg.field.current": "true",
                        "iceberg.field.id": "4",
                        "iceberg.field.optional": "true"
                    }
                },
                {
                    "Name": "partition_date",
                    "Type": "date",
                    "Parameters": {
                        "iceberg.field.current": "true",
                        "iceberg.field.id": "5",
                        "iceberg.field.optional": "true"
                    }
                }
            ],
            "Location": "s3://mybucket/mytable",
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://mybucket/metadata.json",
            "table_type": "ICEBERG"
        },
        "CreatedBy": "arn:aws:sts::00000000:assumed-role/xxx/GlueJobRunnerSession",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "0000000",
        "VersionId": "0"
    }
}

Error stacktrace:

2022-08-17 13:49:28,852 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable. StorageDescriptor#InputFormat cannot be null for table: mytable (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:135)
	at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:879)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:462)
	at org.apache.spark.sql.internal.CatalogImpl.tableExists(CatalogImpl.scala:260)
	at org.apache.spark.sql.internal.CatalogImpl.tableExists(CatalogImpl.scala:252)
	at Main$.main(Main.scala:61)
	at Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
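
For context: the trace shows the check failing inside HiveExternalCatalog. With a two-part name, sparkSession.catalog.tableExists resolves against the Hive session catalog, which reads the Glue entry as a plain Hive table and trips over the missing InputFormat; as the metadata above shows, an Iceberg table registered in Glue keeps its format in table parameters (table_type, metadata_location) rather than in the StorageDescriptor. A sketch of the distinction, using the names from the job above:

// Resolved by the Hive session catalog: interprets the Glue entry as a Hive
// table and fails, since Iceberg Glue tables set no StorageDescriptor#InputFormat.
sparkSession.catalog.tableExists("mydb.mytable")

// Resolved by the configured Iceberg catalog ("glue_catalog"), which loads the
// table from its metadata_location parameter instead.
sparkSession.table("glue_catalog.mydb.mytable")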

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

t0ma-sz commented, Aug 19, 2022 (1 reaction)

Hello again @singhpk234,

I was able to fix the issue by rewriting the code based on the official AWS docs, and the append now works again. I was not able to fix sparkSession.catalog.tableExists(s"$catalog.$dbName.$tableName"), so I used a workaround that works as expected:

  import scala.util.{Failure, Success, Try}

  def doesTableExistIceberg(spark: SparkSession, table: String): Boolean = {
    println(s"Checking if table $table exists")
    // Try to load the table; a failure (e.g. NoSuchTableException) means it does not exist
    Try(spark.read.table(table)) match {
      case Success(_) => true
      case Failure(e) => println(s"Failed. Reason: $e"); false
    }
  }
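
For illustration, the helper could then replace the earlier tableExists check in the job (a sketch assuming the catalog, dbName, tableName, and df values from the job above):

  val fqTable = s"$catalog.$dbName.$tableName" // three-part identifier, resolved by the Iceberg catalog

  if (doesTableExistIceberg(sparkSession, fqTable)) {
    df.writeTo(fqTable).append()
  } else {
    df.writeTo(fqTable)
      .partitionedBy(col("partition_date"))
      .tableProperty("format-version", "2")
      .create()
  }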

Thanks for your help

t0ma-sz commented, Aug 17, 2022 (1 reaction)

> val tableExists = sparkSession.catalog.tableExists(s"$dbName.$tableName")
>
> I think this is happening because you haven’t specified the catalog name in the identifier of the write, and haven’t made my_catalog the default …
>
> Can you please try once with a 3-part identifier here, i.e. sparkSession.catalog.tableExists(s"$catalog.$dbName.$tableName")?
>
> ref: a ticket describing a similar issue: #2202 (comment)

Thanks for your input. That line is not causing the issue; it is caused by the DataFrameWriterV2.append() method:

df.writeTo(s"$catalog.$dbName.$tableName").append()

Anyway, I changed the line you suggested, including the catalog name in the tableExists call, but received another error. The error does not occur when I downgrade to Iceberg 0.13.0.

Error trace with Iceberg 0.14.0 and the catalog name included in sparkSession.catalog.tableExists(s"$catalog.$dbName.$tableName"):

2022-08-17 12:56:03,376 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '.' expecting {<EOF>, '-'}(line 1, pos 43)

== SQL ==
glue_catalog.mydb.mytable
-----------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:255)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:124)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:48)
	at org.apache.spark.sql.catalyst.parser.extensions.IcebergSparkSqlExtensionsParser.parseTableIdentifier(IcebergSparkSqlExtensionsParser.scala:73)
	at org.apache.spark.sql.internal.CatalogImpl.tableExists(CatalogImpl.scala:251)
	at Main$.main(Main.scala:78)
	at Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
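
The trace explains the 0.14.0 failure: CatalogImpl.tableExists in Spark 3.1 runs its argument through parseTableIdentifier, which accepts at most a two-part database.table name, so the three-part glue_catalog.mydb.mytable cannot be parsed. One alternative (a sketch, not from the thread) is to check existence through SQL, which goes through the full parser and is resolved by the configured Iceberg/Glue catalog:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper: SHOW TABLES is handled by the v2 catalog, so no
// three-part table-identifier parsing is involved.
def tableExistsViaSql(spark: SparkSession, catalog: String, db: String, table: String): Boolean =
  spark.sql(s"SHOW TABLES IN $catalog.$db")
    .filter(col("tableName") === table)
    .count() > 0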

