Materialized tables creation fails on EMR
When I attempt to have dbt run a simple job that results in a materialized Spark table on EMR, I get an error as follows:
> dbt run
Running with dbt=0.13.0
Found 1 models, 0 tests, 0 archives, 0 analyses, 100 macros, 0 operations, 0 seed files, 1 sources
16:25:55 | Concurrency: 1 threads (target='dev')
16:25:55 |
16:25:55 | 1 of 1 START table model data_lake_intgn_test.test_incremental_dt.... [RUN]
16:25:58 | 1 of 1 ERROR creating table model data_lake_intgn_test.test_incremental_dt [ERROR in 2.37s]
16:25:58 |
16:25:58 | Finished running 1 table models in 6.23s.
Completed with 1 errors:
Runtime Error in model data_lake_intgn_current|test_incremental_dt|current (models/data_lake_intgn/current/data_lake_intgn_current|test_incremental_dt|current.sql)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1
If I run the compiled query directly in PySpark on the EMR cluster, I get the same error message (with the following more complete stack trace):
py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
: java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
at org.apache.hadoop.fs.Path.<init>(Path.java:175)
at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:339)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:345)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at org.apache.spark.sql.hive.client.HiveClientImpl.getDatabase(HiveClientImpl.scala:338)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.getDatabase(HiveExternalCatalog.scala:210)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:233)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.defaultTablePath(SessionCatalog.scala:472)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:327)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I run the same query with the addition of a location statement, however, I do not get an error and the table is created successfully - e.g.:
create table [name]
using parquet
location 's3://[bucket]/[path]'
as SELECT * FROM [blah]
Looking at the stack trace, the failure appears to happen when Spark computes a default table path from the database’s location URI, which is empty here. I think the root cause is that Databricks does some behind-the-scenes magic with default locations / managed tables / DBFS, which doesn’t exist on more vanilla Spark, at least in the context of EMR. It’s possible that fiddling with some Spark configs could mitigate this, but in general I’d think that specifying an S3 path for a table is a fairly normal thing to want to do.
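For what it’s worth, one workaround consistent with that reading is to give the schema an explicit location in the metastore up front, so Spark can derive a valid default path for managed tables. A minimal sketch, using a placeholder bucket and path (not taken from the setup above):

-- hypothetical: create the target database with an explicit S3 location
create database if not exists data_lake_intgn_test
location 's3://my-bucket/prod/data_lake_intgn_test/';

With the database location set, the plain create table ... as select should resolve its default path underneath that location instead of failing on an empty string.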
There are a couple of approaches that occur to me for dealing with this, which could probably be combined into a default / override kind of situation (a rough sketch of the second one follows the list):
- Set a ‘root’ location in your dbt_project.yml and have dbt format model names into it, i.e. set the root to s3://my-bucket/prod/models/ and have model_1 get automatically put into s3://my-bucket/prod/models/model_1/, model_2 automatically go into s3://my-bucket/prod/models/model_2/, and so on.
- Set a specific table location at the model level via config, so you can arbitrarily place model_1 at s3://bucket-a/some-model and model_2 at s3://bucket-b/some-model-also.
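As a sketch of the second option, a per-model config could look roughly like this (the config name location_root is borrowed from the later comment below; the path, model name, and exact semantics are illustrative, not an option in the dbt version used above):

-- hypothetical model file, e.g. models/data_lake_intgn/current/model_1.sql
{{ config(
    materialized='table',
    location_root='s3://bucket-a/some-model'
) }}

select * from some_upstream_table

The first option would amount to setting the same kind of root once under models: in dbt_project.yml and having dbt append each model name to it.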
Agreed @aaronsteers @Dandandan. I’m going to close this issue, given that those two features will ship in the next release.
For me, an older version was not working, but dbt-spark master plus a location already provided in the Glue data catalog (or created from dbt via schema creation) does work. Adding location_root support makes it much more flexible where individual tables are stored on EMR.