
Materialized tables creation fails on EMR

See original GitHub issue

When I attempt to have dbt run a simple job that results in a materialized Spark table on EMR, I get the following error:

> dbt run
Running with dbt=0.13.0
Found 1 models, 0 tests, 0 archives, 0 analyses, 100 macros, 0 operations, 0 seed files, 1 sources

16:25:55 | Concurrency: 1 threads (target='dev')
16:25:55 |
16:25:55 | 1 of 1 START table model data_lake_intgn_test.test_incremental_dt.... [RUN]
16:25:58 | 1 of 1 ERROR creating table model data_lake_intgn_test.test_incremental_dt [ERROR in 2.37s]
16:25:58 |
16:25:58 | Finished running 1 table models in 6.23s.

Completed with 1 errors:

Runtime Error in model data_lake_intgn_current|test_incremental_dt|current (models/data_lake_intgn/current/data_lake_intgn_current|test_incremental_dt|current.sql)
  java.lang.IllegalArgumentException: Can not create a Path from an empty string

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

If I run the compiled query directly in PySpark on the EMR cluster, I get the same error message (with the following more complete stack trace):

py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
	at org.apache.hadoop.fs.Path.<init>(Path.java:175)
	at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:339)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:345)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getDatabase(HiveClientImpl.scala:338)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.getDatabase(HiveExternalCatalog.scala:210)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:233)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.defaultTablePath(SessionCatalog.scala:472)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:327)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
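
For context, a minimal sketch of how the compiled query can be reproduced outside dbt from the EMR master node; the schema, table, and source names below are placeholders mirroring the run output above, not the actual compiled SQL:

from pyspark.sql import SparkSession

# Hive support is needed so the CTAS goes through the Hive metastore,
# which is where the empty database location comes from.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder CTAS resembling what dbt compiles for a table materialization;
# no LOCATION is given, so Spark asks the metastore for a default path.
spark.sql("""
    create table data_lake_intgn_test.test_incremental_dt
    using parquet
    as select * from some_source_table  -- placeholder source
""")
# fails with: java.lang.IllegalArgumentException: Can not create a Path from an empty string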

If I run the same query with a LOCATION clause added, however, I do not get an error and the table is created successfully, e.g.:

create table [name]
using parquet
location 's3://[bucket]/[path]'
as select * from [blah]

I think the root cause of this is that Databricks does some behind-the-scenes magic with default locations / managed tables / DBFS, which doesn’t work on more vanilla Spark, at least in the context of EMR. It’s possible that fiddling with some Spark configs could mitigate this (see the sketch just below), but in general I’d think that specifying an S3 path for a table is a fairly normal thing to want to do.
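
As a rough illustration of the config-based mitigation, one could point the Spark SQL warehouse, or the target schema itself, at an explicit S3 prefix so that defaultTablePath has something to resolve; the bucket and prefix names here are placeholder assumptions, not taken from the issue:

from pyspark.sql import SparkSession

# assumption: an S3 prefix we control; any writable bucket/prefix would do
spark = (
    SparkSession.builder
    .config("spark.sql.warehouse.dir", "s3://my-bucket/warehouse/")
    .enableHiveSupport()
    .getOrCreate()
)

# alternatively, give the target schema an explicit location in the metastore,
# which tables created without a LOCATION would then inherit
spark.sql(
    "create database if not exists data_lake_intgn_test "
    "location 's3://my-bucket/prod/models/'"
)

Both of these are cluster-side workarounds, though; the dbt-side options below feel cleaner.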

There are a couple of approaches that occur to me for dealing with this, which could probably be combined into a default / override kind of situation (a hypothetical config sketch follows the list):

  • Set a ‘root’ location in your dbt_project.yml, and have dbt format model names into it, i.e. set root to be s3://my-bucket/prod/models/ and have model_1 get automatically put into s3://my-bucket/prod/models/model_1/, model_2 automatically go into s3://my-bucket/prod/models/model_2/, and so on.
  • Set a specific table location at the table level via config, so you can arbitrarily place model_1 at s3://bucket-a/some-model and model_2 at s3://bucket-b/some-model-also.
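
As a hypothetical sketch of the first (project-level) option, using the location_root name discussed in the comments below, dbt_project.yml could look something like this; the project and folder names are illustrative assumptions, not taken from the issue:

models:
  my_project:
    data_lake_intgn:
      location_root: "s3://my-bucket/prod/models/"  # model_1 would land at s3://my-bucket/prod/models/model_1/

The table-level override would then be the same setting supplied from a config() block inside the individual model file.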

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
jtcohen6 commented, Feb 7, 2020

Agreed @aaronsteers @Dandandan. I’m going to close this issue, given that those two features will ship in the next release.

0 reactions
Dandandan commented, Feb 7, 2020

For me, an older version was not working; dbt-spark master, with a location already provided in the Glue Data Catalog (or created from dbt via schema creation), does work. Adding location_root support makes it much more flexible where individual tables are stored on EMR.
