
Materialized tables creation fails on EMR

See original GitHub issue

When I attempt to have dbt run a simple job that results in a materialized Spark table on EMR, I get the following error:

> dbt run
Running with dbt=0.13.0
Found 1 models, 0 tests, 0 archives, 0 analyses, 100 macros, 0 operations, 0 seed files, 1 sources

16:25:55 | Concurrency: 1 threads (target='dev')
16:25:55 |
16:25:55 | 1 of 1 START table model data_lake_intgn_test.test_incremental_dt.... [RUN]
16:25:58 | 1 of 1 ERROR creating table model data_lake_intgn_test.test_incremental_dt [ERROR in 2.37s]
16:25:58 |
16:25:58 | Finished running 1 table models in 6.23s.

Completed with 1 errors:

Runtime Error in model data_lake_intgn_current|test_incremental_dt|current (models/data_lake_intgn/current/data_lake_intgn_current|test_incremental_dt|current.sql)
  java.lang.IllegalArgumentException: Can not create a Path from an empty string

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

If I run the compiled query directly in PySpark on the EMR cluster, I get the same error message (with the following more complete stack trace):

py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
	at org.apache.hadoop.fs.Path.<init>(Path.java:175)
	at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:339)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:345)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getDatabase(HiveClientImpl.scala:338)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.getDatabase(HiveExternalCatalog.scala:210)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:233)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.defaultTablePath(SessionCatalog.scala:472)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:327)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
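
For context, a minimal sketch of how the compiled query can be reproduced outside dbt from the EMR master node; the schema, table, and source names below are placeholders mirroring the run output above, not the actual compiled SQL:

from pyspark.sql import SparkSession

# Hive support is needed so the CTAS goes through the Hive metastore,
# which is where the empty database location comes from.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder CTAS resembling what dbt compiles for a table materialization;
# no LOCATION is given, so Spark asks the metastore for a default path.
spark.sql("""
    create table data_lake_intgn_test.test_incremental_dt
    using parquet
    as select * from some_source_table  -- placeholder source
""")
# fails with: java.lang.IllegalArgumentException: Can not create a Path from an empty string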

If I run the same query with a LOCATION clause added, however, I do not get an error and the table is created successfully, e.g.:

create table [name]
using parquet
location 's3://[bucket]/[path]'
as select * from [blah]

I think the root cause of this is that Databricks does some behind-the-scenes magic with default locations / managed tables / DBFS, which doesn’t work on more vanilla Spark, at least in the context of EMR. It’s possible that fiddling with some Spark configs could mitigate this (see the sketch just below), but in general I’d think that specifying an S3 path for a table is a fairly normal thing to want to do.
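
As a rough illustration of the config-based mitigation, one could point the Spark SQL warehouse, or the target schema itself, at an explicit S3 prefix so that defaultTablePath has something to resolve; the bucket and prefix names here are placeholder assumptions, not taken from the issue:

from pyspark.sql import SparkSession

# assumption: an S3 prefix we control; any writable bucket/prefix would do
spark = (
    SparkSession.builder
    .config("spark.sql.warehouse.dir", "s3://my-bucket/warehouse/")
    .enableHiveSupport()
    .getOrCreate()
)

# alternatively, give the target schema an explicit location in the metastore,
# which tables created without a LOCATION would then inherit
spark.sql(
    "create database if not exists data_lake_intgn_test "
    "location 's3://my-bucket/prod/models/'"
)

Both of these are cluster-side workarounds, though; the dbt-side options below feel cleaner.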

There are a couple of approaches that occur to me for dealing with this, which could probably be combined into a default / override kind of situation (a hypothetical config sketch follows the list):

  • Set a ‘root’ location in your dbt_project.yml, and have dbt format model names into it, i.e. set root to be s3://my-bucket/prod/models/ and have model_1 get automatically put into s3://my-bucket/prod/models/model_1/, model_2 automatically go into s3://my-bucket/prod/models/model_2/, and so on.
  • Set a specific table location at the table level via config, so you can arbitrarily place model_1 at s3://bucket-a/some-model and model_2 at s3://bucket-b/some-model-also.
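
As a hypothetical sketch of the first (project-level) option, using the location_root name discussed in the comments below, dbt_project.yml could look something like this; the project and folder names are illustrative assumptions, not taken from the issue:

models:
  my_project:
    data_lake_intgn:
      location_root: "s3://my-bucket/prod/models/"  # model_1 would land at s3://my-bucket/prod/models/model_1/

The table-level override would then be the same setting supplied from a config() block inside the individual model file.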

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
jtcohen6 commented, Feb 7, 2020

Agreed @aaronsteers @Dandandan. I’m going to close this issue, given that those two features will ship in the next release.

0 reactions
Dandandan commented, Feb 7, 2020

For me, an older version was not working; dbt-spark master, with a location already provided in the Glue Data Catalog (or created from dbt via schema creation), does work. Adding location_root support makes it much more flexible where individual tables are stored on EMR.
