Tagged types based on anything other than AnyVal produce an exception in Spark
See original GitHub issue. A good explanation of the issue, with examples, is here: https://stackoverflow.com/questions/66377920/how-fix-issues-with-spark-and-shapeless-tagged-type-based-on-string There have been no responses. I also posed the question in the Gitter Shapeless channel with no responses.
Basically, any case class that uses a tagged type like `type Foo = Int @@ FooTag` works just fine with Spark Datasets. But if I use `type Foo = String @@ FooTag`, it fails with the exception `java.lang.ClassNotFoundException: no Java class corresponding to <refinement of java.lang.String with shapeless.tag.Tagged[FooTag]> found`.
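For reference, the tagging encoding in question can be reproduced in a few lines of plain Scala. This is a minimal sketch of what `shapeless.tag` does (the real implementation lives in shapeless), which makes it visible why the static type of a tagged `String` is a refinement type:

```scala
object Tagging {
  // Minimal re-implementation of shapeless-style tagging, for illustration.
  trait Tagged[U]
  type @@[+T, U] = T with Tagged[U]

  // Tags exist only at the type level, so the cast is safe at runtime.
  def tag[U]: Tagger[U] = new Tagger[U]
  final class Tagger[U] {
    def apply[T](t: T): T @@ U = t.asInstanceOf[T @@ U]
  }
}

object TaggedDemo {
  import Tagging._

  trait FooTag
  type IntFoo    = Int @@ FooTag
  type StringFoo = String @@ FooTag

  val i: IntFoo    = tag[FooTag](42)
  val s: StringFoo = tag[FooTag]("hello")

  // At runtime both values are just their base representations, but the
  // static type of `s` is the refinement String with Tagged[FooTag] --
  // exactly the type Spark's reflection cannot map to a Java class.
}
```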
Not sure if this is a bug in Shapeless or a limitation of Spark. Is there any kind of workaround? Or am I limited to things like Int, Long, Double, and Boolean as the base type?
I created a custom Spark UDT for `java.util.UUID`, and it works great. But when I use tagging on the UUID, same issue.
Thank you! Any guidance would be greatly appreciated.
Issue Analytics
- Created 3 years ago
- Comments: 12
Here is a repo that demonstrates the issue: https://github.com/DCameronMauch/TaggedType
If you change the method called by `main` to the Int version, you can see that it works just fine. Long as the base type also works.
The issue is with Spark SQL - it doesn't work with refined types (which is what `@@` translates to). Here is the offending code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L88-L112

It works for primitives because they are special-cased in `dataTypeFor` with `isSubtype` - so subtypes of primitives are basically treated as primitives. Unfortunately, Spark is not very good at offering extension points, and I don't think you can define a custom `DataType` for `@@`. But if you consider using https://github.com/typelevel/frameless it does let you define custom encoders: http://typelevel.org/frameless/Injection.html