presto - querying nested object in parquet file created by hudi
Describe the problem you faced
Using an AWS EMR Spark job to create a Hudi parquet record in S3 from a Kinesis stream. Querying this record from Presto is fine, but I can’t seem to query a nested column.
Update: from my further investigation, I think not being able to query nested objects, or to use select * from ..., is just a symptom of taking an array object off a Kinesis stream and saving it using Hudi.
To Reproduce
Steps to reproduce the behavior:
- Spark job reads from the Kinesis stream and saves a Hudi file to S3
- AWS Glue crawler creates a database from the record
- Log into AWS EMR with presto installed
- run
presto-cli --catalog hive --schema schema --server server:8889
- queries:
works without nesting
presto:schema> select id from default;
id
----------
34551832
(1 row)
Query 20200211_212022_00055_hej8h, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [1 rows, 93B] [1 rows/s, 179B/s]
query that doesn’t work with nesting
presto:schema> select id, order.channel from default;
Query 20200211_212107_00056_hej8h failed: line 1:12: mismatched input 'order'. Expecting: '*', <expression>, <identifier>
select id, order.channel from default
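The parse error is likely because order is a reserved keyword in Presto SQL, so it cannot appear unquoted as a column identifier. Quoting it with double quotes should get the query past the parser (a sketch, assuming the table and column names from the output above):

```sql
-- "order" is a reserved word in Presto; quote it to use it as an identifier
SELECT id, "order".channel FROM default;
```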
table structure
presto:data-lake-database-dev-adam-8> show columns from default;
         Column         | Type
------------------------+------------------------------------------------------------
_hoodie_commit_time | varchar
_hoodie_commit_seqno | varchar
_hoodie_record_key | varchar
_hoodie_partition_path | varchar
_hoodie_file_name | varchar
eventtimestamp | varchar
id | bigint
order | row(channel varchar, customer row(address row(country varchar, postcode varchar, region varchar), birthdate varchar, createddate varchar, email varchar, firstname varchar, id bigi
(11 rows)
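Given the row type above, nested fields should be reachable with dot notation once the reserved top-level name order is double-quoted (a sketch based on the schema shown; field names beyond the truncated output are assumed):

```sql
-- drill into the nested row(...) column; "order" must be quoted because it is reserved
SELECT
  id,
  "order".channel,
  "order".customer.address.country
FROM default;
```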
Deploy script
aws emr add-steps --cluster-id j-xxxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.hudi:hudi-spark-bundle:0.5.0-incubating\',\
--jars,\'/usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/spark/external/lib/spark-streaming-kinesis-asl-assembly.jar\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
--conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
--conf,spark.sql.hive.convertMetastoreParquet=false,\
--class,ScalaStream,\
s3://xxx.xxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE
sbt file
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle" % "0.5.0-incubating"
scalacOptions := Seq("-unchecked", "-deprecation")
AWS Glue
In order for the AWS Glue crawler to identify the ‘default’ directory where Hudi has placed the data, I have had to add exclusions to the crawler. They are:
**/.hoodie_partition_metadata
**/default_$folder$
**/.hoodie_$folder$
**/.hoodie/hoodie.properties
Expected behavior
Nested row object to be output in the query result.
Environment Description
- Hudi version : hudi-spark-bundle:0.5.0-incubating (with org.apache.spark:spark-avro_2.11:2.4.4)
- Spark version : 2.4.4
- Hive version : Hive 2.3.6
- Pig version : 0.17.0
- Presto version : 0.227
- Hadoop version : Amazon 2.8.5
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Issue Analytics
- State:
- Created 4 years ago
- Comments: 20 (12 by maintainers)
Top GitHub Comments
@adamjoneill apologies for the delayed response. Haven’t gotten a chance to look at this thread. Let me also try and reproduce this and get back soon.
Thanks @adamjoneill, let me try to reproduce as well and see what’s going on tonight.