
[SUPPORT] unhelpful error message when there are parquets outside table base path

See original GitHub issue
Using Hudi (hoodie) 0.4.6 and Spark 2.3.4.

Run the following in HiveServer2 (v2.3.4):

CREATE EXTERNAL TABLE `someschema.mytbl`(
  col1 string,
  col2 string,
  col3 string)
PARTITIONED BY (
  `mydate` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://redact/M5/table/mytbl';
  
Use Spark to create COW Hudi parquet files under s3://redact/M5/table/mytbl/2016/11/07/ and s3://redact/M/table/mytbl/2019/12/01/ (note that the second location, under M/, sits outside the table's base path under M5/).
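
A minimal sketch of what that Spark write might look like (hypothetical: df and the record-key/precombine columns are assumptions; the format name and option keys are the com.uber.hoodie 0.4.x datasource ones):

import org.apache.spark.sql.SaveMode

// Hypothetical Hudi COW write; df, col1 and col2 are assumed stand-ins.
df.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "mytbl")
  .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "col1")       // assumed record key
  .option("hoodie.datasource.write.precombine.field", "col2")      // assumed precombine field
  .option("hoodie.datasource.write.partitionpath.field", "mydate") // partition path column
  .mode(SaveMode.Append)
  .save("s3a://redact/M5/table/mytbl")

If mydate carries values like 2016/11/07, the partition-path field would produce the nested date folders shown in the paths above.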
  
Then run the following in HiveServer2:

ALTER TABLE someschema.mytbl ADD IF NOT EXISTS PARTITION(mydate='2016-11-07')
LOCATION 's3a://redact/M5/table/mytbl/2016/11/07/';
ALTER TABLE someschema.mytbl ADD IF NOT EXISTS PARTITION(mydate='2019-12-01')
LOCATION 's3a://redact/M/table/mytbl/2019/12/01/';
  
  
The Hive metastore now shows these two rows:
  
SELECT TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
FROM SDS, TBLS, PARTITIONS
WHERE PARTITIONS.SD_ID = SDS.SD_ID
  AND TBLS.TBL_ID = PARTITIONS.TBL_ID
  AND TBLS.TBL_NAME = 'mytbl'
ORDER BY 1, 2;


mytbl   mydate=2016-11-07   s3a://redact/M5/table/mytbl/2016/11/07
mytbl   mydate=2019-12-01   s3a://redact/M/table/mytbl/2019/12/01

Query 1:

select count(1) from someschema.mytbl where mydate = '2016-11-07'

This works fine from both HiveServer2 and Presto.

Query 2:

select count(1) from someschema.mytbl where mydate = '2019-12-01'

Presto gives an unhelpful error:

io.prestosql.spi.PrestoException: HIVE_UNKNOWN_ERROR
	at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:223)
	at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
	at io.prestosql.$gen.Presto_ff748c3_dirty____20200610_171635_2.run(Unknown Source)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2


HiveServer2 gives a more verbose, yet still not very helpful, error:
2020-06-12T18:22:23,375  WARN [HiveServer2-Handler-Pool: Thread-12109] thrift.ThriftCLIService: Error fetching results:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 2
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:499) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:307) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:878) ~[hive-service-2.3.4.jar:2.3.4]
        at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) ~[hive-service-2.3.4.jar:2.3.4]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_252]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_252]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) ~[hadoop-common-2.8.5.jar:?]
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) ~[hive-service-2.3.4.jar:2.3.4]
        at com.sun.proxy.$Proxy42.fetchResults(Unknown Source) ~[?:?]
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:559) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:751) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1717) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1702) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) ~[hive-service-2.3.4.jar:2.3.4]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[hive-exec-2.3.4.jar:2.3.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 2
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:494) ~[hive-service-2.3.4.jar:2.3.4]
        ... 24 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
        at com.uber.hoodie.common.util.FSUtils.getCommitTime(FSUtils.java:120) ~[hoodiebundle.jar:?]
        at com.uber.hoodie.common.model.HoodieDataFile.getCommitTime(HoodieDataFile.java:37) ~[hoodiebundle.jar:?]
        at com.uber.hoodie.common.model.HoodieFileGroup.addDataFile(HoodieFileGroup.java:89) ~[hoodiebundle.jar:?]
        at com.uber.hoodie.common.table.view.HoodieTableFileSystemView.lambda$null$3(HoodieTableFileSystemView.java:155) ~[hoodiebundle.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_252]
        at com.uber.hoodie.common.table.view.HoodieTableFileSystemView.lambda$addFilesToView$5(HoodieTableFileSystemView.java:155) ~[hoodiebundle.jar:?]
        at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_252]
        at com.uber.hoodie.common.table.view.HoodieTableFileSystemView.addFilesToView(HoodieTableFileSystemView.java:151) ~[hoodiebundle.jar:?]
        at com.uber.hoodie.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:107) ~[hoodiebundle.jar:?]
        at com.uber.hoodie.hadoop.HoodieInputFormat.listStatus(HoodieInputFormat.java:88) ~[hoodiebundle.jar:?]
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:372) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:304) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208) ~[hive-exec-2.3.4.jar:2.3.4]
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:494) ~[hive-service-2.3.4.jar:2.3.4]
        ... 24 more

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
tooptoop4 commented, Jul 7, 2020

PrestoSQL 336 with Hudi 0.5.3 gives a better error:

io.prestosql.spi.PrestoException: Index 2 out of bounds for length 1
	at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:234)
	at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
	at io.prestosql.$gen.Presto_1c5b75e_dirty____20200705_204556_2.run(Unknown Source)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 1
	at org.apache.hudi.common.util.FSUtils.getCommitTime(FSUtils.java:137)
	at org.apache.hudi.common.model.HoodieBaseFile.getCommitTime(HoodieBaseFile.java:55)
	at org.apache.hudi.common.model.HoodieFileGroup.addBaseFile(HoodieFileGroup.java:86)
	at java.base/java.util.ArrayList.forEach(Unknown Source)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$buildFileGroups$4(AbstractTableFileSystemView.java:161)
	at java.base/java.lang.Iterable.forEach(Unknown Source)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.buildFileGroups(AbstractTableFileSystemView.java:157)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.buildFileGroups(AbstractTableFileSystemView.java:135)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:115)
	at org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:120)
	at org.apache.hudi.hadoop.HoodieParquetInputFormat.filterFileStatusForSnapshotMode(HoodieParquetInputFormat.java:239)
	at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:110)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
	at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:428)
	at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:298)
	at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:227)
	... 6 more

After adding a log statement for fullFileName, I see the value is part-00007-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet, while for a table that can be queried, fullFileName is 4b37466c-8b75-458e-ba28-1e0f4c350dbe_0_20200324151845.parquet.
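
The exception matches how FSUtils.getCommitTime extracts the instant time from a Hudi base-file name, which looks roughly like <fileId>_<writeToken>_<instantTime>.parquet. A small Scala sketch of that parsing, mirroring the split-by-underscore logic implied by the trace (an approximation, not the exact Hudi source):

// Hudi base files are named roughly <fileId>_<writeToken>_<instantTime>.parquet;
// the commit time is recovered by splitting on '_' and taking index 2.
def getCommitTime(fullFileName: String): String =
  fullFileName.split("_")(2).split("\\.")(0)

getCommitTime("4b37466c-8b75-458e-ba28-1e0f4c350dbe_0_20200324151845.parquet")
// => "20200324151845" (three '_'-separated tokens, so index 2 exists)

getCommitTime("part-00007-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet")
// => ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 1,
//    since a plain Spark part-file name contains no '_' and the split yields one token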

S3 listing under the partition folder of the table that works (there is a .hoodie/ folder under the table base path):

2020-03-24 15:18:55       93 .hoodie_partition_metadata
2020-03-24 15:18:57  2194374 4b37466c-8b75-458e-ba28-1e0f4c350dbe_0_20200324151845.parquet

S3 listing under the partition folder of the table that gets the error (there is also a .hoodie/ folder under the table base path):

2020-03-24 15:18:44        0 _SUCCESS
2020-03-24 15:18:37 10649992 part-00000-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:38  8787785 part-00001-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:39  9562198 part-00002-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:40  9359329 part-00003-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:41 10519118 part-00004-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:42 10452807 part-00005-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:42  9104366 part-00006-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet
2020-03-24 15:18:43  9016423 part-00007-75dea991-eba7-4fb1-801c-af264bb5bfc3-c000.snappy.parquet

UPDATE: This is a really old table that got corrupted along the way. After removing the .hoodie/ folder, the select works OK.
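
For anyone diagnosing the same symptom, a small hypothetical helper (not part of Hudi; plain Hadoop FileSystem API) can flag parquet files under a table path whose names do not follow the Hudi base-file pattern, which is exactly what the file-system view trips over:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: list .parquet files under basePath whose names do not
// look like Hudi base files (<fileId>_<writeToken>_<instantTime>.parquet).
def nonHudiParquets(basePath: String): Seq[Path] = {
  val fs  = FileSystem.get(new URI(basePath), new Configuration())
  val it  = fs.listFiles(new Path(basePath), true) // recursive listing
  val bad = scala.collection.mutable.ArrayBuffer.empty[Path]
  while (it.hasNext) {
    val p = it.next().getPath
    val looksHudi = p.getName.matches("""[0-9a-f\-]+_[0-9\-]+_\d+\.parquet""")
    if (p.getName.endsWith(".parquet") && !looksHudi) bad += p
  }
  bad.toSeq
}

Run against s3a://redact/M/table/mytbl, this would flag the eight part-0000x-...-c000.snappy.parquet files in the listing above.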

0 reactions
vinothchandar commented, Jul 7, 2020

yes… makes sense… closing this issue

Top Results From Across the Web

Solved: external table stored as parquet - can not use fie...
We have parquet fields with relatively deep nested structure (up to 4-5 levels) and map them to external tables in hive/impala.

Unable to infer schema when loading Parquet file
This error usually occurs when you try to read an empty directory as parquet. Probably your outcome Dataframe is empty.

Why do I always get an error on querying the Parquet table
The issue can happen if the Hive syntax for table creation is used instead of the Spark syntax. Read more here: https://docs.databricks.com/spark/latest/spark ...

Using the Parquet File Format with Impala Tables
Parquet files produced outside of Impala must write column data in the same order as the columns are declared in the Impala table....

Troubleshoot the Parquet format connector - Azure Data ...
No enum constant · Symptoms: Error message occurred when you copy data to Parquet format: java.lang. · Cause: The issue could be caused...
