[SUPPORT] Hdfsparquetimport tool is not picking up rowKeyField and partitionPathField in the correct format
Describe the problem you faced
I am trying to use the hdfsparquetimport tool available within hudi-cli to bootstrap a table into Hudi. The table to be bootstrapped is in parquet file format. While doing so, I am facing multiple issues due to a schema mismatch between what is in the parquet file and what is defined in the Avro schema file.
Note: Tests are run in AWS EMR environment.
Avro schema definition
{ "name": "bootstraptest", "type": "record", "fields": [ { "name": "CREATEDBY", "type": "string" }, { "name": "ID", "type": "int" }, { "name": "CLIENT_ID", "type": "int" } ] }
When I load the parquet file into a Spark dataframe and inspect the schema, it looks like below:
|-- CREATEDBY: string (nullable = true)
|-- ID: decimal(12,0) (nullable = true)
|-- CLIENT_ID: decimal(12,0) (nullable = true)
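For reference, the schema above was printed from spark-shell roughly like below (the source path is the same one used in the import command):

// Inspect the schema of the source parquet files in spark-shell.
val df = spark.read.parquet("s3://test/imported_data/TESTBOOTSTRAP/")
df.printSchema()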
Command used:
hdfsparquetimport --upsert false --srcPath s3://test/imported_data/TESTBOOTSTRAP/ --targetPath s3://test/hudi_converted/ --tableName TESTBOOTSTRAP --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 1500 --schemaFilePath s3://test/scripts/test.avsc --format parquet --sparkMemory 6g --retry 1
I have tried both int and long datatypes in the Avro schema for the decimal columns. Both result in one of the following errors.
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file .......
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldIntegerConverter
OR
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
I have also tried defining an Avro logicalType for the decimal columns, as suggested in a few blogs:
{ "name": "bootstraptest", "type": "record", "fields": [ { "name": "CREATEDBY", "type": "string" }, { "name": "ID", "type": { "type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 0 } }, { "name": "CLIENT_ID", "type": { "type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 0 } } ] }
This time the data is generated, but the partition path is in a weird format.
I have also tried to reproduce this on my local machine and see the same behaviour.
hdfsparquetimport --upsert false --srcPath /Users/user1/hudi-res/parquet_data/ --targetPath /Users/user1/hudi-res/hudi_converted/ --tableName BOOTSTRAPTEST --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 50 --schemaFilePath /Users/user1/hudi-res/schema/test.avsc --format parquet --sparkMemory 2G --retry 1 --sparkMaster local
Expected behavior
What I am trying is very basic functionality expected of the hdfsparquetimport tool: converting table files in parquet format to Hudi format. Please let me know if something is wrong with my config/command.
Environment Description
- Hudi version : 0.10.1
- Spark version : 3.2.0
- Storage : S3
Top GitHub Comments
The import tool was deprecated. It is recommended to use the bootstrap feature instead. See https://hudi.apache.org/blog/2020/08/20/efficient-migration-of-large-parquet-tables
@cajil can you please migrate to use bootstrap instead?
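For reference, a minimal sketch of what the bootstrap route could look like through the Spark datasource, reusing the paths, record key field and partition path field from the command above; the option keys assume Hudi 0.10.x and should be verified against the configuration docs for your version:

import org.apache.spark.sql.SaveMode

// Bootstrap the existing parquet table into a Hudi table
// (run from spark-shell with the Hudi Spark bundle on the classpath).
// Paths and field names mirror the original command; option keys are
// assumed to match Hudi 0.10.x and should be double-checked.
spark.emptyDataFrame.write
  .format("hudi")
  .option("hoodie.table.name", "TESTBOOTSTRAP")
  .option("hoodie.datasource.write.operation", "bootstrap")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "ID")
  .option("hoodie.datasource.write.partitionpath.field", "CLIENT_ID")
  .option("hoodie.bootstrap.base.path", "s3://test/imported_data/TESTBOOTSTRAP/")
  .option("hoodie.bootstrap.keygen.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
  .mode(SaveMode.Overwrite)
  .save("s3://test/hudi_converted/")

By default this performs a metadata-only bootstrap, so the original parquet files stay in place and only Hudi metadata is generated alongside them, as described in the linked blog post.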