[SUPPORT] Hdfsparquetimport tool is not picking up rowKeyField and partitionPathField in the correct format
Describe the problem you faced
I am trying to use the hdfsparquetimport tool available within hudi-cli to bootstrap a table into Hudi. The table to be bootstrapped is in parquet file format. While doing so, I am facing multiple issues due to a schema mismatch between what is in the parquet file and what is defined in the Avro schema file.
Note: Tests are run in AWS EMR environment.
Avro schema definition
{ "name": "bootstraptest", "type": "record", "fields": [ { "name": "CREATEDBY", "type": "string" }, { "name": "ID", "type": "int" }, { "name": "CLIENT_ID", "type": "int" } ] }
When I load the parquet file into a Spark dataframe and inspect the schema, it looks like below:
|-- CREATEDBY: string (nullable = true)
|-- ID: decimal(12,0) (nullable = true)
|-- CLIENT_ID: decimal(12,0) (nullable = true)
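For reference, the schema above was printed from spark-shell roughly like below (the source path is the same one used in the import command):

// Inspect the schema of the source parquet files in spark-shell.
val df = spark.read.parquet("s3://test/imported_data/TESTBOOTSTRAP/")
df.printSchema()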
Command used:
hdfsparquetimport --upsert false --srcPath s3://test/imported_data/TESTBOOTSTRAP/ --targetPath s3://test/hudi_converted/ --tableName TESTBOOTSTRAP --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 1500 --schemaFilePath s3://test/scripts/test.avsc --format parquet --sparkMemory 6g --retry 1
I have tried both int and long datatypes in the Avro schema for the decimal columns. Both result in one of the following errors.
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file .......
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldIntegerConverter
OR
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
I have also tried defining an Avro logicalType for the decimal columns, as suggested in a few blogs:
{ "name": "bootstraptest", "type": "record", "fields": [ { "name": "CREATEDBY", "type": "string" }, { "name": "ID", "type": { "type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 0 } }, { "name": "CLIENT_ID", "type": { "type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 0 } } ] }
This time the data is generated, but the partition path is in a weird format.
I have also tried to reproduce this on my local machine and see the same behaviour.
hdfsparquetimport --upsert false --srcPath /Users/user1/hudi-res/parquet_data/ --targetPath /Users/user1/hudi-res/hudi_converted/ --tableName BOOTSTRAPTEST --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 50 --schemaFilePath /Users/user1/hudi-res/schema/test.avsc --format parquet --sparkMemory 2G --retry 1 --sparkMaster local
Expected behavior
What I am trying is very basic functionality expected of the hdfsparquetimport tool: converting table files in parquet format to Hudi format. Please let me know if something is wrong with my config/command.
Environment Description
- Hudi version : 0.10.1
- Spark version : 3.2.0
- Storage : S3
Top GitHub Comments
The import tool was deprecated. It is recommended to use the bootstrap feature instead. See https://hudi.apache.org/blog/2020/08/20/efficient-migration-of-large-parquet-tables
@cajil can you please migrate to use bootstrap instead?
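For reference, a minimal sketch of what the bootstrap route could look like through the Spark datasource, reusing the paths, record key field and partition path field from the command above; the option keys assume Hudi 0.10.x and should be verified against the configuration docs for your version:

import org.apache.spark.sql.SaveMode

// Bootstrap the existing parquet table into a Hudi table
// (run from spark-shell with the Hudi Spark bundle on the classpath).
// Paths and field names mirror the original command; option keys are
// assumed to match Hudi 0.10.x and should be double-checked.
spark.emptyDataFrame.write
  .format("hudi")
  .option("hoodie.table.name", "TESTBOOTSTRAP")
  .option("hoodie.datasource.write.operation", "bootstrap")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "ID")
  .option("hoodie.datasource.write.partitionpath.field", "CLIENT_ID")
  .option("hoodie.bootstrap.base.path", "s3://test/imported_data/TESTBOOTSTRAP/")
  .option("hoodie.bootstrap.keygen.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
  .mode(SaveMode.Overwrite)
  .save("s3://test/hudi_converted/")

By default this performs a metadata-only bootstrap, so the original parquet files stay in place and only Hudi metadata is generated alongside them, as described in the linked blog post.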