
[SUPPORT] hdfsparquetimport tool is not picking rowKeyField and partitionPathField in the correct format

Describe the problem you faced

I am trying to use the hdfsparquetimport tool, available within hudi-cli, to bootstrap a table to Hudi. The table to be bootstrapped is in Parquet file format. While doing so, I am running into multiple issues caused by a schema mismatch between what is in the Parquet file and what is defined in the Avro schema file.

Note: tests were run in an AWS EMR environment.

Avro schema definition:

{
  "name": "bootstraptest",
  "type": "record",
  "fields": [
    { "name": "CREATEDBY", "type": "string" },
    { "name": "ID", "type": "int" },
    { "name": "CLIENT_ID", "type": "int" }
  ]
}

When I load the Parquet file into a Spark DataFrame and inspect the schema, it looks like this:

 |-- CREATEDBY: string (nullable = true)
 |-- ID: decimal(12,0) (nullable = true)
 |-- CLIENT_ID: decimal(12,0) (nullable = true)

Command used:

hdfsparquetimport --upsert false --srcPath s3://test/imported_data/TESTBOOTSTRAP/ --targetPath s3://test/hudi_converted/ --tableName TESTBOOTSTRAP --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 1500 --schemaFilePath s3://test/scripts/test.avsc --format parquet --sparkMemory 6g --retry 1

I have tried both int and long datatypes in the Avro schema for the decimal columns. Both result in the following error:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file .......
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldIntegerConverter

or

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
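One way to work around this converter mismatch, while keeping the flat int/long Avro schema, would be to rewrite the source Parquet with the decimal columns cast to long before running the import. A minimal PySpark sketch, assuming the source path from the command above and a hypothetical staging path for the rewritten copy:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cast-decimals-for-import").getOrCreate()

# Source export where ID and CLIENT_ID are decimal(12,0)
src = spark.read.parquet("s3://test/imported_data/TESTBOOTSTRAP/")

# Cast the decimal columns to long so they match an Avro schema that declares them as "long"
casted = (src
          .withColumn("ID", col("ID").cast("long"))
          .withColumn("CLIENT_ID", col("CLIENT_ID").cast("long")))

# Hypothetical staging location; --srcPath would then point at this copy
casted.write.mode("overwrite").parquet("s3://test/imported_data/TESTBOOTSTRAP_CASTED/")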

I have also tried to define an Avro logicalType for the decimal columns, as suggested in a few blogs:

{
  "name": "bootstraptest",
  "type": "record",
  "fields": [
    { "name": "CREATEDBY", "type": "string" },
    { "name": "ID", "type": { "type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 0 } },
    { "name": "CLIENT_ID", "type": { "type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 0 } }
  ]
}

This time data is generated, but the partition path comes out in a strange format (see the attached screenshot, "Screenshot 2022-07-05 at 3 41 28 PM").

I have also tried to reproduce this on my local machine and see the same behaviour:

hdfsparquetimport --upsert false --srcPath /Users/user1/hudi-res/parquet_data/ --targetPath /Users/user1/hudi-res/hudi_converted/ --tableName BOOTSTRAPTEST --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 50 --schemaFilePath /Users/user1/hudi-res/schema/test.avsc --format parquet --sparkMemory 2G --retry 1 --sparkMaster local
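For anyone trying to reproduce the same mismatch locally, a small PySpark sketch that writes a Parquet source with the same shape (string column plus two decimal(12,0) key columns); the sample values here are made up:

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.master("local[*]").appName("make-repro-parquet").getOrCreate()

# Mimic the source table: a string column plus two decimal(12,0) columns
schema = StructType([
    StructField("CREATEDBY", StringType(), True),
    StructField("ID", DecimalType(12, 0), True),
    StructField("CLIENT_ID", DecimalType(12, 0), True),
])

rows = [("user_a", Decimal(1), Decimal(100)),
        ("user_b", Decimal(2), Decimal(200))]

spark.createDataFrame(rows, schema).write.mode("overwrite") \
    .parquet("/Users/user1/hudi-res/parquet_data/")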

Expected behavior

What I am trying is very basic functionality of the hdfsparquetimport tool: converting table files in Parquet format to Hudi format. Please let me know if something is wrong with my config/command.

Environment Description

  • Hudi version : 0.10.1

  • Spark version : 3.2.0

  • Storage : S3

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
xushiyan commented, Jul 19, 2022

The import tool was deprecated; it is recommended to use the bootstrap feature instead. See https://hudi.apache.org/blog/2020/08/20/efficient-migration-of-large-parquet-tables

0 reactions
xushiyan commented, Jul 19, 2022

Bootstrap works well.

@cajil can you please migrate to use bootstrap instead?
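A minimal sketch of that bootstrap write through the Hudi Spark datasource, reusing the paths and key fields from the original command; this assumes the Hudi Spark bundle is on the classpath, and the keygen class chosen here is an assumption (see the linked migration blog for the full set of options):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("hudi-bootstrap").getOrCreate()

# Bootstrap reads the existing parquet files under hoodie.bootstrap.base.path,
# so the DataFrame being written can be empty.
empty_df = spark.createDataFrame([], StructType([]))

(empty_df.write.format("hudi")
    .option("hoodie.table.name", "TESTBOOTSTRAP")
    .option("hoodie.datasource.write.operation", "bootstrap")
    .option("hoodie.bootstrap.base.path", "s3://test/imported_data/TESTBOOTSTRAP/")
    .option("hoodie.datasource.write.recordkey.field", "ID")
    .option("hoodie.datasource.write.partitionpath.field", "CLIENT_ID")
    .option("hoodie.bootstrap.keygen.class", "org.apache.hudi.keygen.SimpleKeyGenerator")  # assumption
    .mode("overwrite")
    .save("s3://test/hudi_converted/"))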
