
Hudi output is just .parquet even though input is .snappy.parquet

See original GitHub issue

Is there any way I can get the output to be the same as snappy.parquet? I am giving my input as snappy.parquet.
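(For context: the .snappy marker in a file name is only a convention some writers append; the codec actually used is recorded per column chunk in the parquet footer. Below is a minimal sketch for checking what a given file was compressed with, assuming pyarrow is installed; the file name is taken from the metadata dump in the comments.)

```python
import pyarrow.parquet as pq

# Hudi output file from the comments below; note there is no ".snappy" in the name.
path = "87e76033-c836-457a-8166-2cf2af9f5d6c-0_0-6-6_20191203004801.parquet"

meta = pq.ParquetFile(path).metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # Prints e.g. "id SNAPPY" -- the codec lives in the footer, not the file name.
        print(chunk.path_in_schema, chunk.compression)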

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
sanjiv1980 commented, Dec 3, 2019

Oh, now I am able to see it, but how can we make it visible…?

extra:                  hoodie_min_record_key = 1 
extra:                  parquet.avro.schema = {"type":"record","name":"hoodie_test_record","namespace":"hoodie.hoodie_test","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"email","type":["string","null"]},{"name":"city","type":["string","null"]},{"name":"file","type":"string"},{"name":"year","type":"string"},{"name":"month","type":"string"},{"name":"day","type":"string"},{"name":"date","type":"string"}]} 
extra:                  hoodie_max_record_key = 5 

file schema:            hoodie.hoodie_test.hoodie_test_record 
--------------------------------------------------------------------------------
_hoodie_commit_time:    OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno:   OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key:     OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name:      OPTIONAL BINARY O:UTF8 R:0 D:1
id:                     OPTIONAL BINARY O:UTF8 R:0 D:1
name:                   OPTIONAL BINARY O:UTF8 R:0 D:1
email:                  OPTIONAL BINARY O:UTF8 R:0 D:1
city:                   OPTIONAL BINARY O:UTF8 R:0 D:1
file:                   REQUIRED BINARY O:UTF8 R:0 D:0
year:                   REQUIRED BINARY O:UTF8 R:0 D:0
month:                  REQUIRED BINARY O:UTF8 R:0 D:0
day:                    REQUIRED BINARY O:UTF8 R:0 D:0
date:                   REQUIRED BINARY O:UTF8 R:0 D:0

row group 1:            RC:5 TS:1826 OFFSET:4 
--------------------------------------------------------------------------------
_hoodie_commit_time:     BINARY SNAPPY DO:0 FPO:4 SZ:128/124/0.97 VC:5 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: 20191203004801, max: 20191203004801, num_nulls: 0]
_hoodie_commit_seqno:    BINARY SNAPPY DO:0 FPO:132 SZ:112/178/1.59 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 20191203004801_0_1, max: 20191203004801_0_5, num_nulls: 0]
_hoodie_record_key:      BINARY SNAPPY DO:0 FPO:244 SZ:59/58/0.98 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 1, max: 5, num_nulls: 0]
_hoodie_partition_path:  BINARY SNAPPY DO:0 FPO:303 SZ:183/179/0.98 VC:5 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: year=2019/month=08/day=10, max: year=2019/month=08/day=10, num_nulls: 0]
_hoodie_file_name:       BINARY SNAPPY DO:0 FPO:486 SZ:396/391/0.99 VC:5 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: 87e76033-c836-457a-8166-2cf2af9f5d6c-0_0-6-6_20191203004801.parquet, max: 87e76033-c836-457a-8166-2cf2af9f5d6c-0_0-6-6_20191203004801.parquet, num_nulls: 0]
id:                      BINARY SNAPPY DO:0 FPO:882 SZ:59/58/0.98 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 1, max: 5, num_nulls: 0]
name:                    BINARY SNAPPY DO:0 FPO:941 SZ:92/92/1.00 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: aarush, max: sanjiv, num_nulls: 0]
email:                   BINARY SNAPPY DO:0 FPO:1033 SZ:164/201/1.23 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: aarush.singh@gmail.com, max: sanjiv.kumar@gmail.com, num_nulls: 0]
city:                    BINARY SNAPPY DO:0 FPO:1197 SZ:101/100/0.99 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: bangalore, max: sasaram, num_nulls: 0]
file:                    BINARY SNAPPY DO:0 FPO:1298 SZ:92/88/0.96 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: hit_data, max: hit_data, num_nulls: 0]
year:                    BINARY SNAPPY DO:0 FPO:1390 SZ:72/68/0.94 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 2019, max: 2019, num_nulls: 0]
month:                   BINARY SNAPPY DO:0 FPO:1462 SZ:62/58/0.94 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 08, max: 08, num_nulls: 0]
day:                     BINARY SNAPPY DO:0 FPO:1524 SZ:62/58/0.94 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 10, max: 10, num_nulls: 0]
date:                    BINARY SNAPPY DO:0 FPO:1586 SZ:177/173/0.98 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: year=2019/month=08/day=10, max: year=2019/month=08/day=10, num_nulls: 0]
```
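(The row-group section above lists BINARY SNAPPY for every column, so the data pages are snappy-compressed; only the .snappy name suffix is missing. If the intent is instead to control which codec Hudi writes with, the Hudi configuration docs linked below describe a writer option for it. A hedged PySpark sketch, assuming the hoodie.parquet.compression.codec key and hypothetical input/table paths; the table name and field names follow the schema in the dump above:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/input/")  # hypothetical input path

# The codec is configurable, but the file Hudi produces is still named *.parquet.
(df.write.format("org.apache.hudi")
   .option("hoodie.table.name", "hoodie_test")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.partitionpath.field", "date")
   .option("hoodie.parquet.compression.codec", "snappy")
   .mode("append")
   .save("s3://bucket/hudi/hoodie_test"))  # hypothetical table path
```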
0 reactions
vinothchandar commented, Dec 6, 2019

Visible to queries? Queries don't go by file name; IIUC, they read this metadata from within the files to actually read them, right?
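(Put differently: the .snappy.parquet suffix seen on plain Spark output is appended by Spark's file writer, while Hudi constructs its own file names of the form <fileId>_<writeToken>_<instantTime>.parquet, as in the dump above, and does not append the codec. A minimal sketch contrasting the two, with a hypothetical output path:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)  # toy data

# Plain Spark appends the codec to each part file name,
# e.g. part-00000-...-c000.snappy.parquet.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/plain_out")

# A Hudi write with the same codec would still name the file
# <fileId>_<writeToken>_<instantTime>.parquet; readers don't rely on the
# suffix either way, since they take the codec from the footer.
```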

Read more comments on GitHub >

Top Results From Across the Web

All Configurations | Apache Hudi
These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi ...
Read more >
Hive parquet snappy compression not working - Stack Overflow
The solution is using “TBLPROPERTIES ('parquet.compression'='SNAPPY')” (and the case matters) in the DDL instead of “TBLPROPERTIES ('PARQUET. ...
Read more >
Loading Parquet data from Cloud Storage | BigQuery
To avoid resourcesExceeded errors when loading Parquet files into BigQuery, follow these guidelines: Keep record sizes to 50 MB or less. If your...
Read more >
UNLOAD - Amazon Athena - AWS Documentation
CSV is the only output format used by the Athena SELECT query, but you can use UNLOAD to write the ... For Parquet,...
Read more >
The Delta Lake Series — Complete Collection | Databricks
Lake is powered by Apache Spark, it's not only possible for multiple users to modify a ... and cumbersome with other traditional data...
Read more >
