
Hudi output is just .parquet even though input is .snappy.parquet

See original GitHub issue

Is there any way I can get the output to be the same as snappy.parquet? I am giving my input as snappy.parquet.
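(For context: the .snappy marker in a file name is only a convention some writers append; the codec actually used is recorded per column chunk in the parquet footer. Below is a minimal sketch for checking what a given file was compressed with, assuming pyarrow is installed; the file name is taken from the metadata dump in the comments.)

```python
import pyarrow.parquet as pq

# Hudi output file from the comments below; note there is no ".snappy" in the name.
path = "87e76033-c836-457a-8166-2cf2af9f5d6c-0_0-6-6_20191203004801.parquet"

meta = pq.ParquetFile(path).metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # Prints e.g. "id SNAPPY" -- the codec lives in the footer, not the file name.
        print(chunk.path_in_schema, chunk.compression)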

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
sanjiv1980 commented, Dec 3, 2019

Oh, now I am able to see it, but how can we make it visible…?

extra:                  hoodie_min_record_key = 1 
extra:                  parquet.avro.schema = {"type":"record","name":"hoodie_test_record","namespace":"hoodie.hoodie_test","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"email","type":["string","null"]},{"name":"city","type":["string","null"]},{"name":"file","type":"string"},{"name":"year","type":"string"},{"name":"month","type":"string"},{"name":"day","type":"string"},{"name":"date","type":"string"}]} 
extra:                  hoodie_max_record_key = 5 

file schema:            hoodie.hoodie_test.hoodie_test_record 
--------------------------------------------------------------------------------
_hoodie_commit_time:    OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno:   OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key:     OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name:      OPTIONAL BINARY O:UTF8 R:0 D:1
id:                     OPTIONAL BINARY O:UTF8 R:0 D:1
name:                   OPTIONAL BINARY O:UTF8 R:0 D:1
email:                  OPTIONAL BINARY O:UTF8 R:0 D:1
city:                   OPTIONAL BINARY O:UTF8 R:0 D:1
file:                   REQUIRED BINARY O:UTF8 R:0 D:0
year:                   REQUIRED BINARY O:UTF8 R:0 D:0
month:                  REQUIRED BINARY O:UTF8 R:0 D:0
day:                    REQUIRED BINARY O:UTF8 R:0 D:0
date:                   REQUIRED BINARY O:UTF8 R:0 D:0

row group 1:            RC:5 TS:1826 OFFSET:4 
--------------------------------------------------------------------------------
_hoodie_commit_time:     BINARY SNAPPY DO:0 FPO:4 SZ:128/124/0.97 VC:5 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: 20191203004801, max: 20191203004801, num_nulls: 0]
_hoodie_commit_seqno:    BINARY SNAPPY DO:0 FPO:132 SZ:112/178/1.59 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 20191203004801_0_1, max: 20191203004801_0_5, num_nulls: 0]
_hoodie_record_key:      BINARY SNAPPY DO:0 FPO:244 SZ:59/58/0.98 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 1, max: 5, num_nulls: 0]
_hoodie_partition_path:  BINARY SNAPPY DO:0 FPO:303 SZ:183/179/0.98 VC:5 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: year=2019/month=08/day=10, max: year=2019/month=08/day=10, num_nulls: 0]
_hoodie_file_name:       BINARY SNAPPY DO:0 FPO:486 SZ:396/391/0.99 VC:5 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: 87e76033-c836-457a-8166-2cf2af9f5d6c-0_0-6-6_20191203004801.parquet, max: 87e76033-c836-457a-8166-2cf2af9f5d6c-0_0-6-6_20191203004801.parquet, num_nulls: 0]
id:                      BINARY SNAPPY DO:0 FPO:882 SZ:59/58/0.98 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 1, max: 5, num_nulls: 0]
name:                    BINARY SNAPPY DO:0 FPO:941 SZ:92/92/1.00 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: aarush, max: sanjiv, num_nulls: 0]
email:                   BINARY SNAPPY DO:0 FPO:1033 SZ:164/201/1.23 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: aarush.singh@gmail.com, max: sanjiv.kumar@gmail.com, num_nulls: 0]
city:                    BINARY SNAPPY DO:0 FPO:1197 SZ:101/100/0.99 VC:5 ENC:PLAIN,BIT_PACKED,RLE ST:[min: bangalore, max: sasaram, num_nulls: 0]
file:                    BINARY SNAPPY DO:0 FPO:1298 SZ:92/88/0.96 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: hit_data, max: hit_data, num_nulls: 0]
year:                    BINARY SNAPPY DO:0 FPO:1390 SZ:72/68/0.94 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 2019, max: 2019, num_nulls: 0]
month:                   BINARY SNAPPY DO:0 FPO:1462 SZ:62/58/0.94 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 08, max: 08, num_nulls: 0]
day:                     BINARY SNAPPY DO:0 FPO:1524 SZ:62/58/0.94 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 10, max: 10, num_nulls: 0]
date:                    BINARY SNAPPY DO:0 FPO:1586 SZ:177/173/0.98 VC:5 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: year=2019/month=08/day=10, max: year=2019/month=08/day=10, num_nulls: 0]
```
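(The row-group section above lists BINARY SNAPPY for every column, so the data pages are snappy-compressed; only the .snappy name suffix is missing. If the intent is instead to control which codec Hudi writes with, the Hudi configuration docs linked below describe a writer option for it. A hedged PySpark sketch, assuming the hoodie.parquet.compression.codec key and hypothetical input/table paths; the table name and field names follow the schema in the dump above:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/input/")  # hypothetical input path

# The codec is configurable, but the file Hudi produces is still named *.parquet.
(df.write.format("org.apache.hudi")
   .option("hoodie.table.name", "hoodie_test")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.partitionpath.field", "date")
   .option("hoodie.parquet.compression.codec", "snappy")
   .mode("append")
   .save("s3://bucket/hudi/hoodie_test"))  # hypothetical table path
```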
0 reactions
vinothchandar commented, Dec 6, 2019

Visible to queries? Queries don't go by file name; IIUC, they read this metadata from within the files to actually read them, right?
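(Put differently: the .snappy.parquet suffix seen on plain Spark output is appended by Spark's file writer, while Hudi constructs its own file names of the form <fileId>_<writeToken>_<instantTime>.parquet, as in the dump above, and does not append the codec. A minimal sketch contrasting the two, with a hypothetical output path:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)  # toy data

# Plain Spark appends the codec to each part file name,
# e.g. part-00000-...-c000.snappy.parquet.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/plain_out")

# A Hudi write with the same codec would still name the file
# <fileId>_<writeToken>_<instantTime>.parquet; readers don't rely on the
# suffix either way, since they take the codec from the footer.
```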

Read more comments on GitHub >

Top Results From Across the Web

All Configurations | Apache Hudi
These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi ...
Read more >
Hive parquet snappy compression not working - Stack Overflow
The solution is using “TBLPROPERTIES ('parquet.compression'='SNAPPY')” (and the case matters) in the DDL instead of “TBLPROPERTIES ('PARQUET. ...
Read more >
Loading Parquet data from Cloud Storage | BigQuery
To avoid resourcesExceeded errors when loading Parquet files into BigQuery, follow these guidelines: Keep record sizes to 50 MB or less. If your...
Read more >
UNLOAD - Amazon Athena - AWS Documentation
CSV is the only output format used by the Athena SELECT query, but you can use UNLOAD to write the ... For Parquet,...
Read more >
The Delta Lake Series — Complete Collection | Databricks
Lake is powered by Apache Spark, it's not only possible for multiple users to modify a ... and cumbersome with other traditional data...
Read more >
