Efficiency of run length encoding
Version: 3.9.1
Runtime Version: .NET 5
Expected behavior
Repeated values in a column are removed by run-length encoding.
Actual behavior
The file produced by the sample below contains a repeating byte sequence: 00 FE 04 00 FE 04 00 FE 04, and so on. When I compress this file with 7-Zip, it becomes almost 200 times smaller.
Is this expected behavior?
Code snippet reproducing the behavior
using System.IO;
using Parquet;
using Parquet.Data;

// One million identical values: the ideal case for run-length encoding.
var buf = new int[1024 * 1024];
for (var i = 0; i < buf.Length; i++)
{
    buf[i] = 1;
}

using (var stream = File.Create("_test.parquet"))
{
    var field = new DataField<int>("Val");
    var schema = new Schema(field);
    using var parquetWriter = new ParquetWriter(schema, stream);
    using var groupWriter = parquetWriter.CreateRowGroup();

    // Write the whole buffer as a single column in a single row group.
    var column = new DataColumn(field, buf);
    groupWriter.WriteColumn(column);
}
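For comparison, here is a minimal sketch of what plain run-length encoding does to a buffer like the one above. It is not Parquet.Net's implementation and it does not follow Parquet's hybrid RLE/bit-packing layout; `RleSketch` and `Encode` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Parquet.Net.
public static class RleSketch
{
    // Collapses consecutive equal values into (value, count) pairs.
    public static List<(int Value, int Count)> Encode(int[] data)
    {
        var runs = new List<(int Value, int Count)>();
        if (data.Length == 0)
        {
            return runs;
        }

        var current = data[0];
        var count = 1;
        for (var i = 1; i < data.Length; i++)
        {
            if (data[i] == current)
            {
                count++;
            }
            else
            {
                // The run ended: store it and start a new one.
                runs.Add((current, count));
                current = data[i];
                count = 1;
            }
        }

        // Store the final run.
        runs.Add((current, count));
        return runs;
    }
}
```

Applied to the `buf` array from the snippet, `RleSketch.Encode(buf)` returns a single run with value 1 and count 1048576, which is why I would expect the column data to shrink to a handful of bytes.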
Plenty of time to benefit from someone else’s work though.
@aloneguid well, it’s your project. And RLE is one of the important features of Parquet. If you have decided not to make a quality implementation, it’s up to you 😃