Failing for BGZIP'd streaming files
Hi all, thanks for the wonderful library!
Unfortunately I think I’ve found a bug. Files compressed with bgzip (block gzip) fail when pako is used to do streaming decompression.
The file pako-fail-test-data.txt.gz is an example file that triggers what I believe to be an error. The uncompressed file is 65,569 bytes, which is just larger than what I assume to be the block size relevant to bgzip (somewhere around 65,280 bytes). Here is a small shell session with some relevant information:
$ wc pako-fail-test-data.txt
1858 16831 65569 pako-fail-test-data.txt
$ md5sum pako-fail-test-data.txt
7eae4c6bc0e68326879728f80a0e002b pako-fail-test-data.txt
$ zcat pako-fail-test-data.gz | bgzip -c > pako-fail-test-data.txt.gz
$ md5sum pako-fail-test-data.txt.gz
f4d0b896c191f66ff6962de37d69db45 pako-fail-test-data.txt.gz
$ bgzip -h
Version: 1.4.1
Usage: bgzip [OPTIONS] [FILE] ...
Options:
-b, --offset INT decompress at virtual file pointer (0-based uncompressed offset)
-c, --stdout write on standard output, keep original files unchanged
-d, --decompress decompress
-f, --force overwrite files without asking
-h, --help give this help
-i, --index compress and create BGZF index
-I, --index-name FILE name of BGZF index file [file.gz.gzi]
-r, --reindex (re)index compressed file
-g, --rebgzip use an index file to bgzip a file
-s, --size INT decompress INT bytes (uncompressed size)
-@, --threads INT number of compression threads to use [1]
Here is some sample code that should decompress the whole file, but doesn’t. My apologies that it isn’t elegant; I’m still learning and threw a few things together to get something that I believe triggers the error:
var pako = require("pako"),
fs = require("fs");
var CHUNK_SIZE = 1024*1024,
buffer = new Buffer(CHUNK_SIZE);
function _node_uint8array_to_string(data) {
var buf = new Buffer(data.length);
for (var ii=0; ii<data.length; ii++) {
buf[ii] = data[ii];
}
return buf.toString();
}
var inflator = new pako.Inflate();
inflator.onData = function(chunk) {
var v = _node_uint8array_to_string(chunk);
process.stdout.write(v);
};
fs.open("./pako-fail-test-data.txt.gz", "r", function(err,fd) {
if (err) { throw err; }
function read_chunk() {
fs.read(fd, buffer, 0, CHUNK_SIZE, null,
function(err, nread) {
var data = buffer;
if (nread<CHUNK_SIZE) { data = buffer.slice(0, nread); }
inflator.push(data, false);
if (nread > 0) { read_chunk(); }
});
};
read_chunk();
});
I did not indicate an end block (that is, I did not do inflator.push(data, true) anywhere; a variant of the loop that does flag the final chunk is sketched below, after the shell output), and there are maybe other problems with how the data blocks are read from fs, but I hope you’ll forgive this sloppiness in the interest of keeping the example simple enough to illuminate the relevant issue.
Running this does successfully decompress a portion of the file but then stops at what I believe to be the first block. Here are some shell commands that might be enlightening:
$ node pako-error-example.js | wc
1849 16755 65280
$ node pako-error-example.js | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8 -
$ zcat pako-fail-test-data.txt.gz | md5sum
7eae4c6bc0e68326879728f80a0e002b -
$ zcat pako-fail-test-data.txt.gz | head -c 65280 | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8 -
$ zcat pako-fail-test-data.txt.gz | wc
1858 16831 65569
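For reference, here is that variant of the read loop, with the final chunk flagged via push(chunk, true) once the read returns zero bytes. This is an untested sketch (it uses Buffer.alloc/Buffer.from, so it assumes a reasonably recent Node); I would expect the same truncated output, since the inflator already reports the stream as ended after the first block, but it shows how the end of input would normally be signalled:

var pako = require("pako"),
    fs = require("fs");

var CHUNK_SIZE = 1024 * 1024,
    buffer = Buffer.alloc(CHUNK_SIZE);

var inflator = new pako.Inflate();
inflator.onData = function (chunk) {
  process.stdout.write(Buffer.from(chunk).toString());
};
inflator.onEnd = function (status) {
  if (status !== 0) { console.error("inflate error:", inflator.msg); }
};

fs.open("./pako-fail-test-data.txt.gz", "r", function (err, fd) {
  if (err) { throw err; }
  function read_chunk() {
    fs.read(fd, buffer, 0, CHUNK_SIZE, null, function (err, nread) {
      if (err) { throw err; }
      if (nread === 0) {
        // No more data: push an empty chunk with the "last chunk" flag set.
        inflator.push(new Uint8Array(0), true);
        return;
      }
      // More data may follow, so do not set the final flag yet.
      inflator.push(buffer.slice(0, nread), false);
      read_chunk();
    });
  }
  read_chunk();
});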
Running another simple example using browserify-zlib triggers an error outright:
var fs = require("fs"),
zlib = require("browserify-zlib");
var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();
z.on("data", function(chunk) {
process.stdout.write(chunk.toString());
});
r.pipe(z);
And when run via node stream-example-2.js, the error produced is:
events.js:137
throw er; // Unhandled 'error' event
^
Error: invalid distance too far back
at Zlib._handle.onerror (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/index.js:352:17)
at Zlib._error (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:283:8)
at Zlib._checkError (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:254:12)
at Zlib._after (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:262:13)
at /home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:126:10
at process._tickCallback (internal/process/next_tick.js:150:11)
I assume this is a pako error, as browserify-zlib uses pako underneath, so my apologies if this is a browserify-zlib error and has nothing to do with pako.
As a “control”, the following code using Node’s built-in zlib works without issue:
var fs = require("fs"),
zlib = require("zlib");
var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();
z.on("data", function(chunk) {
process.stdout.write(chunk.toString());
});
r.pipe(z);
bgzip is used to allow random access into gzipped files. The resulting block-compressed file is a bit bigger than with straight gzip compression, but the small increase in compressed file size is often worth it for the ability to efficiently access arbitrary positions in the uncompressed data.
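For anyone curious how the random access works: as far as I understand it, a BGZF file is just a series of independent gzip members, and each member records its own total compressed length in a “BC” extra subfield of its gzip header. The following is a rough, untested sketch based on my reading of the BGZF spec (not anything from pako itself); it reads the whole file at once purely to illustrate the format, lists the block boundaries, and inflates each block separately with pako.ungzip:

var pako = require("pako"),
    fs = require("fs");

// Walk a BGZF file and return the byte offset and length of each block.
// Each block is a gzip member whose FEXTRA field (always present in bgzip
// output) contains a "BC" subfield holding BSIZE = total block length - 1.
function bgzf_blocks(buf) {
  var blocks = [], pos = 0;
  while (pos < buf.length) {
    if (buf[pos] !== 0x1f || buf[pos + 1] !== 0x8b) {
      throw new Error("not a gzip member at offset " + pos);
    }
    var xlen = buf.readUInt16LE(pos + 10),  // length of the extra field
        xpos = pos + 12,
        xend = xpos + xlen,
        bsize = -1;
    while (xpos < xend) {
      var si1 = buf[xpos], si2 = buf[xpos + 1],
          slen = buf.readUInt16LE(xpos + 2);
      if (si1 === 66 && si2 === 67) {       // 'B', 'C' subfield
        bsize = buf.readUInt16LE(xpos + 4); // total block size minus 1
      }
      xpos += 4 + slen;
    }
    if (bsize < 0) { throw new Error("no BC subfield; not a BGZF block?"); }
    blocks.push({ offset: pos, length: bsize + 1 });
    pos += bsize + 1;
  }
  return blocks;
}

var data = fs.readFileSync("./pako-fail-test-data.txt.gz");
bgzf_blocks(data).forEach(function (b) {
  // Each block is a complete gzip member, so it can be inflated on its own.
  var text = pako.ungzip(data.slice(b.offset, b.offset + b.length), { to: "string" });
  process.stdout.write(text);
});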
My specific use case is that I want to process a large text file (~115 MB compressed, ~650 MB uncompressed, with other files being even larger). Loading the complete file, either compressed or uncompressed, is not an option, either because of memory exhaustion or straight-up memory restrictions in JavaScript. I only need to process the data in a streaming manner (that is, I only need to look at the data once and can then mostly discard it), which is why I was looking into this option. The bioinformatics community uses this method quite a bit (bgzip is itself part of tabix, which is part of a bioinformatics library called htslib), so it would be nice if pako supported this use case.
If there is another library I should be using to allow for stream processing of compressed data in the browser, I would welcome any suggestions.
@drtconway The wrapper changed significantly, but a multistream test exists: https://github.com/nodeca/pako/blob/0398fad238edc29df44f78e338cbcfd5ee2657d3/test/gzip_specials.js#L60-L77
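Not the linked test itself, but for older pako versions the usual workaround I have seen is to restart inflation whenever one gzip member ends while unconsumed input remains. A rough, untested sketch follows; note that it leans on pako internals (inflator.strm, inflator.ended), which may change between versions:

var pako = require("pako"),
    fs = require("fs");

// Decompress a concatenation of gzip members (e.g. a BGZF file) by starting
// a new pako.Inflate whenever the previous member ends with input left over.
function inflate_multistream(data, onText) {
  var offset = 0;
  while (offset < data.length) {
    var inflator = new pako.Inflate({ to: "string" });
    inflator.onData = onText;
    inflator.push(data.subarray(offset), true);
    if (inflator.err) { throw new Error(inflator.msg); }
    if (!inflator.ended) { break; }    // ran out of input mid-member
    offset += inflator.strm.next_in;   // bytes of this push that were consumed
  }
}

var buf = fs.readFileSync("./pako-fail-test-data.txt.gz");
inflate_multistream(new Uint8Array(buf), function (text) {
  process.stdout.write(text);
});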
@rbuels I’m working on reading local fastq.gz files in the browser and stumbled upon this issue. Haven’t been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser?
EDIT: I need streaming because the files are large. I don’t need (and can’t afford) to store the entire file in memory; I just need to stream through all the lines to gather some statistics.