Failing for BGZIP'd streaming files
Hi all, thanks for the wonderful library!
Unfortunately I think I’ve found a bug. Files compressed with bgzip (block gzip) fail when pako is used to do streaming decompression.
The file pako-fail-test-data.txt.gz is an example file that triggers what I believe to be an error. The uncompressed file is 65,569 bytes, which is just larger than what I assume to be the block size relevant to bgzip (somewhere around 65,280 bytes). Here is a small shell session with some relevant information:
$ wc pako-fail-test-data.txt
1858 16831 65569 pako-fail-test-data.txt
$ md5sum pako-fail-test-data.txt
7eae4c6bc0e68326879728f80a0e002b pako-fail-test-data.txt
$ zcat pako-fail-test-data.gz | bgzip -c > pako-fail-test-data.txt.gz
$ md5sum pako-fail-test-data.txt.gz
f4d0b896c191f66ff6962de37d69db45 pako-fail-test-data.txt.gz
$ bgzip -h
Version: 1.4.1
Usage: bgzip [OPTIONS] [FILE] ...
Options:
-b, --offset INT decompress at virtual file pointer (0-based uncompressed offset)
-c, --stdout write on standard output, keep original files unchanged
-d, --decompress decompress
-f, --force overwrite files without asking
-h, --help give this help
-i, --index compress and create BGZF index
-I, --index-name FILE name of BGZF index file [file.gz.gzi]
-r, --reindex (re)index compressed file
-g, --rebgzip use an index file to bgzip a file
-s, --size INT decompress INT bytes (uncompressed size)
-@, --threads INT number of compression threads to use [1]
Here is some sample code that should decompress the whole file, but doesn’t. My apologies that it isn’t elegant; I’m still learning and threw a few things together to get something that I believe triggers the error:
var pako = require("pako"),
fs = require("fs");
var CHUNK_SIZE = 1024*1024,
buffer = new Buffer(CHUNK_SIZE);
function _node_uint8array_to_string(data) {
var buf = new Buffer(data.length);
for (var ii=0; ii<data.length; ii++) {
buf[ii] = data[ii];
}
return buf.toString();
}
var inflator = new pako.Inflate();
inflator.onData = function(chunk) {
var v = _node_uint8array_to_string(chunk);
process.stdout.write(v);
};
fs.open("./pako-fail-test-data.txt.gz", "r", function(err,fd) {
if (err) { throw err; }
function read_chunk() {
fs.read(fd, buffer, 0, CHUNK_SIZE, null,
function(err, nread) {
var data = buffer;
if (nread<CHUNK_SIZE) { data = buffer.slice(0, nread); }
inflator.push(data, false);
if (nread > 0) { read_chunk(); }
});
};
read_chunk();
});
I did not indicate an end block (that is, I did not do inflator.push(data, true) anywhere; a variant of the loop that does flag the final chunk is sketched below, after the shell output), and there are maybe other problems with how the data blocks are read from fs, but I hope you’ll forgive this sloppiness in the interest of keeping the example simple enough to illuminate the relevant issue.
Running this does successfully decompress a portion of the file but then stops at what I believe to be the first block. Here are some shell commands that might be enlightening:
$ node pako-error-example.js | wc
1849 16755 65280
$ node pako-error-example.js | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8 -
$ zcat pako-fail-test-data.txt.gz | md5sum
7eae4c6bc0e68326879728f80a0e002b -
$ zcat pako-fail-test-data.txt.gz | head -c 65280 | md5sum
a55dd4f2c7619a52fd6bc76e2af631b8 -
$ zcat pako-fail-test-data.txt.gz | wc
1858 16831 65569
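For reference, here is that variant of the read loop, with the final chunk flagged via push(chunk, true) once the read returns zero bytes. This is an untested sketch (it uses Buffer.alloc/Buffer.from, so it assumes a reasonably recent Node); I would expect the same truncated output, since the inflator already reports the stream as ended after the first block, but it shows how the end of input would normally be signalled:

var pako = require("pako"),
    fs = require("fs");

var CHUNK_SIZE = 1024 * 1024,
    buffer = Buffer.alloc(CHUNK_SIZE);

var inflator = new pako.Inflate();
inflator.onData = function (chunk) {
  process.stdout.write(Buffer.from(chunk).toString());
};
inflator.onEnd = function (status) {
  if (status !== 0) { console.error("inflate error:", inflator.msg); }
};

fs.open("./pako-fail-test-data.txt.gz", "r", function (err, fd) {
  if (err) { throw err; }
  function read_chunk() {
    fs.read(fd, buffer, 0, CHUNK_SIZE, null, function (err, nread) {
      if (err) { throw err; }
      if (nread === 0) {
        // No more data: push an empty chunk with the "last chunk" flag set.
        inflator.push(new Uint8Array(0), true);
        return;
      }
      // More data may follow, so do not set the final flag yet.
      inflator.push(buffer.slice(0, nread), false);
      read_chunk();
    });
  }
  read_chunk();
});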
Running another simple example using browserify-zlib triggers an error outright:
var fs = require("fs"),
zlib = require("browserify-zlib");
var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();
z.on("data", function(chunk) {
process.stdout.write(chunk.toString());
});
r.pipe(z);
And when run via node stream-example-2.js, the error produced is:
events.js:137
throw er; // Unhandled 'error' event
^
Error: invalid distance too far back
at Zlib._handle.onerror (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/index.js:352:17)
at Zlib._error (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:283:8)
at Zlib._checkError (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:254:12)
at Zlib._after (/home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:262:13)
at /home/abe/play/js/browser-large-file/node_modules/browserify-zlib/lib/binding.js:126:10
at process._tickCallback (internal/process/next_tick.js:150:11)
I assume this is a pako error, as browserify-zlib uses pako underneath, so my apologies if this is a browserify-zlib error and has nothing to do with pako.
As a “control”, the following code using Node’s built-in zlib works without issue:
var fs = require("fs"),
zlib = require("zlib");
var r = fs.createReadStream('pako-fail-test-data.txt.gz');
var z = zlib.createGunzip();
z.on("data", function(chunk) {
process.stdout.write(chunk.toString());
});
r.pipe(z);
bgzip is used to allow random access into gzipped files. The resulting block-compressed file is a bit bigger than with straight gzip compression, but the small increase in compressed file size is often worth it for the ability to efficiently access arbitrary positions in the uncompressed data.
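For anyone curious how the random access works: as far as I understand it, a BGZF file is just a series of independent gzip members, and each member records its own total compressed length in a “BC” extra subfield of its gzip header. The following is a rough, untested sketch based on my reading of the BGZF spec (not anything from pako itself); it reads the whole file at once purely to illustrate the format, lists the block boundaries, and inflates each block separately with pako.ungzip:

var pako = require("pako"),
    fs = require("fs");

// Walk a BGZF file and return the byte offset and length of each block.
// Each block is a gzip member whose FEXTRA field (always present in bgzip
// output) contains a "BC" subfield holding BSIZE = total block length - 1.
function bgzf_blocks(buf) {
  var blocks = [], pos = 0;
  while (pos < buf.length) {
    if (buf[pos] !== 0x1f || buf[pos + 1] !== 0x8b) {
      throw new Error("not a gzip member at offset " + pos);
    }
    var xlen = buf.readUInt16LE(pos + 10),  // length of the extra field
        xpos = pos + 12,
        xend = xpos + xlen,
        bsize = -1;
    while (xpos < xend) {
      var si1 = buf[xpos], si2 = buf[xpos + 1],
          slen = buf.readUInt16LE(xpos + 2);
      if (si1 === 66 && si2 === 67) {       // 'B', 'C' subfield
        bsize = buf.readUInt16LE(xpos + 4); // total block size minus 1
      }
      xpos += 4 + slen;
    }
    if (bsize < 0) { throw new Error("no BC subfield; not a BGZF block?"); }
    blocks.push({ offset: pos, length: bsize + 1 });
    pos += bsize + 1;
  }
  return blocks;
}

var data = fs.readFileSync("./pako-fail-test-data.txt.gz");
bgzf_blocks(data).forEach(function (b) {
  // Each block is a complete gzip member, so it can be inflated on its own.
  var text = pako.ungzip(data.slice(b.offset, b.offset + b.length), { to: "string" });
  process.stdout.write(text);
});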
My specific use case is that I want to process a large text file (~115 MB compressed, ~650 MB uncompressed, with other files being even larger). Loading the complete file, either compressed or uncompressed, is not an option, either because of memory exhaustion or straight-up memory restrictions in JavaScript. I only need to process the data in a streaming manner (that is, I only need to look at the data once and can then mostly discard it), which is why I was looking into this option. The bioinformatics community uses this method quite a bit (bgzip is itself part of tabix, which is part of a bioinformatics library called htslib), so it would be nice if pako supported this use case.
If there is another library I should be using to allow for stream processing of compressed data in the browser, I would welcome any suggestions.
@drtconway The wrapper changed significantly, but a multistream test exists: https://github.com/nodeca/pako/blob/0398fad238edc29df44f78e338cbcfd5ee2657d3/test/gzip_specials.js#L60-L77
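Not the linked test itself, but for older pako versions the usual workaround I have seen is to restart inflation whenever one gzip member ends while unconsumed input remains. A rough, untested sketch follows; note that it leans on pako internals (inflator.strm, inflator.ended), which may change between versions:

var pako = require("pako"),
    fs = require("fs");

// Decompress a concatenation of gzip members (e.g. a BGZF file) by starting
// a new pako.Inflate whenever the previous member ends with input left over.
function inflate_multistream(data, onText) {
  var offset = 0;
  while (offset < data.length) {
    var inflator = new pako.Inflate({ to: "string" });
    inflator.onData = onText;
    inflator.push(data.subarray(offset), true);
    if (inflator.err) { throw new Error(inflator.msg); }
    if (!inflator.ended) { break; }    // ran out of input mid-member
    offset += inflator.strm.next_in;   // bytes of this push that were consumed
  }
}

var buf = fs.readFileSync("./pako-fail-test-data.txt.gz");
inflate_multistream(new Uint8Array(buf), function (text) {
  process.stdout.write(text);
});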
@rbuels I’m working on reading local fastq.gz files in the browser and stumbled upon this issue. Haven’t been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser?
EDIT: I need streaming because the files are large. I don’t need (and can’t afford) to store the entire file in memory; I just need to stream through all the lines to gather some statistics.