TrecDocs: .Z and .z files are different.

Describe the bug

I’ve stumbled on this before, and the same issue seems to happen here. .z and .Z files are not always equivalent, but TrecDocs treats them as if they were by calling .lower() on the suffix of the Path object:

https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L127-L137
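To make the conflation concrete, here is a minimal sketch. It is not the ir_datasets code: both function names and the returned labels are hypothetical, and it assumes the suffix alone is used to pick a decoder.

```python
from pathlib import Path


def codec_by_lowercased_suffix(path: Path) -> str:
    # Pattern described in the issue: case is discarded,
    # so ".z" and ".Z" both end up on the LZW branch.
    if path.suffix.lower() == ".z":
        return "lzw"
    return "plain"


def codec_by_exact_suffix(path: Path) -> str:
    # Hypothetical alternative: keep the case so the formats stay distinct.
    if path.suffix == ".Z":
        return "lzw"   # written by Unix compress
    if path.suffix in (".z", ".gz"):
        return "gzip"  # written by gzip (".z" via -S .z)
    return "plain"
```

With the first function a gzip-compressed doc.z is sent to the LZW decoder; with the second it is not. A suffix check still trusts the file name, so this only narrows the problem.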

.Z files are created by the Unix command compress (from the man page):

Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (…)

while .z files are created by using gzip:

gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z (…)

Note that gunzip can, in theory, decompress both formats, but unlzw3 seems to read only the first (.Z).
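Since the suffix is ambiguous, one option is to look at the first two bytes of the file instead. The sketch below assumes the standard magic numbers (0x1f 0x8b for gzip, 0x1f 0x9d for compress/LZW); the function names are hypothetical and this is not how ir_datasets currently works.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # gzip streams start with these two bytes
LZW_MAGIC = b"\x1f\x9d"   # Unix compress (.Z) streams start with these


def sniff_compression(header: bytes) -> str:
    # Classify a file by its leading bytes rather than by its name.
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(LZW_MAGIC):
        return "lzw"
    return "unknown"


def decode(data: bytes) -> bytes:
    kind = sniff_compression(data[:2])
    if kind == "gzip":
        return gzip.decompress(data)
    if kind == "lzw":
        # The standard library has no LZW decoder; an external one
        # (unlzw3, as mentioned above) would be needed here.
        raise NotImplementedError("LZW data needs an external decoder")
    return data  # assume the data is not compressed
```

For example, decode(gzip.compress(b"<DOC>")) returns the original bytes regardless of what the file was called.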

Some Disks45 distributions (mine, for instance) are compressed with the .z suffix (i.e. using gzip with the option -S .z):

-S .suf --suffix .suf When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.

Affected dataset(s)

All datasets that use TrecDocs, but most likely Disks45.

To Reproduce

Trying to read documents from a .z compressed file results in this:

TypeError: string argument without an encoding

Additional context

The error is triggered on this line: https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L136

Top GitHub Comments

ArthurCamara commented, May 5, 2022

@seanmacavaney Of course! I think #188 should be REALLY straightforward to fix. We first ran into this kind of problem with Robust during OSIRRC 2019. Everyone had a different version of it.

ArthurCamara commented, May 9, 2022

Of course. I don’t think you should worry about it, as I said in #191