TrecDocs: .Z and .z files are different.
Describe the bug
I've stumbled on this before, and it seems like the same issue happens here. .z and .Z files are not always equivalent, but TrecDocs treats them as if they were by calling .lower() on the suffix of the Path object.
.Z files are created by the Unix command compress. From the man page:
Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (…)
.z files, on the other hand, are created by gzip. From the gunzip man page:
gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z (…)
Note that gunzip can, in theory, decompress both formats, but unlzw3 seems able to read only the first one (.Z, the compress format).
Some Disks45 distributions (mine, for instance) are compressed with the .z suffix (i.e. using gzip with the option -S .z):
-S .suf --suffix .suf When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.
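One way to sidestep the suffix ambiguity would be to look at the first two bytes of the file instead of its name: gzip streams start with 0x1f 0x8b, while compress (LZW) streams start with 0x1f 0x9d. Below is a minimal sketch of that idea, not the actual TrecDocs code. The names sniff_compression and read_maybe_compressed are hypothetical, and the sketch assumes that unlzw3 exposes an unlzw() function that accepts bytes.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"      # written by gzip (whatever the suffix: .gz, .z, ...)
COMPRESS_MAGIC = b"\x1f\x9d"  # written by Unix compress (usually .Z)


def sniff_compression(head: bytes) -> str:
    """Guess the compression format from the leading bytes of a file."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(COMPRESS_MAGIC):
        return "compress"
    return "none"


def read_maybe_compressed(path) -> bytes:
    """Hypothetical helper: return the decompressed bytes of a file,
    choosing the decoder from the file content rather than the suffix."""
    raw = Path(path).read_bytes()
    kind = sniff_compression(raw[:2])
    if kind == "gzip":
        return gzip.decompress(raw)
    if kind == "compress":
        # Third-party dependency, imported lazily.
        # Assumption: unlzw3.unlzw() accepts the raw bytes of a .Z file.
        import unlzw3
        return unlzw3.unlzw(raw)
    return raw  # not compressed, return as-is
```

With this approach, a gzip file named with a .z suffix would go through gzip.decompress and never reach unlzw3, regardless of how the suffix is cased.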
Affected dataset(s)
All datasets that use TrecDocs, but most likely Disks45.
To Reproduce
Trying to read documents from .z compressed files results in this:
TypeError: string argument without an encoding
Additional context
The error is triggered on this line: https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L136
Top GitHub Comments
@seanmacavaney Of course! I think #188 should be REALLY straightforward to fix. We first ran into this kind of problem with Robust during OSIRRC 2019. Everyone had a different version of it.
Of course. I don't think you should worry about it, as I said in #191.