TrecDocs: .Z and .z files are different.


Describe the bug

I’ve stumbled on this before, and the same issue seems to happen here. .z and .Z files are not always equivalent, but TrecDocs treats them as if they were by calling .lower() on the suffix of the Path object:

https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L127-L137

.Z files are created by the Unix command compress. From the man page:

Compress reduces the size of the named files using adaptive Lempel-Ziv coding. Whenever possible, each file is replaced by one with the extension .Z (…)

while .z files are created by using gzip:

gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, _z or .Z (…)

Note that gunzip can, in theory, decompress both formats, but unlzw3 seems able to read only the first (.Z).

Some Disks45 distributions (mine, for instance) are compressed with the .z suffix (i.e. using gzip with option -S .z):

-S .suf --suffix .suf When compressing, use suffix .suf instead of .gz. Any non-empty suffix can be given, but suffixes other than .z and .gz should be avoided to avoid confusion when files are transferred to other systems.
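Since the suffix alone cannot tell the two formats apart, one option is to look at the first two bytes of the file instead. The sketch below is only an illustration and not the ir_datasets implementation: `sniff_compression` and `open_trec_file` are hypothetical names. It relies on the well-known magic numbers (`1f 8b` for gzip, `1f 9d` for compress) and assumes the third-party unlzw3 package is installed for the compress branch.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # files written by gzip, whatever their suffix
LZW_MAGIC = b"\x1f\x9d"   # files written by the Unix compress command


def sniff_compression(header: bytes) -> str:
    """Classify a file by its leading magic bytes rather than its suffix."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(LZW_MAGIC):
        return "compress"
    return "unknown"


def open_trec_file(path):
    """Hypothetical helper: return a binary stream of the decompressed file."""
    with open(path, "rb") as f:
        data = f.read()
    kind = sniff_compression(data[:2])
    if kind == "gzip":
        return io.BytesIO(gzip.decompress(data))
    if kind == "compress":
        # Imported lazily so gzip-only collections do not need the package.
        import unlzw3
        return io.BytesIO(unlzw3.unlzw(data))
    # Neither magic number matched: treat the file as uncompressed.
    return io.BytesIO(data)
```

With this approach a gzip file named with a .z suffix and a compress file named with a .Z suffix would both open correctly, regardless of how the suffix is cased.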

Affected dataset(s)

All datasets that use TrecDocs, most likely Disks45.

To Reproduce

Trying to read documents from a .z-compressed file results in this:

TypeError: string argument without an encoding

Additional context

The error is triggered on this line: https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L136
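As a side note, this exact message is the one CPython raises when `bytes()` or `bytearray()` is given a `str` without an encoding. The snippet below only reproduces the message in isolation. Whether a `str` really reaches such a call inside the linked line is an assumption and is not established in this issue.

```python
# Reproduce the error message in isolation. This does not call
# ir_datasets or unlzw3; the str literal stands in for any text
# object that ends up where raw bytes are expected.
try:
    bytearray("disk45/fb396001.z")
except TypeError as exc:
    message = str(exc)

print(message)
```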

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
ArthurCamara commented, May 5, 2022

@seanmacavaney Of course! I think #188 should be REALLY straightforward to fix. We first ran into this kind of problem with Robust during OSIRRC 2019. Everyone had a different version of it.

0 reactions
ArthurCamara commented, May 9, 2022

Of course. I don’t think you should worry about it, as I said in #191
