Determining Encoding of an Archive

I am hoping someone can either help me with this problem or maybe raise it as an issue. I’m working on a small tool to recursively extract nested archives until there are no more archives, but I’m running into issues with encoding. It may just be because I am a novice, but I see no way to determine the encoding dynamically, for example with another library like Ude, and then create a readable string out of it.

Right now, I’m seeing a number of zip file entries that convert with symbols and gibberish in their paths. It isn’t many at all, but I’d much prefer to find a way to dynamically determine what encoding should be used.

I’ve seen other posts about this, such as #277, and I also see the comment that it is resolved as of 0.18. But I can find no demonstration of how it is resolved.

I will share a snippet of my switch where I have a case to handle zip files so you can see how I am accessing the archive. I have tried specifying various encodings, but really what I need is a way to dynamically determine the correct encoding, or maybe for someone to help me understand something I am currently missing.

case ".zip":
    using (var archive = ZipArchive.Open(file.FullName))
    {
        foreach (var entry in archive.Entries)
        {
            // perform extraction of the entry
        }
    }
    break;

Issue Analytics

  • State: open
  • Created 4 months ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

R0315 commented, Aug 15, 2023

@R0315 how did you get CharsetDetector? I have a similar problem where I am trying to extract from Tar.tar in the TestArchives. If I don’t provide the ArchiveEncoding as CP437, it extracts with ???.txt as the name. I will create an issue for that, but I just wanted to see if you have had success with the solution you mentioned above.

I used the Ude.NetStandard package.

I ended up taking a route with my project where if the program could tell that files within have encoding issues, it left them in the archive and, in the end, it generates a report that shows you what the issue was. I went and grabbed that archive you mentioned and, indeed, my code is unable to tell what the file encoding should be.

I remember having issues using SharpCompress to try to scrape a KB of the file and test the encoding when working with tar files. I think the problem was that I couldn’t actually access the entry streams within to take a sample. However, in my use case, what I had was adequate to account for a lot of the issues I was trying to solve. So, leaving a few anomalous archives and just generating a report was more than satisfactory compared to all the time that was being spent extracting manually.

However, the heuristic approach I mentioned does work very well for any archive where you can access the entry streams and grab a chunk of bytes to test. If you can come up with a way to access the tar entry streams, then you might be able to try the same approach.
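
A minimal sketch of the sampling step described above, assuming the entry content is available as a readable stream. The names EncodingProbe, ReadSample, and LooksLikeUtf8 are hypothetical, and the strict UTF-8 check is only a stand-in for a real detector such as the CharsetDetector from Ude.NetStandard: it can confirm or rule out UTF-8, but it cannot name a legacy code page.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of SharpCompress or Ude.
public static class EncodingProbe
{
    // Strict UTF-8: invalid byte sequences raise DecoderFallbackException
    // instead of being silently replaced with U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Reads up to maxBytes from the stream (about 1 KB by default).
    // Stream.Read may return fewer bytes than requested, so loop until
    // the buffer is full or the stream ends.
    public static byte[] ReadSample(Stream stream, int maxBytes = 1024)
    {
        var buffer = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes)
        {
            int read = stream.Read(buffer, total, maxBytes - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        Array.Resize(ref buffer, total);
        return buffer;
    }

    // Returns true when the sample is a valid UTF-8 prefix. With flush: false,
    // a multi-byte character cut off at the end of the sample is tolerated.
    public static bool LooksLikeUtf8(byte[] sample)
    {
        var decoder = StrictUtf8.GetDecoder();
        try
        {
            decoder.GetCharCount(sample, 0, sample.Length, flush: false);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A sample that fails this check is a candidate for a proper charset detector, or for the kind of report described above.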

R0315 commented, Aug 15, 2023

I have tried your approach and it seems to be working to an extent. While it is not able to identify the encoding as CP437, it is no longer giving out ???.txt for the file name. I am also thinking that maybe I will read the filenames as well, to see if that can be factored into deciding the encoding. But right now reading the 1 KB of binary is a good start, I guess. Thanks

Awesome, glad it’s helping! Yes, I had a hard time with the encodings. I’m not sure if we’ll really be able to iron out a 100% solution so long as the archiving software itself doesn’t enforce a standard. As I recall, I learned that quite a few of them allow for just using whatever the system encoding of the user is, and that is the reason for this headache.

That said though, I’m sure what I did can be improved upon or even replaced with a better approach. If you get something going that works better please let me know! I’d be happy to apply it to my project too.
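
As a rough illustration of factoring the filenames into the decision, the sketch below tries strict UTF-8 on the raw name bytes first and only then falls back to a caller-chosen legacy encoding. EntryNameDecoder, Decode, and LooksMisdecoded are hypothetical names, not SharpCompress APIs, and how the raw name bytes are obtained from the archive library is assumed and not shown.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of SharpCompress.
public static class EntryNameDecoder
{
    // Strict UTF-8 throws on invalid bytes rather than emitting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Decodes raw entry-name bytes: UTF-8 if the bytes are valid UTF-8,
    // otherwise the supplied fallback (for example CP437 or the system code page).
    public static string Decode(byte[] rawName, Encoding fallback)
    {
        try
        {
            return StrictUtf8.GetString(rawName);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(rawName);
        }
    }

    // Heuristic only: flags names that contain the Unicode replacement
    // character or a run of question marks, which often indicates that
    // the wrong encoding was used when the name was decoded.
    public static bool LooksMisdecoded(string name)
    {
        return name.IndexOf('\uFFFD') >= 0 || name.Contains("???");
    }
}
```

On .NET Core and .NET 5+, Encoding.GetEncoding(437) only works after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance). Names that still look misdecoded after the fallback could be left in the archive and listed in a report, as described earlier in the thread.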
