Determining Encoding of an Archive

I am hoping someone can either help me with this problem or raise it as an issue. I’m working on a small tool that recursively extracts nested archives until no archives remain, but I’m running into issues with encoding. It may just be because I am a novice, but I see no way to detect an archive’s encoding dynamically (for example with a library like Ude) and then use it to decode entry names into readable strings.

Right now, I’m seeing a number of zip file entries that convert with symbols and gibberish in their paths. It isn’t many at all, but I’d much prefer to find a way to dynamically determine what encoding should be used.

I’ve seen other posts about this, such as #277, and I also see a comment saying it was resolved as of 0.18. But I can find no demonstration of how it was resolved.

Below is the case from my switch that handles zip files, so you can see how I am accessing the archive. I have tried specifying various encodings, but what I really need is a way to determine the correct encoding dynamically, or for someone to point out what I am missing.

case ".zip":
    using (var archive = ZipArchive.Open(file.FullName))
    {
        foreach (var entry in archive.Entries)
        {
            // perform extraction of the entry
        }
    }
    break;
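
For readers wondering what "specifying an encoding" looks like here, the following is a minimal sketch rather than a definitive recipe. It assumes a SharpCompress version where ReaderOptions carries an ArchiveEncoding with a settable Default (the 0.18 and later releases mentioned above). The class and method names (ZipEncodingExample, OptionsFor, ListEntries) are hypothetical, and code page 437 is only an example value.

```csharp
using System;
using System.Text;
using SharpCompress.Archives.Zip;
using SharpCompress.Common;
using SharpCompress.Readers;

public static class ZipEncodingExample
{
    // Hypothetical helper: builds reader options whose default entry-name
    // encoding is the given code page (for example 437).
    public static ReaderOptions OptionsFor(int codePage)
    {
        // On .NET Core / .NET 5+ legacy code pages are only resolvable
        // after this provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return new ReaderOptions
        {
            ArchiveEncoding = new ArchiveEncoding
            {
                Default = Encoding.GetEncoding(codePage)
            }
        };
    }

    // Opens the zip with the chosen encoding and prints each entry name.
    public static void ListEntries(string zipPath, int codePage)
    {
        using (var archive = ZipArchive.Open(zipPath, OptionsFor(codePage)))
        {
            foreach (var entry in archive.Entries)
            {
                Console.WriteLine(entry.Key);
            }
        }
    }
}
```

This only covers the case where the encoding is already known. It does not detect anything by itself.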

Issue Analytics

State:
Created 4 months ago
Comments:6 (1 by maintainers)

Top GitHub Comments

R0315 commented, Aug 15, 2023

@R0315 how did you get CharsetDetector? I have a similar problem where I am trying to extract Tar.tar from the TestArchives. If I don’t provide the ArchiveEncoding as CP437, it extracts with ???.txt as the name. I will create an issue for that, but I just wanted to see if you have had success with the solution you mentioned above.

I used the Ude.NetStandard package.

I ended up taking a route with my project where if the program could tell that files within have encoding issues, it left them in the archive and, in the end, it generates a report that shows you what the issue was. I went and grabbed that archive you mentioned and, indeed, my code is unable to tell what the file encoding should be.

I remember having issues using SharpCompress to read about a KB of a file and test its encoding when working with tar files. I think the problem was that I couldn’t actually access the entry streams to take a sample. However, in my use case, what I had was adequate for most of the issues I was trying to solve. So leaving a few anomalous archives and just generating a report was more than satisfactory, compared to all the time that was being spent extracting manually.

However, the heuristic approach I mentioned does work very well for any archive where you can access the entry streams and grab a chunk of bytes to test. If you can come up with a way to access the tar entry streams, then you might be able to try the same approach.
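
The sampling heuristic described above could look roughly like the sketch below. This is not the exact code from the project discussed in this thread. It assumes the Ude.NetStandard package and its CharsetDetector class, and the names EncodingSampler and GuessEncoding are hypothetical. The result is a guess about the entry's content, which is only a hint about how the entry names were encoded.

```csharp
using System;
using System.IO;
using System.Text;
using Ude;

public static class EncodingSampler
{
    // Reads up to sampleSize bytes from the stream and asks Ude for a guess.
    // Returns null when there is no data, no guess, or a charset name that
    // .NET does not recognise.
    public static Encoding GuessEncoding(Stream stream, int sampleSize = 1024)
    {
        var buffer = new byte[sampleSize];
        int total = 0;
        int read;

        // Stream.Read may return fewer bytes than requested, so loop until
        // the sample is full or the stream ends.
        while (total < buffer.Length &&
               (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }

        if (total == 0)
        {
            return null;
        }

        var detector = new CharsetDetector();
        detector.Feed(buffer, 0, total);
        detector.DataEnd();

        if (detector.Charset == null)
        {
            return null;
        }

        try
        {
            return Encoding.GetEncoding(detector.Charset);
        }
        catch (ArgumentException)
        {
            // Ude reported a charset name that this runtime cannot map.
            return null;
        }
    }
}
```

With SharpCompress, the stream passed in would typically come from entry.OpenEntryStream() on an archive entry, and a null result would be the cue to leave the entry alone and add it to the report.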

R0315 commented, Aug 15, 2023

I have tried your approach and it seems to be working to an extent. While it is not able to identify the encoding as CP437, it is no longer giving ???.txt for the file name. I am also thinking I may read the file names as well, to see if they can be factored into deciding the encoding. But for now, reading 1 KB of the binary is a good start, I guess. Thanks

Awesome, glad it’s helping! Yes, I had a hard time with the encodings. I’m not sure we’ll ever be able to iron out a 100% solution so long as the archiving software itself doesn’t enforce a standard. As I recall, quite a few archivers simply use whatever the user’s system encoding is, and that is the reason for this headache.

That said though, I’m sure what I did can be improved upon or even replaced with a better approach. If you get something going that works better please let me know! I’d be happy to apply it to my project too.
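
One possible way to factor the file names themselves into the decision, as suggested earlier in this thread, is to try a strict UTF-8 decode of the raw name bytes and fall back to a legacy code page only when that fails. The sketch below uses nothing beyond the .NET base library. EntryNameDecoder is a hypothetical name, and using CP437 as the fallback is an assumption, not something the archive tells you. SharpCompress's ArchiveEncoding appears to expose a CustomDecoder delegate with a matching (byte[], int, int) signature, but verify that against the version you use.

```csharp
using System;
using System.Text;

public static class EntryNameDecoder
{
    // UTF-8 decoder that throws on invalid byte sequences instead of
    // silently substituting replacement characters.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true);

    // Assumed fallback: CP437, the traditional zip default.
    private static readonly Encoding Fallback = CreateFallback();

    private static Encoding CreateFallback()
    {
        // Needed on .NET Core / .NET 5+ before code page 437 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437);
    }

    // Decodes raw entry-name bytes: valid UTF-8 wins, anything else is
    // treated as the fallback code page.
    public static string Decode(byte[] bytes, int index, int count)
    {
        try
        {
            return StrictUtf8.GetString(bytes, index, count);
        }
        catch (DecoderFallbackException)
        {
            return Fallback.GetString(bytes, index, count);
        }
    }
}
```

Plain ASCII names decode identically either way, so the choice only matters for names containing bytes above 0x7F. Names written in some other legacy code page will still come out wrong, which is in line with the point above that a 100% solution is unlikely.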