Determining Encoding of an Archive
I am hoping someone can either help me with this problem or raise it as an issue. I’m working on a small tool that recursively extracts nested archives until no archives remain, but I’m running into encoding issues. It may just be because I am a novice, but I see no way to detect the encoding dynamically (with another library such as Ude, for example) and then produce a readable string from it.
Right now, a number of zip file entries come out with symbols and gibberish in their paths. It isn’t many at all, but I’d much prefer to find a way to dynamically determine which encoding should be used.
I’ve seen other posts about this, such as #277, and I also see a comment saying it is resolved as of 0.18. But I can find no demonstration of how it was resolved.
Below is a snippet of my switch, with the case that handles zip files, so you can see how I am accessing the archive. I have tried specifying various encodings, but what I really need is a way to dynamically determine the correct encoding, or for someone to point out whatever it is that I am missing.
case ".zip":
    using (var archive = ZipArchive.Open(file.FullName))
    {
        foreach (var entry in archive.Entries)
        {
            // perform extraction of the entry
        }
    }
    break;
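One way to approach the "dynamic" part of the question is to decode each entry name as strict UTF-8 and fall back to a legacy encoding only when the bytes are not valid UTF-8. The sketch below uses only the .NET base class library. `EntryNameDecoder`, `Decode`, and the `fallback` parameter are hypothetical names chosen for illustration, not part of SharpCompress, and the choice of fallback encoding is left to the caller.

```csharp
using System;
using System.Text;

public static class EntryNameDecoder
{
    // Strict UTF-8: throws on invalid byte sequences instead of
    // silently substituting the replacement character.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: decode raw name bytes as UTF-8 when they are
    // valid UTF-8, otherwise decode them with the caller's fallback encoding.
    public static string Decode(byte[] bytes, int index, int count, Encoding fallback)
    {
        try
        {
            return StrictUtf8.GetString(bytes, index, count);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so assume the legacy encoding supplied by the caller.
            return fallback.GetString(bytes, index, count);
        }
    }
}
```

The `(byte[], int, int) -> string` shape is deliberate: as far as I know, SharpCompress accepts a delegate of that shape through `ReaderOptions.ArchiveEncoding` (a `CustomDecoder` property), but treat that as an assumption and check it against the version you have installed. This is a heuristic, not a guarantee: text in a legacy code page is rarely valid UTF-8 by accident, but the fallback encoding itself is still a guess.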
I used the Ude.NetStandard package.
I ended up taking a route with my project where if the program could tell that files within have encoding issues, it left them in the archive and, in the end, it generates a report that shows you what the issue was. I went and grabbed that archive you mentioned and, indeed, my code is unable to tell what the file encoding should be.
I remember having issues using SharpCompress to read a KB or so of each file and test its encoding when working with tar files. I think the problem was that I couldn’t actually access the entry streams inside the archive to take a sample. However, in my use case, what I had was adequate to account for a lot of the issues I was trying to solve. So, leaving a few anomalous archives and just generating a report was more than satisfactory compared to all the time that was being spent extracting manually.
However, the heuristic approach I mentioned does work very well for any archive where you can access the entry streams and grab a chunk of bytes to test. If you can come up with a way to access the tar entry streams, then you might be able to try the same approach.
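The "grab a chunk of bytes" step can be sketched as follows, assuming you already have a readable `Stream` for the entry. `EntrySampler` and `ReadSample` are hypothetical names, and the 1024-byte default simply mirrors the "a KB" mentioned above. This is not the code from my project, just a minimal illustration.

```csharp
using System;
using System.IO;

public static class EntrySampler
{
    // Hypothetical helper: read up to maxBytes from the start of a stream.
    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    public static byte[] ReadSample(Stream stream, int maxBytes = 1024)
    {
        var buffer = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes)
        {
            int read = stream.Read(buffer, total, maxBytes - total);
            if (read == 0)
            {
                break; // end of stream reached before the buffer filled
            }
            total += read;
        }
        Array.Resize(ref buffer, total); // trim to the bytes actually read
        return buffer;
    }
}
```

The returned sample is what you would hand to a detector. With Ude, that would typically mean feeding the bytes to a `CharsetDetector` (`Feed`, then `DataEnd`, then reading `Charset`), and treating a null result as "could not tell" so the entry can go into the report.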
Awesome, glad it’s helping! Yes, I had a hard time with the encodings. I’m not sure if we’ll really be able to iron out a 100% solution so long as the archiving software itself doesn’t enforce a standard. As I recall, I learned that quite a few of them allow for just using whatever the system encoding of the user is–and that is the reason for this headache.
That said though, I’m sure what I did can be improved upon or even replaced with a better approach. If you get something going that works better please let me know! I’d be happy to apply it to my project too.