File Magic Numbers
There are certain points in Haystack where we perform file type checks based on file extensions (e.g. .txt, .pdf). As pointed out in #708 by @lalitpagaria, it would be better to do this using file magic numbers. We could implement this in two places:
- When performing file conversion
- When downloading a file using fetch_archive_from_http
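For illustration, here is a minimal sketch of what a magic-number check could look like using only the standard library. It is not Haystack code: the `SIGNATURES` table and the `sniff_bytes` / `sniff_file` helpers are hypothetical names, and the table covers only a handful of well-known signatures. Plain text files have no signature, so a caller would still need to fall back to the extension when nothing matches.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table (assumption for this sketch).
# A real table, or a library wrapping libmagic, would cover far more formats.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",  # also the container format used by .docx
    b"\x1f\x8b": "gzip",
}
MAX_SIG_LEN = max(len(sig) for sig in SIGNATURES)


def sniff_bytes(head: bytes) -> Optional[str]:
    """Return a coarse file type for the leading bytes, or None if unknown."""
    for signature, kind in SIGNATURES.items():
        if head.startswith(signature):
            return kind
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only as many bytes as the longest signature and classify them."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(MAX_SIG_LEN))
```

A caller could try `sniff_file(path)` first and only consult the extension when it returns `None`.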
TODO:
- Implement a double check for docx (e.g. if file_type == "xml" and extension == ".docx")
- Investigate behaviour on compressed files; compare when the uncompressed param is set to True/False
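One possible shape for the docx double check is sketched below. Note that it swaps in a different test from the `file_type == "xml"` comparison in the TODO: it relies on the fact that a .docx file is a ZIP container holding a `word/document.xml` entry, and combines that with the extension. The helper name `looks_like_docx` is hypothetical, and the whole block is an assumption about how such a check might be written, not the approach chosen in this issue.

```python
import io
import zipfile


def looks_like_docx(data: bytes, extension: str) -> bool:
    """Hypothetical double check: extension AND content must both say docx."""
    if extension.lower() != ".docx":
        return False
    # A docx file is a ZIP archive, so it must start with the ZIP signature.
    if not data.startswith(b"PK\x03\x04"):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # The main document part of a Word file lives at this path.
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Right signature, but not a readable archive.
        return False
```

This reads the whole file into memory, which is fine for a sketch but may not suit large downloads.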
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (5 by maintainers)
Ahhh good to know! Thanks
I ran the above experiment again using this python magic object with some new params and actually got much better and more interpretable results.
I think it would be worth going ahead with incorporating this package! @lalitpagaria would you be interested in opening a PR?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.