File Magic Numbers

At several points, Haystack checks file types based on file extensions (e.g. .txt, .pdf). As pointed out in #708 by @lalitpagaria, it would be more reliable to check file magic numbers instead, since an extension can be missing or wrong. We could implement this in two places:

  • When performing file conversion
  • When downloading a file using fetch_archive_from_http
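
As a rough illustration of the idea (not Haystack's implementation), a magic number check can be sketched with the standard library alone by comparing a file's leading bytes against known signatures. The names `SIGNATURES`, `sniff_file_type` and `sniff_path` are hypothetical, and the table is deliberately tiny; the signatures themselves are the well-known ones for PDF, gzip and zip.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table: (leading bytes, label).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),  # also the container format behind .docx
]


def sniff_file_type(head: bytes) -> Optional[str]:
    """Return a coarse type label based on the leading bytes, or None if unknown."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return None


def sniff_path(path: str) -> Optional[str]:
    """Read only the first few bytes of the file at `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_file_type(fh.read(8))
```

Plain text has no magic number, so a `None` result would still need a fallback such as the extension or an encoding check.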

TODO:

  • Implement a double check for docx (e.g. if file_type == "xml" and extension == ".docx")
  • Investigate behaviour on compressed files and compare the results when the uncompress param is set to True/False
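
One possible shape for the docx double check, sketched under assumptions: since a docx file is a zip archive of XML parts, the check could pair the zip magic number with a look at the archive's member names. The function name `looks_like_docx` is hypothetical, and requiring `word/document.xml` reflects the typical layout of a docx file rather than a guarantee.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Heuristic: a zip archive that holds the parts a typical .docx contains."""
    # Step 1: the zip local-file-header magic number.
    if not data.startswith(b"PK\x03\x04"):
        return False
    # Step 2: look at the member names inside the archive.
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return False
    # Assumption: the main document part lives at its usual location.
    return "[Content_Types].xml" in names and "word/document.xml" in names
```

A plain zip archive passes step 1 but fails step 2, which is the distinction the extension alone cannot provide.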

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

brandenchan commented, Jan 12, 2021

> Actually docx is nothing but compressed collection of xml files. https://docs.fileformat.com/word-processing/docx/

Ahhh good to know! Thanks

I ran the above experiment again using this python-magic object with some new params and got much better, more interpretable results.

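import magic  # python-magic, a wrapper around libmagic

# Return textual descriptions (not MIME types) and do not look inside compressed files.
f = magic.Magic(mime=False, uncompress=False)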

"""
Results

bert
PDF document, version 1.5

classics
UTF-8 Unicode text, with very long lines

everything.tar.gz
gzip compressed data, from Unix, original size modulo 2^32 189440000 gzip compressed data, reserved method, ASCII, has CRC, extra field, has comment, encrypted, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 189440000

heavy_metal
Microsoft OOXML

nq-dev-small.json
ASCII text, with very long lines, with no line terminators

nq-dev-small.json.gz
gzip compressed data, from Unix, original size modulo 2^32 109076480

nq-dev-small.zip
Zip archive data, at least v2.0 to extract

test_file_magic.py
Python script, ASCII text executable
"""

I think it would be worth going ahead with incorporating this package! @lalitpagaria would you be interested in opening a PR?
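
To make the compressed-file question concrete, here is a standard-library sketch that reports either the outer container only or the type of the payload inside a gzip stream. It is only loosely analogous in spirit to the `uncompress` flag used above and does not reproduce python-magic's behaviour; the function `describe` and its `look_inside` parameter are hypothetical.

```python
import gzip
import io


def describe(data: bytes, look_inside: bool = False) -> str:
    """Label data by magic number; optionally peek inside a gzip stream."""
    if data.startswith(b"\x1f\x8b"):
        if not look_inside:
            return "gzip"
        # Decompress only a small prefix, enough to classify the payload.
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
            inner = fh.read(8)
        return "gzip containing " + describe(inner)
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"
```

With `look_inside=False` a file like nq-dev-small.json.gz would only ever be reported as gzip, so a converter that needs the inner type has to opt in to decompression.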

stale[bot] commented, May 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.
