Carved images bloated to end even if length or end magic is known
This seems to be a rather annoying bug, as shown by the multiple questions on places like SO and SO RE, and here.
Unfortunately, the problem may be caused by poor maintenance of the magic signatures, which rarely provide proper length or end tags, even when these are available.
The issue was already raised back in #153, but was closed without comment! It doesn’t seem to have been fixed… 👎
# mkdir testcarv
# cd testcarv/
# dd if=/dev/zero of=head bs=1 count=512
# dd if=/dev/zero of=junk bs=1 count=512000
# wget https://www.debian.org/logos/openlogo-100.jpg
# cat head openlogo-100.jpg junk > full
# ls -alh
-rw-rw-rw-+ 1 xxxx xxxx 509K Oct 19 08:20 full
-rw-rw-rw-+ 1 xxxx xxxx 512 Oct 19 08:19 head
-rw-rw-rw-+ 1 xxxx xxxx 500K Oct 19 08:19 junk
-rw-rw-rw-+ 1 xxxx xxxx 8.3K Jun 1 07:50 openlogo-100.jpg
# binwalk -z -C demo -D 'jpeg:jpg' full
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
512 0x200 JPEG image data, JFIF standard 1.02
1074 0x432 JPEG image data, JFIF standard 1.02
# ls -alhR demo/
-rw-rw-rw-+ 1 xxxx xxxx 509K Oct 19 08:20 200.jpg
-rw-rw-rw-+ 1 xxxx xxxx 508K Oct 19 08:20 432.jpg
The question is how we can best address this issue, without just brushing it off and repeatedly closing these same issues without comment? Perhaps by answering questions like:
- Is there an extractor available? (Is it implemented into binwalk?) Where to find it?
- How to create an extractor from binwalk info and magic?
- How to maintain the magic files and clearly identify the ones that:
a. Do and do not have an END section signature or magic.
b. How to implement the difference (in what to do) in binwalk in those cases where they don’t.
c. Raise TODO or HelpNeeded issues, along with a WIKI list, to flag those signatures that need fixing.
d. Distinguish sigs that can be handled by standard package tools like 7zip and gzip.
e. Consider making binwalk warn when using too short magics, like <8 bytes ones.
Just some ideas. 👍 What are your thoughts?
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
I added this to the FAQ
Binwalk’s Approach to Extraction and Data Carving
Since this issue stemmed from issue #153, I’ll address the general problem of data carving first, as that seems to be the root of confusion regarding that issue. I think it’s important to distinguish the difference between “extraction” and “carving” in the context of binwalk.
Carving data out of a file is simply running dd over a selected portion of the file. The carved data requires no additional manipulation. Your JPEG example is a good one; once a JPEG is carved out of a firmware image (or any other file for that matter), you can open the JPEG image in an image viewer/editor without any additional manipulation of the JPEG data.
Extracting data, on the other hand, requires first carving some selected data out of the firmware image, and then manipulating that carved data in order to present it in a useful manner to the user. File systems are a good example here; one can easily dd the raw bytes of a CramFS image out of a firmware file, but it’s much more useful to actually extract the file system’s contents (ELF files, config files, HTML, JPEG, etc.) and display them to the user. This requires carving the CramFS data out of the original file and then running some extraction utility to process the CramFS image and extract its contents.
Binwalk is primarily a signature analysis and extraction utility, specifically focused on firmware. Can it perform general data carving? Yes, but that’s really only there as a necessary means to an end. The --carve command line option can be useful, especially if you want to examine some data for which you have a signature but no extraction utility, but you have to be aware of its limitations. Can I cram 4 people into a Porsche 911 and drive cross country? Sure, but there are probably better vehicles I can use for that. There are many data carving utilities out there that pre-date binwalk and will do a much better job at carving out files like JPEGs, Word docs, etc., if that’s what you want to do.
Why do I mention all this? Because data carving is difficult. Many common file types have no size field in their header, nor do they have any kind of end-of-file marker. Zlib compressed files are a good example of this. So what to do about it?
One option would be to say, “Well, I see that there is a zlib signature at offset 0, and a JPEG signature at offset 512, so the zlib data must only be 512 bytes long.” But what if the JPEG signature is a false positive, and the zlib data is actually 1024 bytes long? Now you get nothing because your carved zlib data is truncated and can’t be properly decompressed, and the JPEG image doesn’t even exist because it was a false positive. And yes, there will always be false positives, so when designing binwalk I opted against this approach. If I understand it correctly, your suggestion “perhaps run binwalk a second pass backwards from EOF up to signature or end tag that matches” also relies on not having false positives, not to mention that it would increase scan time and require what I suspect would be a non-trivial code rewrite.
A second option would be to say, “OK, well let’s write a JPEG analyzer that can examine the suspected JPEG data and validate that it actually loads as a real JPEG image.” Binwalk can, and does, do this for some selected file types, but it would be very time consuming to have to do that for every single signature that binwalk supports. It would also make adding signatures very difficult and time consuming.
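To give a sense of what even a lightweight per-format analyzer involves, here is a minimal sketch that finds the end of a JPEG by walking its marker segments up to the end-of-image marker. It is not how binwalk works: the names jpeg_length and _skip_scan are hypothetical, the test bytes are synthetic (structurally JPEG-like, not a decodable picture), and the walker only checks structure, so it does not prove that the image actually loads.

```python
def _skip_scan(buf, pos):
    """Skip entropy-coded data; return the offset of the next real marker."""
    while True:
        pos = buf.find(b"\xff", pos)
        if pos < 0 or pos + 1 >= len(buf):
            return None                      # ran off the end: truncated
        nxt = buf[pos + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos += 2                         # stuffed 0xFF byte or restart marker
        elif nxt == 0xFF:
            pos += 1                         # fill byte
        else:
            return pos


def jpeg_length(buf, start=0):
    """Length in bytes of the JPEG at `start`, or None if it looks malformed."""
    if buf[start:start + 2] != b"\xff\xd8":  # SOI
        return None
    pos, n = start + 2, len(buf)
    while pos + 1 < n:
        if buf[pos] != 0xFF:
            return None
        marker = buf[pos + 1]
        if marker == 0xFF:                   # fill byte before a marker
            pos += 1
            continue
        if marker == 0xD9:                   # EOI: end of image
            return pos + 2 - start
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            pos += 2                         # markers without a length field
            continue
        if pos + 3 >= n:
            return None
        seg_len = int.from_bytes(buf[pos + 2:pos + 4], "big")
        if seg_len < 2:
            return None
        pos += 2 + seg_len                   # skip the segment, decoys included
        if marker == 0xDA:                   # SOS: compressed data follows
            pos = _skip_scan(buf, pos)
            if pos is None:
                return None
    return None


# Synthetic test data: an APP0 payload containing a decoy FF D9, then a scan.
app0 = b"\xff\xe0" + (8).to_bytes(2, "big") + b"JF\xff\xd9IF"
sos = b"\xff\xda" + (3).to_bytes(2, "big") + b"\x00"
jpeg = b"\xff\xd8" + app0 + sos + b"\x12\xff\x00\x34\xff\xd0\x56" + b"\xff\xd9"
image = b"\x00" * 512 + jpeg + b"\x00" * 1000
print(jpeg_length(image, 512), len(jpeg))
```

Even this small walker has to know about segment lengths, byte stuffing and restart markers, which hints at why repeating the exercise for every supported signature would be costly.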
The option that I chose was to simply say, “If we know the size of the data that we’re carving, then only carve that size. Otherwise, take all the data up to the end of the file and let the extraction utility deal with it.” While this is inefficient in terms of disk space, that is its only real drawback. Most extraction utilities don’t care about trailing data, false positive signatures don’t prevent real data from being extracted, and we don’t have to write code to support every single signature, making adding signatures as simple as editing a text file. Additionally, there are work-arounds to address disk space usage, such as the --size and --rm command line options.
So going back to issue #153: when you manually instruct binwalk to carve out a JPEG file, it’ll do it, but you should expect that any trailing data will be included. Generally, this won’t prevent you from viewing the picture, but the file size may be larger than you expected. This is expected and intentional behavior. Could it possibly be addressed for all signatures? Yes, but I don’t see how that could be accomplished without a significant amount of effort, the benefits of which would be minimal (i.e., it would not appreciably help binwalk’s primary goal of scanning and extracting firmware). The one thing that I did see in issue #153 which was concerning were reports that the carved JPEG files were larger than the total size of the original file from which they were carved. This would certainly be considered a bug, but I have not been able to reproduce this issue, nor do I see from the code how this could happen, nor was anyone willing/able to share files that they claimed produced such behavior.
Aside from that, I felt that my responses to #153 indicated that the observed behavior was expected and intentional, which is why @CoffeeExpress closed the ticket; so I wouldn’t say that ticket was closed “without comment”.
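The trade-off between stopping at the next signature and carving to EOF can be illustrated with a small sketch. It uses Python's zlib module as a stand-in for an extraction utility; the carve helper, the synthetic image layout and the false-positive offset are all assumptions made for the illustration, not binwalk code.

```python
import zlib


def carve(image, offset, size=None):
    """Return `size` bytes at `offset`, or everything up to EOF if size is None."""
    return image[offset:] if size is None else image[offset:offset + size]


payload = bytes(range(256)) * 8
stream = zlib.compress(payload)
# Synthetic image: 512-byte header, the zlib stream, then 1000 bytes of padding.
image = b"\x00" * 512 + stream + b"\x00" * 1000

# Stop at a (false positive) signature that falls inside the stream:
# the carved data is truncated and cannot be decompressed.
false_positive = 512 + len(stream) // 2
try:
    zlib.decompress(carve(image, 512, false_positive - 512))
    truncated_ok = True
except zlib.error:
    truncated_ok = False

# Size unknown, so carve to EOF: decompression still succeeds, and the
# decompressor reports how much trailing data it did not need.
carved = carve(image, 512)
d = zlib.decompressobj()
recovered = d.decompress(carved)
true_length = len(carved) - len(d.unused_data)

print(truncated_ok, recovered == payload, true_length == len(stream))
```

In this sketch the truncated carve fails outright, while the oversized carve decompresses correctly and the real stream length can be recovered afterwards from the unused trailing bytes.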
Stack Exchange Issues
https://reverseengineering.stackexchange.com/questions/13616/simple-carving-of-zip-file-using-binwalk
There are two zip files extracted for this example, because binwalk sees two “end of zip archive” footers. And yes, there is trailing data at the end of the zip file, but all the zip file contents are correctly extracted. I understand how this is puzzling if you’re not familiar with binwalk’s internals, but IMHO this is a non-issue as binwalk has done its job of signature scanning and extraction.
https://reverseengineering.stackexchange.com/questions/11791/unpack-ipcam-firmware-binwalk-extraction-issue?rq=1
AFAICT, the issue here wasn’t that binwalk included trailing data in the carved zip file, but rather that the extraction of the zip file contents failed. This was actually an issue with the extraction utilities and not binwalk per se (and it’s already been fixed). Because of binwalk’s policy of carve-everything-up-to-eof, all the zip data was available to the extraction utilities, but most extraction utilities choked, claiming the zip file had “no end of central directory structure” (such as unzip), or worse, reported that they extracted everything just fine when really they only extracted one file from the zip archive (such as p7zip). Java’s jar utility extracts these zip archives with no issue, and that is what binwalk now uses by default for zip extraction.
Answering Your Specific Questions
It’s pretty easy to see which file types are supported for extraction. The binwalk extraction rules are in extract.conf (https://github.com/ReFirmLabs/binwalk/blob/master/src/binwalk/config/extract.conf). If it’s not listed there, then binwalk -e won’t extract it (unless of course you’ve created your own custom extract.conf or specified an extractor on the command line with --dd). It would probably be worthwhile to create a wiki page listing the extraction rules.
Now, answering the other question, “is there an extractor available for file type X?”, can be more difficult, but a Google search can usually answer it. If you’re specifically asking about JPEG files, the answer is no, because JPEGs don’t need to be “extracted” (per my definition of extraction vs carving, above).
If binwalk has a magic signature for a file type, you just need a command line utility that knows how to extract your file type. You can specify this extractor on the command line using the --dd option that you’re already using, or by creating your own extract.conf rule (the extract.conf rules use the same format as the --dd option).
I’ll say up front that these are all great suggestions that will probably never happen because I simply don’t have the time to devote to them… but I’ll address them here anyway.
I’d also include signatures that have a SIZE field (most file types that I’ve dealt with anyway are more likely to have a size field in their header, rather than an EOF signature).
There are currently two options for signatures of unknown sizes: let binwalk extract everything up to the EOF, which is obviously the default, or write a plugin to specifically handle that signature.
This would probably go hand-in-hand with item a above.
I think most signatures in binwalk, if not all, which can be handled by standard utilities, already are. It would be good to go through and verify this though.
This is easier said than done. First, most magics are less than 8 bytes, so you’d get warnings all over the place. Second, some signatures have inherently short magic bytes (such as 2 bytes or less), but produce practically zero false positives through the use of the {invalid} tag inside the signature itself, and/or binwalk plugins that perform more detailed analysis of the signature. I think a better solution would be to examine any signatures that are added to binwalk (either binwalk proper, or your own personal signature file(s)) and test them for the probability of false positives prior to integrating them into binwalk.
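As a rough first estimate of that probability, one can compute how many chance matches a magic of a given length would produce in random data. The sketch below assumes uniformly random, independent bytes, which real firmware is not, so it shows only an order of magnitude; expected_false_positives is a hypothetical helper, not part of binwalk.

```python
def expected_false_positives(magic_len, file_size):
    """Expected chance matches of one fixed magic in uniformly random bytes."""
    positions = max(file_size - magic_len + 1, 0)  # offsets where it could start
    return positions / 256 ** magic_len


# Compare a few magic lengths against a 16 MiB image.
size = 16 * 1024 * 1024
for n in (2, 4, 8):
    print(n, expected_false_positives(n, size))
```

Under this assumption a bare 2-byte magic would match by chance roughly 256 times in a 16 MiB image, while a 4-byte magic would rarely match at all. That is consistent with short magics needing extra validation such as the {invalid} tag or a plugin, rather than a blanket warning.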