EXIF parsing causing memory explosion
Thumbor request URL
The request url for the issue you are having, you can swap the host name with a fake
http://thumbor-host.com/unsafe/1400x0/filters:no_upscale()/file-host.com/GettyImages_539746540.jpg
Here is the image with corrupted/invalid EXIF data used in the example request GettyImages_539746540.jpg.zip
Expected behaviour
An image should be returned.
Actual behaviour
No response is received; thumbor crashes after running out of memory. A reverse proxy in front of thumbor may respond with a timeout code if that happens first.
Operating system
Ubuntu
Your thumbor.conf
## IMAGE PROCESSING ##
ENGINE = 'thumbor.engines.pil'
DETECTORS = [
'thumbor.detectors.face_detector',
'thumbor.detectors.feature_detector'
]
FILTERS = [
'thumbor.filters.blur',
'thumbor.filters.colorize',
'thumbor.filters.extract_focal',
'thumbor.filters.format',
'thumbor.filters.focal',
'thumbor.filters.no_upscale',
'thumbor.filters.quality',
'thumbor.filters.saturation',
'thumbor.filters.fill',
]
OPTIMIZERS = [
'thumbor.optimizers.jpegtran',
'thumbor.optimizers.gifv',
'thumbor_plugins.optimizers.autojpeg',
]
JPEGTRAN_PATH = '/usr/bin/jpegtran'
FFMPEG_PATH = '/usr/bin/ffmpeg'
ALLOW_ANIMATED_GIFS = True
USE_GIFSICLE_ENGINE = True
PRESERVE_EXIF_INFO = False
AUTO_WEBP = True
QUALITY = 80
WEBP_QUALITY = 80
AUTOJPEG_QUALITY='90'
AUTOJPEG_SUBSAMPLING='0'
GC_INTERVAL=60
ENGINE_THREADPOOL_SIZE=12
I already know what is causing the issue. As documented at https://github.com/hMatoba/Piexif/issues/90, the piexif library has a bug that causes the memory explosion. Both piexif and the older pexif library seem to have this bug.
We aren’t even using RESPECT_ORIENTATION with our cluster, but I found that thumbor parses the EXIF data regardless of that setting, even though the result is only used when the setting is enabled. It’s probably a bad idea to parse the EXIF data when it isn’t needed, since parsing takes time. But disabling EXIF parsing is also not a good way to work around this problem (though that is exactly what we have done as an interim patch).
I am also a bit perplexed as to why we are using the piexif library at all. The pillow library can very easily return the EXIF data. The example below doesn’t have orientation data, but it would be under key 274 if it existed.
from PIL import Image
im = Image.open("GettyImages_539746540.jpg")
im._getexif()
/usr/local/lib/python2.7/site-packages/PIL/TiffImagePlugin.py:768: UserWarning: Possibly corrupt EXIF data. Expecting to read 34225520648 bytes but only got 104. Skipping tag 33437
" Skipping tag %s" % (size, len(data), tag))
/usr/local/lib/python2.7/site-packages/PIL/TiffImagePlugin.py:768: UserWarning: Possibly corrupt EXIF data. Expecting to read 33685506 bytes but only got 0. Skipping tag 34850
" Skipping tag %s" % (size, len(data), tag))
{36864: '0221', 37377: (9965784, 1000000), 37378: (4970854, 1000000), 36867: u'2008:02:21 14:18:14', 36868: u'2008:02:21 14:18:14', 37381: (3, 1), 41990: 0, 37383: 6, 37385: 16, 37386: (300, 1), 41986: 0, 270: u'Austin, TX February 21, 2008: Supporters of candidates Hillary Clinton and Barack Obama line up outside the Rec Sports Center at the Univeristy of Texas at Austin hours prior to the debate between the Democratic candidates Thursday.', 271: u'Canon', 272: u'Canon EOS 5D', 41987: 0, 33432: u'Bob Daemmrich Photography, Inc.', 37380: (-1, 3), 282: (300, 1), 283: (300, 1), 33434: (1, 1000), 34855: 160, 296: 2, 306: u'2008:02:21 14:46:13', 315: u'Bob Daemmrich', 41985: 0, 41486: (4368000, 1415), 41487: (2912000, 942), 41488: 2, 34665: 470}
^ As you can see, it encounters the same bad EXIF data but it doesn’t OOM any servers which is a nice feature : D
I think we should:
- Use pillow to extract EXIF orientation data (it’s no more cryptic than piexif)
- Remove piexif as a dependency
- Stop parsing EXIF data unless it might be used in some later operation (reorientation)
^ I’m happy to code this up if folks feel like that is a good solution. I’d like to know what people think.
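To make the proposal concrete, here is a minimal sketch of the three points above. It is not thumbor's actual code: the function name read_orientation and the respect_orientation parameter are hypothetical and only stand in for the RESPECT_ORIENTATION setting. The image argument is assumed to be an already opened Pillow image, and getexif() is the public Pillow accessor (Pillow 6.0 and later) rather than the private _getexif() used in the session above.

```python
from typing import Optional

ORIENTATION_TAG = 0x0112  # EXIF tag 274, the key mentioned above


def read_orientation(image, respect_orientation: bool) -> Optional[int]:
    """Return the EXIF orientation (1-8), or None if unused, absent or unreadable.

    `image` is assumed to be a PIL.Image.Image; the function name and the
    `respect_orientation` flag are hypothetical, not part of thumbor.
    """
    if not respect_orientation:
        # Skip EXIF parsing entirely when the result would be discarded.
        return None
    try:
        exif = image.getexif()  # public Pillow API, available since 6.0
    except Exception:
        # Treat unparseable EXIF as "no orientation" instead of failing the request.
        return None
    value = exif.get(ORIENTATION_TAG)
    if isinstance(value, int) and 1 <= value <= 8:
        return value
    return None
```

With Pillow installed, a call would look like read_orientation(Image.open(path), respect_orientation=True). Returning None for out-of-range or missing values lets the caller treat "no reorientation needed" and "bad EXIF" the same way, which seems like the safer default for a resizing service.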
pyexiv2 is already an optional dependency for additional metadata: https://github.com/thumbor/thumbor/blob/bf92bb033507cb1df7facb8a651fcf6ada22e942/thumbor/engines/__init__.py#L20. It is documented at https://github.com/thumbor/thumbor/blob/b4096a957b8ea2420ea2d91818129dbc482c1ca6/docs/metadata.rst
The solution to this issue is in a pending pull request: https://github.com/thumbor/thumbor/pull/1210