Not defining the document encoding can be slow when chardet is installed
I have a 9-page PDF document that includes 5 images. Those 5 images are embedded inline as base64-encoded sources.
- When I exclude those images during rendering, the entire pipeline takes about 16 seconds.
- When I include those images, it takes 49 seconds.
That's a fairly big performance hit. Are there any tools to optimize this? The PDF is attached; it is not really complex, it is more or less a test document for our print output.
We are running WeasyPrint behind a REST API like this:

from flask import Flask, make_response, request
from weasyprint import CSS, HTML

app = Flask(__name__)

@app.route('/pdf', methods=['POST'])
def generate():
    name = request.args.get('filename', 'unnamed.pdf')
    app.logger.info('POST /pdf?filename=%s', name)
    # request.data is raw bytes and no encoding is declared here,
    # so the HTML parser has to work out the encoding on its own.
    html = HTML(string=request.data)
    document = html.render(stylesheets=[CSS('css/local.css')], presentational_hints=True)
    pdf = document.write_pdf(zoom=0.7936507936507937)
    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline;filename=%s' % name
    app.logger.info(' ==> POST /pdf?filename=%s ok', name)
    return response

This whole service runs on a developer machine (Linux desktop) under Docker:
- Intel® Core™ i7-8565U CPU @ 1.80GHz
- 32 GB RAM
- very fast M.2 SSD
I would have expected it to be quicker than that, but it seems those images have a huge impact.
Issue Analytics
- State:
- Created 3 years ago
- Comments:19 (10 by maintainers)
Great that we could nail this one down. Since this Docker image is only meant for WeasyPrint, I'm surprised that it has more than the required dependencies. Alpine usually tries to keep its packages tiny and slim, but I don't know where chardet comes from.
Having this in the docs makes a lot of sense, in both the installation and the usage docs.
Thank you so much for your time!
Problem solved: chardet is slow for your document.
Chardet is an optional dependency of html5lib that tries to detect a document's encoding. It's not slow for me because it's not installed, and that's why I had to use the -e utf8 option to get the correct rendering. You don't need the -e option, because chardet is installed on your system and (very slowly) detects the right encoding.
That's a good question. I really don't know why chardet is installed on your system, because it's not a dependency of gunicorn, Flask or WeasyPrint. It's probably installed as a dependency of your Alpine packages.
By the way, the documentation will be rewritten. I can keep this ticket open to add a comment about this.
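For anyone landing here with the same setup: one way to avoid the detection step entirely is to decode the request body yourself before handing it to WeasyPrint, so the parser receives text rather than raw bytes. The sketch below is only an illustration under assumptions: the helper names charset_from_content_type and decode_html_body are made up for this example, and it assumes the client either sends a charset in the Content-Type header or posts UTF-8.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset declared in a Content-Type header value, or default."""
    if not content_type:
        return default
    # The stdlib email parser understands "type/subtype; charset=..." values.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_html_body(body, content_type=""):
    """Decode raw request bytes to str using the declared (or default) charset."""
    return body.decode(charset_from_content_type(content_type))
```

In the handler from the question this would be used roughly as HTML(string=decode_html_body(request.data, request.headers.get('Content-Type', ''))). As far as I understand it, encoding detection is only needed for byte input, so a decoded string should not trigger chardet at all. Note that an unknown charset name raises LookupError and undecodable bytes raise UnicodeDecodeError, which a real service would want to turn into a 400 response.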