Not defining the document encoding can be slow when chardet is installed

I have a 9-page PDF document that includes 5 images. These 5 images are embedded inline as base64-encoded sources.

  • When I exclude those images during rendering, the entire pipeline takes about 16 seconds.
  • When I include those images, it takes 49 seconds.

That’s a fairly big performance hit. Are there any tools to optimize this? The PDF is attached; it is not really complex, more or less a test document for our print output.

We are using WeasyPrint through a REST API like this:


from flask import Flask, make_response, request
from weasyprint import CSS, HTML

app = Flask(__name__)


@app.route('/pdf', methods=['POST'])
def generate():
    name = request.args.get('filename', 'unnamed.pdf')
    app.logger.info('POST  /pdf?filename=%s', name)

    # request.data is raw bytes and no encoding is passed here,
    # so the HTML parser has to detect the encoding itself.
    html = HTML(string=request.data)
    document = html.render(stylesheets=[CSS('css/local.css')], presentational_hints=True)
    pdf = document.write_pdf(zoom=0.7936507936507937)

    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline;filename=%s' % name
    app.logger.info(' ==> POST  /pdf?filename=%s  ok', name)
    return response

This whole service runs on a developer machine (Linux desktop) under Docker:

  • Intel® Core™ i7-8565U CPU @ 1.80GHz
  • 32 GB RAM
  • very fast M.2 SSD

I would have expected it to be quicker than that, but it seems those images have a huge impact.
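
One way to see where the extra time goes is to profile the rendering call with the standard library. The following is only a sketch: `profile_call` is a hypothetical helper, not part of WeasyPrint, and the function you pass in would be whatever wraps your own render step.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the slowest calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the call chain that dominates shows up first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

Printing the report for a slow request shows which functions account for most of the cumulative time, which helps tell image handling apart from parsing.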

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 19 (10 by maintainers)

Top GitHub Comments

EugenMayer commented, Aug 5, 2020

Great that we could nail this one down. Since this Docker image is only meant for WeasyPrint, I’m surprised that it has more than the required dependencies. Alpine usually tries to keep packages tiny and slim, but I don’t know where it comes from.

Having this in the docs makes a lot of sense, in both the installation and the usage docs.

Thank you so much for your time!

liZe commented, Aug 5, 2020

Problem solved: chardet is slow for your document.

Chardet is an optional dependency of html5lib that tries to detect a document’s encoding. It’s not slow for me, because it’s not installed, and that’s why I had to use -e utf8 to get the correct rendering. You don’t need the -e option, because chardet is installed on your system and (very slowly) detects the right encoding.
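
A minimal sketch of sidestepping the detection in a service like the one above, assuming the request body is UTF-8 unless the client declares another charset: decode the bytes yourself and hand WeasyPrint a `str`, so the parser should have no encoding left to guess. `decode_html` and `render_pdf` are hypothetical helper names. Alternatively, `HTML` accepts an `encoding` argument, the API counterpart of the `-e` option.

```python
from typing import Optional


def decode_html(body: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode request bytes with a known charset (assumed UTF-8 by default)."""
    return body.decode(declared_charset or "utf-8")


def render_pdf(body: bytes, declared_charset: Optional[str] = None) -> bytes:
    """Render HTML bytes to PDF without relying on encoding detection."""
    # Imported lazily so the decoding helper can be used without WeasyPrint.
    from weasyprint import HTML

    # Passing a str means no byte-level encoding sniffing is needed.
    html = HTML(string=decode_html(body, declared_charset))
    return html.write_pdf()
```

If the bytes are not valid in the assumed charset, `decode` raises `UnicodeDecodeError`, which is usually preferable to a silent wrong guess.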

Well, actually the question may be whether we should document the importance of setting the encoding.

That’s a good question. I really don’t know why chardet is installed on your system, because it’s not a dependency of gunicorn, Flask or WeasyPrint. It’s probably installed as a dependency of your Alpine packages.
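
To check whether chardet is importable in a given environment (for example inside the Docker image), a small standard-library sketch like this could be run; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # If this prints True, html5lib may use chardet for encoding detection.
    print("chardet installed:", is_installed("chardet"))
```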

By the way, the documentation will be rewritten. I can keep this ticket open to add a comment about this.

