Not defining the document encoding can be slow when chardet is installed

I have a 9-page PDF document that includes 5 images, all embedded inline as base64-encoded sources.

When I exclude those images during rendering, the entire pipeline takes about 16 seconds.
When I include them, it takes 49 seconds.

That’s a fairly big performance hit. Are there any tools to optimize this? The PDF is attached; it is not really complex, it is more or less a test document for our printing.

We are using WeasyPrint behind a REST API like this:


from flask import Flask, make_response, request
from weasyprint import CSS, HTML

app = Flask(__name__)


@app.route('/pdf', methods=['POST'])
def generate():
    name = request.args.get('filename', 'unnamed.pdf')
    app.logger.info('POST  /pdf?filename=%s' % name)

    # request.data is raw bytes and no encoding is passed here,
    # so the HTML parser has to work out the encoding on its own.
    html = HTML(string=request.data)
    document = html.render(stylesheets=[CSS('css/local.css')], presentational_hints=True)
    pdf = document.write_pdf(zoom=0.7936507936507937)

    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline;filename=%s' % name
    app.logger.info(' ==> POST  /pdf?filename=%s  ok' % name)
    return response

This whole service runs on a developer machine (Linux desktop) under Docker:

Intel® Core™ i7-8565U CPU @ 1.80GHz
32 GB RAM
very fast M.2 SSD

I would have expected it to be quicker than that, but it seems those images have a huge impact.
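
Before attributing the slowdown to the images, it can help to profile the call and see which functions dominate. The following is a minimal sketch using only the standard library's cProfile module; `profile_call` is a hypothetical helper name and is not part of WeasyPrint or Flask.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the `top` most expensive entries, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace with the call you want to inspect.
    total, report = profile_call(sum, range(1_000_000))
    print(report)
```

In the handler above, wrapping the parsing step, for example `profile_call(HTML, string=request.data)`, would show whether the time is spent on image handling or somewhere else in the pipeline.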

Top GitHub Comments

EugenMayer commented, Aug 5, 2020

Great that we could nail this one down. Since this Docker image is only meant for WeasyPrint, I’m surprised that it has more than the required dependencies. Alpine usually tries to keep packages tiny and slim, but I don’t know where it comes from.

Having this in the docs makes a lot of sense, in both the installation and the usage docs.

Thank you so much for your time!

liZe commented, Aug 5, 2020

Problem solved: chardet is slow for your document.

Chardet is an optional dependency of html5lib that tries to detect a document encoding. It’s not slow for me, because it’s not installed, and that’s why I had to use -e utf8 to get the correct rendering. You don’t need the -e option, because chardet is installed on your system and (very slowly) detects the right encoding.
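
In the Flask service, the counterpart of `-e utf8` is to tell the parser the encoding instead of letting it guess. WeasyPrint's `HTML` class accepts an `encoding` argument for this, as in `HTML(string=request.data, encoding='utf8')`. An alternative, sketched below with the standard library only, is to decode the request body to `str` using the charset declared in the `Content-Type` header. `decode_body` is a hypothetical helper, and falling back to UTF-8 is an assumption about what the client sends.

```python
from email.message import Message


def decode_body(body, content_type=None, default="utf-8"):
    """Decode raw request bytes using the charset declared in Content-Type."""
    charset = None
    if content_type:
        # Reuse the stdlib header parser to extract the charset parameter.
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()  # None if not declared
    try:
        return body.decode(charset or default)
    except LookupError:
        # Unknown charset label: fall back to the default (an assumption).
        return body.decode(default)


if __name__ == "__main__":
    raw = "<p>caf\u00e9</p>".encode("utf-8")
    print(decode_body(raw, "text/html; charset=utf-8"))
```

With Flask this could be used as `HTML(string=decode_body(request.data, request.content_type))`. The assumption here is that a source that is already a `str` needs no encoding detection, so chardet is never consulted.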

Well, actually the question may be whether we should document the importance of setting the encoding.

That’s a good question. I really don’t know why chardet is installed on your system, because it’s not a dependency of gunicorn, flask or WeasyPrint. It’s probably installed as a dependency of your Alpine packages.
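
To check this on a given system, the following sketch reports whether chardet is importable and which installed Python distributions declare it as a requirement. It uses only the standard library; `is_installed` and `dependents` are hypothetical helper names. It only reads Python packaging metadata, so a package pulled in purely by the system package manager without such metadata would not show up.

```python
import importlib.metadata
import importlib.util
import re


def normalize(name):
    """Normalize a distribution name for comparison (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def dependents(package):
    """List installed distributions that declare `package` as a requirement."""
    wanted = normalize(package)
    found = set()
    for dist in importlib.metadata.distributions():
        for requirement in dist.requires or []:
            # Requirement strings look like "chardet>=3.0.2; extra == 'x'".
            match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
            name = dist.metadata["Name"]
            if match and name and normalize(match.group(0)) == wanted:
                found.add(name)
                break
    return sorted(found)


if __name__ == "__main__":
    print("chardet importable:", is_installed("chardet"))
    print("required by:", dependents("chardet"))
```

An optional requirement (an "extra") is listed as well, so a name in the output means the distribution can use chardet, not necessarily that it caused the installation.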

By the way, the documentation will be rewritten. I can keep this ticket open to add a comment about this.
