`computeTruncatedLength` may cause a 20% table scan slowdown for Raptor (or Hive)
See original GitHub issue

I've been looking into Raptor performance recently. It turns out `computeTruncatedLength` may eat 20% of CPU (from my local benchmark). This function is a sanity check that makes sure Unicode codepoints are valid and truncates the value otherwise. Though it is necessary, I'm not sure whether there is another way to avoid such high overhead.
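To make the cost concrete, here is a minimal sketch of the kind of work a `computeTruncatedLength`-style check has to do: walk a value byte by byte, counting UTF-8 codepoints, and return the byte offset where the value must be cut to stay within a codepoint limit. This is an illustrative stand-in assuming valid UTF-8 input, not the actual Presto implementation in `Varchars.java`; the per-byte branch in the loop is what adds up across 150M rows.

```java
import java.nio.charset.StandardCharsets;

public class TruncateSketch {
    // Hypothetical sketch of a computeTruncatedLength-style check.
    // Assumes the input is well-formed UTF-8 (lead bytes determine width).
    static int computeTruncatedLength(byte[] bytes, int offset, int length, int maxCodePoints) {
        int codePoints = 0;
        int position = offset;
        int end = offset + length;
        while (position < end && codePoints < maxCodePoints) {
            int b = bytes[position] & 0xFF;
            // Advance by the width of this UTF-8 sequence (1-4 bytes).
            if (b < 0x80) {
                position += 1;      // ASCII
            }
            else if (b < 0xE0) {
                position += 2;      // 2-byte sequence
            }
            else if (b < 0xF0) {
                position += 3;      // 3-byte sequence
            }
            else {
                position += 4;      // 4-byte sequence
            }
            codePoints++;
        }
        return Math.min(position, end) - offset;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(computeTruncatedLength(ascii, 0, ascii.length, 3)); // 3

        byte[] multibyte = "héllo".getBytes(StandardCharsets.UTF_8); // 'é' takes 2 bytes
        System.out.println(computeTruncatedLength(multibyte, 0, multibyte.length, 3)); // 4
    }
}
```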
Benchmark: https://github.com/highker/presto/commit/4b74a468f3b0d60799603a551e6d8fe0eb7b531b
Results:
without computeTruncatedLength: 569.7778528263376 MB/s
with computeTruncatedLength: 441.1912459425512 MB/s
The table I used for this benchmark (ORC format, single file, 7.5 GB, 150M rows) is a materialized TPC-H table with a varchar column.
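For readers reproducing the numbers, an MB/s figure like the ones above can be derived by running the checked scan over a buffer repeatedly and dividing bytes processed by elapsed time. The sketch below is a crude stand-in for the linked JMH benchmark (names and payload are made up); a real measurement should use JMH to handle warmup and dead-code elimination properly.

```java
import java.nio.charset.StandardCharsets;

public class ThroughputSketch {
    // Count UTF-8 codepoints in a buffer; valid UTF-8 assumed.
    static int countCodePoints(byte[] bytes) {
        int codePoints = 0;
        for (int i = 0; i < bytes.length; ) {
            int b = bytes[i] & 0xFF;
            i += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            codePoints++;
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // Illustrative payload, not the 7.5 GB TPC-H table from the issue.
        byte[] data = "some varchar payload ".repeat(1 << 16).getBytes(StandardCharsets.UTF_8);
        int iterations = 200;
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += countCodePoints(data); // consume the result so the JIT keeps the work
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbPerSec = (double) data.length * iterations / (1 << 20) / seconds;
        System.out.printf("%d codepoints total, ~%.1f MB/s%n", sink, mbPerSec);
    }
}
```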
Issue Analytics: created 5 years ago · 5 comments (5 by maintainers)
Top GitHub Comments
Does the benchmark contain any data with multibyte characters? If not, I would expect the whole thing to generate assembly with just a validation of that assumption and no real byte counting, thanks to this check: https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/type/Varchars.java#L83. If that is not happening, I'd look into restructuring the code so that the inlining happens, or hoist that check closer to the main loop so that the common path avoids the call entirely.
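The fast path the comment points at rests on a simple invariant: a UTF-8 codepoint occupies at least one byte, so if a value's byte length is already within the codepoint limit, it cannot need truncation and the per-byte walk can be skipped. A hedged sketch of that shape (names are illustrative, not the actual `Varchars.java` code):

```java
import java.nio.charset.StandardCharsets;

public class FastPathSketch {
    // Illustrative sketch of hoisting a cheap check ahead of the codepoint walk.
    static int truncationPoint(byte[] bytes, int maxCodePoints) {
        // Fast path: every codepoint is >= 1 byte, so a value with
        // byteLength <= maxCodePoints can never exceed the limit.
        // All-ASCII / short values (the common case) never enter the loop.
        if (bytes.length <= maxCodePoints) {
            return bytes.length;
        }
        // Slow path: count codepoints byte by byte (valid UTF-8 assumed).
        int codePoints = 0;
        int position = 0;
        while (position < bytes.length && codePoints < maxCodePoints) {
            int b = bytes[position] & 0xFF;
            position += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            codePoints++;
        }
        return Math.min(position, bytes.length);
    }

    public static void main(String[] args) {
        byte[] value = "hello".getBytes(StandardCharsets.UTF_8);
        // 5 bytes <= limit of 10 codepoints: fast path, no scan at all.
        System.out.println(truncationPoint(value, 10)); // 5
    }
}
```

Keeping the fast-path branch small also makes it a better inlining candidate for the JIT, which is the restructuring the comment suggests if the call is not being inlined today.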
Aha~ works very well!