`computeTruncatedLength` may cause a 20% table scan slowdown for Raptor (or Hive)
See original GitHub issue

I've been looking into Raptor performance recently. It turns out `computeTruncatedLength` may eat 20% of CPU (from my local benchmark). This function is a sanity check that makes sure Unicode codepoints are valid and truncates the value otherwise. Though it is necessary, I'm not sure whether there is another way to avoid such high overhead.
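To make the cost concrete, here is a minimal sketch of the kind of work a `computeTruncatedLength`-style check has to do: walk a value byte by byte, counting UTF-8 codepoints, and return the byte offset where the value must be cut to stay within a codepoint limit. This is an illustrative stand-in assuming valid UTF-8 input, not the actual Presto implementation in `Varchars.java`; the per-byte branch in the loop is what adds up across 150M rows.

```java
import java.nio.charset.StandardCharsets;

public class TruncateSketch {
    // Hypothetical sketch of a computeTruncatedLength-style check.
    // Assumes the input is well-formed UTF-8 (lead bytes determine width).
    static int computeTruncatedLength(byte[] bytes, int offset, int length, int maxCodePoints) {
        int codePoints = 0;
        int position = offset;
        int end = offset + length;
        while (position < end && codePoints < maxCodePoints) {
            int b = bytes[position] & 0xFF;
            // Advance by the width of this UTF-8 sequence (1-4 bytes).
            if (b < 0x80) {
                position += 1;      // ASCII
            }
            else if (b < 0xE0) {
                position += 2;      // 2-byte sequence
            }
            else if (b < 0xF0) {
                position += 3;      // 3-byte sequence
            }
            else {
                position += 4;      // 4-byte sequence
            }
            codePoints++;
        }
        return Math.min(position, end) - offset;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(computeTruncatedLength(ascii, 0, ascii.length, 3)); // 3

        byte[] multibyte = "héllo".getBytes(StandardCharsets.UTF_8); // 'é' takes 2 bytes
        System.out.println(computeTruncatedLength(multibyte, 0, multibyte.length, 3)); // 4
    }
}
```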
Benchmark: https://github.com/highker/presto/commit/4b74a468f3b0d60799603a551e6d8fe0eb7b531b
Results:
without computeTruncatedLength: 569.7778528263376 MB/s
with computeTruncatedLength: 441.1912459425512 MB/s
The table I used for this benchmark (ORC format, single file, 7.5 GB, 150M rows) is a materialized TPC-H table with a varchar column.
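For readers reproducing the numbers, an MB/s figure like the ones above can be derived by running the checked scan over a buffer repeatedly and dividing bytes processed by elapsed time. The sketch below is a crude stand-in for the linked JMH benchmark (names and payload are made up); a real measurement should use JMH to handle warmup and dead-code elimination properly.

```java
import java.nio.charset.StandardCharsets;

public class ThroughputSketch {
    // Count UTF-8 codepoints in a buffer; valid UTF-8 assumed.
    static int countCodePoints(byte[] bytes) {
        int codePoints = 0;
        for (int i = 0; i < bytes.length; ) {
            int b = bytes[i] & 0xFF;
            i += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            codePoints++;
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // Illustrative payload, not the 7.5 GB TPC-H table from the issue.
        byte[] data = "some varchar payload ".repeat(1 << 16).getBytes(StandardCharsets.UTF_8);
        int iterations = 200;
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += countCodePoints(data); // consume the result so the JIT keeps the work
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbPerSec = (double) data.length * iterations / (1 << 20) / seconds;
        System.out.printf("%d codepoints total, ~%.1f MB/s%n", sink, mbPerSec);
    }
}
```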
Issue Analytics: created 5 years ago · 5 comments (5 by maintainers)
Top GitHub Comments
Does the benchmark contain any data with multibyte characters? If not, I would expect the whole thing to generate assembly with just a validation of that assumption and no real byte counting, thanks to this check: https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/type/Varchars.java#L83. If that is not happening, I'd look into restructuring the code so that the inlining happens, or hoist that check closer to the main loop so that the common path avoids the call entirely.
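The fast path the comment points at rests on a simple invariant: a UTF-8 codepoint occupies at least one byte, so if a value's byte length is already within the codepoint limit, it cannot need truncation and the per-byte walk can be skipped. A hedged sketch of that shape (names are illustrative, not the actual `Varchars.java` code):

```java
import java.nio.charset.StandardCharsets;

public class FastPathSketch {
    // Illustrative sketch of hoisting a cheap check ahead of the codepoint walk.
    static int truncationPoint(byte[] bytes, int maxCodePoints) {
        // Fast path: every codepoint is >= 1 byte, so a value with
        // byteLength <= maxCodePoints can never exceed the limit.
        // All-ASCII / short values (the common case) never enter the loop.
        if (bytes.length <= maxCodePoints) {
            return bytes.length;
        }
        // Slow path: count codepoints byte by byte (valid UTF-8 assumed).
        int codePoints = 0;
        int position = 0;
        while (position < bytes.length && codePoints < maxCodePoints) {
            int b = bytes[position] & 0xFF;
            position += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            codePoints++;
        }
        return Math.min(position, bytes.length);
    }

    public static void main(String[] args) {
        byte[] value = "hello".getBytes(StandardCharsets.UTF_8);
        // 5 bytes <= limit of 10 codepoints: fast path, no scan at all.
        System.out.println(truncationPoint(value, 10)); // 5
    }
}
```

Keeping the fast-path branch small also makes it a better inlining candidate for the JIT, which is the restructuring the comment suggests if the call is not being inlined today.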
Aha~ works very well!