Joins are too slow

Hi guys! I’m trying to migrate from Python + pandas to Kotlin + Tablesaw. Some parts of my code are already fast (CSV parsing, for example, is 2x faster than in pandas). But I’ve also noticed that the inner join operation is pretty slow (about 2 times slower than in pandas).

Then I tried to optimize my code by using an isIn Selection instead of a simple join. Unfortunately, it uses strings.toArray(new String[0]) under the hood for the input parameter collection. It would make more sense to use a HashSet for quicker lookup. So I wrote my own predicate:

val customersSet = customers.toHashSet()    // for faster lookup
val idColumn = transactions.stringColumn("customer_id")
idColumn.eval { it in customersSet }

This is 15x faster than the original inner join, at least on my huge dataset. Of course, this is much simpler than a join, since I don’t append columns, etc. But the difference is still huge. I haven’t investigated the join code yet, but I hope there is room for improvement there. My key point is: Kotlin + JVM should be at least not slower than Python + pandas. What do you guys think?
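
The hash-set lookup described above can be illustrated without Tablesaw at all. What follows is only a minimal sketch on plain Kotlin lists; `filterByKeys` and the sample data are hypothetical names used for illustration, not part of any library.

```kotlin
// Hypothetical helper: return the indices of rows whose key appears in `keys`.
// The HashSet is built once, so each membership test is O(1) on average
// rather than a linear scan over a list or array.
fun filterByKeys(ids: List<String>, keys: Collection<String>): List<Int> {
    val keySet = keys.toHashSet()
    // Collect matching row indices, similar in spirit to a selection.
    return ids.indices.filter { ids[it] in keySet }
}

fun main() {
    val customers = listOf("c1", "c3")
    val transactionCustomerIds = listOf("c1", "c2", "c3", "c1")
    println(filterByKeys(transactionCustomerIds, customers)) // [0, 2, 3]
}
```

Note that this is a semi-join (a filter), not a full join: no columns from the customers side are appended to the result.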

P.S. Do you use hash indexing on table columns?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 17

Top GitHub Comments

2 reactions
lwhite1 commented, Jan 20, 2019

I didn’t get this finished. It’s more complex than it first seemed. These notes are here for the next attempt.

There were two things contributing to slow joins.

  1. Indexes are recreated in a loop, once for each row in the main join table.
  2. The running time of the cross-product step seems to be sensitive to the number of times it is called. It is called once for each row in the main join table.

The second issue accounts for most of the time in the join.

I put through a fix for the first issue, which was to pre-create the indexes so they’re only created once per join. The performance is now independent of which table goes first.
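
The idea behind that first fix, building the index once instead of once per row, can be sketched roughly as follows. This is not Tablesaw's join code; `Row` and `innerJoin` are hypothetical names, and the tables are reduced to plain lists for illustration.

```kotlin
// Hypothetical row type for illustration: a join key plus one payload value.
data class Row(val key: String, val value: String)

// Inner join that builds the hash index on `right` exactly once,
// then probes it for every row of `left`.
fun innerJoin(left: List<Row>, right: List<Row>): List<Triple<String, String, String>> {
    // Built once per join, outside the loop over `left`.
    val index: Map<String, List<Row>> = right.groupBy { it.key }
    val result = mutableListOf<Triple<String, String, String>>()
    for (l in left) {
        // Each probe is a hash lookup; rows without a match are skipped.
        val matches = index[l.key] ?: continue
        for (r in matches) {
            result.add(Triple(l.key, l.value, r.value))
        }
    }
    return result
}

fun main() {
    val customers = listOf(Row("c1", "Ann"), Row("c2", "Bob"))
    val transactions = listOf(Row("c1", "t1"), Row("c1", "t2"), Row("c3", "t3"))
    println(innerJoin(customers, transactions))
    // [(c1, Ann, t1), (c1, Ann, t2)]
}
```

Moving the `groupBy` call inside the `for (l in left)` loop would reproduce the first problem listed above: the index would be rebuilt for every row of the main table.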

While it sounds easy to re-swap the tables in the join result, it’s difficult to make the result come out the same as it would have if the tables were handled in the order specified. The row order in the result is different, but I think that is acceptable. It does seem useful, however, to get the column ordering and naming to be the same. Achieving that is made difficult by interactions between other already implemented features:

  • handling multi-table joins means the join logic is called recursively if there are more than two tables involved.
  • handling tables with duplicate column names means that some of the columns in the result have different names than they have in the source tables. The algorithm for renaming the columns works by assigning a number to each table and prepending T[number]. to each duplicate column. This relies on the recursion to increment the table number for multi-table joins.
  • removing one of the join columns to avoid having two copies of the same data in the result means that the join table doesn’t have all the columns in the original main table.
  • handling multi-column joins means there’s an arbitrary number of such missing columns in the result table.

Some of these are easier to deal with than others, but together they make this a non-trivial fix.
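
To make the renaming scheme from the list above concrete, here is a rough sketch of prefixing duplicate column names with a table number. It is an assumption-laden illustration, not the library's algorithm: `qualifyDuplicates` is a hypothetical name, the 1-based numbering is assumed, and join columns get no special treatment here.

```kotlin
// Hypothetical sketch: prefix column names that occur in more than one table
// with "T<number>.", where <number> is the assumed 1-based table position.
fun qualifyDuplicates(tables: List<List<String>>): List<String> {
    // Count in how many tables each column name appears.
    val counts = tables.flatMap { it.distinct() }.groupingBy { it }.eachCount()
    return tables.flatMapIndexed { i, columns ->
        columns.map { name ->
            if (counts.getValue(name) > 1) "T${i + 1}.$name" else name
        }
    }
}

fun main() {
    val result = qualifyDuplicates(
        listOf(listOf("id", "name"), listOf("id", "amount"))
    )
    println(result) // [T1.id, name, T2.id, amount]
}
```

Because the prefix depends on the table's position, swapping the tables internally changes the resulting column names, which is one reason the re-swap is harder than it sounds.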

1 reaction
ryancerf commented, Aug 3, 2019

A left join with a schema similar to the one laid out above by @deviant-studio ddl19901201 now runs in about 300ms with PR #562
