Joins are too slow
See original GitHub issueHi guys!
I’m trying to migrate from python
+pandas
to kotlin
+tablesaw
.
Some parts of my code are already working fast (like csv parsing, x2 times faster than in pandas)
But also i’ve noticed that inner join
operation is pretty slow (~ 2 times slower than in pandas)
Then i tried to optimize my code and use isIn
Selection instead of simple join. Unfortunately it uses strings.toArray(new String[0])
under the hood for input parameter collection. It would be more sense to use HashSet to quicker lookup. So i wrote my own predicate:
val customersSet = customers.toHashSet() // for faster lookup
val idColumn = transactions.stringColumn("customer_id")
idColumn.eval { it in customersSet }
Which is x15 times faster than original inner join. At least on my huge dataset. Of course this is much simpler than join
operation since i haven’t appended columns etc. But still the difference is huge.
I didn’t investigate joins code yet, but i hope there is space for improvements there.
My key point is: kotlin+jvm should be at least not slower than python+pandas
What do you guys think?
ps: do you use hash indexing on table columns?
Issue Analytics
- State:
- Created 5 years ago
- Comments:17
Top GitHub Comments
I didn’t get this finished. It’s more complex than it first seemed. These notes are here for the next attempt.
There were two things contributing to slow joins.
The second issue accounts for most of the time in the join.
I put through a fix for the first issue, which was to pre-create the indexes so they’re only created once per join. The performance is now independent of which table goes first.
While it sounds easy re-swap the tables in the join result, it’s difficult to make the result come out the same as it would have if the tables were handled in the order specified. The row order in the result is different, but I think that is an acceptable issue. It does seem useful, however to get the column ordering and naming to be the same. Achieving that, however, is made difficult by interactions between other already implemented features:
Some of these are easier to deal with than others, but together they make this a non-trivial fix.
A left join with a schema similar to the one laid out above by @deviant-studio ddl19901201 now runs in about 300ms with PR #562