Is it possible to increase table loading speed?
Hi,
We are trying to load this table into Tablesaw.
We downloaded the above file onto an SSD and are running this code:
import java.io.File;
import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

final String tableSource = "/Users/tischer/Desktop/default.tsv";
System.out.println("Table source: " + tableSource);
CsvReadOptions.Builder builder = CsvReadOptions.builder(new File(tableSource))
        .separator('\t')
        .missingValueIndicator("na", "none", "nan");
long start = System.currentTimeMillis();
Table table = Table.read().usingOptions(builder);
System.out.println("Build Table from File [ms]: " + (System.currentTimeMillis() - start));
This takes around 1600 ms.
Do you have any suggestions for how to potentially speed this up? We are also open to storing the table in another file format if that would help.
Thank you very much!
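One thing worth trying, sketched below under the assumption that the file's column types are known in advance: Tablesaw's CsvReadOptions lets you declare column types up front via columnTypes, which skips the type-detection pass over the file before the real parse. The temp file and its two columns here are hypothetical stand-ins for default.tsv.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

public class TypedCsvRead {
    // Read a TSV with column types declared up front, skipping type detection.
    static Table readTyped(File file, ColumnType[] types) throws IOException {
        CsvReadOptions options = CsvReadOptions.builder(file)
                .separator('\t')
                .missingValueIndicator("na", "none", "nan")
                .columnTypes(types)
                .build();
        return Table.read().usingOptions(options);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical two-column TSV standing in for default.tsv.
        Path tsv = Files.createTempFile("default", ".tsv");
        Files.writeString(tsv, "name\tscore\na\t1.5\nb\tna\n");

        long start = System.currentTimeMillis();
        Table table = readTyped(tsv.toFile(),
                new ColumnType[] { ColumnType.STRING, ColumnType.DOUBLE });
        System.out.println("Rows: " + table.rowCount());
        System.out.println("Build Table from File [ms]: " + (System.currentTimeMillis() - start));
    }
}
```

Whether this helps noticeably depends on how much of the 1600 ms is spent on type inference versus JVM warm-up, so it is worth timing both variants on the actual file.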
Issue Analytics
- Created a year ago
- Comments: 7 (1 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yes, it’s the JVM loading the code from the libraries, and the JIT compiler optimizing it. As far as I know there is no easy way to speed this up (and it’s already highly optimized). This is why code performance is a particularly tricky topic in Java…
I gave your file a try in Parquet.
Reading it back is much faster, but by that point the JVM is already warm.
While reading the Parquet file on a warm JVM is consistently faster (because it is a binary format), on a cold JVM it is actually slower, probably because there is more code to load and/or optimize.
You can see code loading taking its toll on the first run if you compare the Parquet reader log to the externally timed operation.
Context is very important for performance considerations. Hope this helps.
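The cold-versus-warm effect described above can be reproduced without Tablesaw at all. The sketch below (plain JDK, synthetic in-memory data) times the same parsing loop several times; the first run pays for class loading and interpreted execution, while later runs use JIT-compiled code.

```java
import java.util.Locale;

public class WarmupDemo {
    // Parse a small in-memory "TSV" and return the sum of its numeric column.
    static double parse(String tsv) {
        double sum = 0;
        for (String line : tsv.split("\n")) {
            String[] cells = line.split("\t");
            sum += Double.parseDouble(cells[1]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Build 100,000 synthetic rows of "rowN<TAB>digit".
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append("row").append(i).append('\t').append(i % 10).append('\n');
        }
        String tsv = sb.toString().trim();

        // The first run is typically the slowest; later runs benefit from the JIT.
        for (int run = 1; run <= 5; run++) {
            long t0 = System.nanoTime();
            double sum = parse(tsv);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.printf(Locale.ROOT, "run %d: sum=%.1f, %d ms%n", run, sum, ms);
        }
    }
}
```

The exact timings vary by machine and JVM, but the downward trend across runs is the warm-up the comment is describing.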
I ran another test, where I just read a table with only two rows!
This is extreme 😉
Is that the JIT compiling all the table-parsing code during the first run?
If so, do you have any experience with multi-threading in that regard? This suggests it might actually be better to read many tables sequentially, rather than in parallel, to give the JIT a chance to compile the code first.
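One way to test that hypothesis, sketched below: read a tiny throwaway table a few times before timing the real one, so class loading and some JIT compilation happen up front. Both files here are hypothetical temp files (the large one stands in for default.tsv); a handful of tiny reads may only partially warm the relevant code paths, so this is an experiment rather than a guaranteed win.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

public class WarmThenRead {
    static Table readTsv(File file) throws IOException {
        return Table.read().usingOptions(
                CsvReadOptions.builder(file).separator('\t').build());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical tiny warm-up file: reading it repeatedly triggers class
        // loading and gives the JIT a chance to compile the hot parsing paths.
        Path warmup = Files.createTempFile("warmup", ".tsv");
        Files.writeString(warmup, "a\tb\n1\t2\n");
        for (int i = 0; i < 10; i++) {
            readTsv(warmup.toFile());
        }

        // Synthetic "real" table, timed only after the warm-up reads.
        Path real = Files.createTempFile("real", ".tsv");
        StringBuilder sb = new StringBuilder("a\tb\n");
        for (int i = 0; i < 100_000; i++) {
            sb.append(i).append('\t').append(i * 2).append('\n');
        }
        Files.writeString(real, sb.toString());

        long start = System.currentTimeMillis();
        Table table = readTsv(real.toFile());
        System.out.println("Rows: " + table.rowCount());
        System.out.println("Warm read [ms]: " + (System.currentTimeMillis() - start));
    }
}
```

Comparing this timing against a cold read of the same file would show how much of the original 1600 ms is warm-up rather than actual parsing.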