Error: Unable to instantiate java compiler
Hi! @nils-braun,
As you already know, I mistakenly opened this issue on the Dask-Docker repo, and you were kindly alerted by @jrbourbeau.
I will copy/paste my original post here, as well as your initial answer (thank you for your quick reply!).
Here is my original post:
####################################################################
What happened:
After installing Java and dask-sql using pip, whenever I try to run a SQL query from my python code I get the following error:
...
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
rel, select_names, _ = self._get_ral(sql)
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
...
...
File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException
What you expected to happen:
I should get a dataframe as a result.
Minimal Complete Verifiable Example:
# The cluster/client setup is done first, in another module not the one executing the SQL query
# Also tried other cluster/scheduler types with the same error
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    dashboard_address=':8787',
    asynchronous=False,
    memory_limit='1GB',
)
client = Client(cluster)
# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context
c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet')
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""") # This line fails with the error reported
Anything else we need to know?:
As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before the dask-sql context is created.
Environment:
- Dask version:
- dask: 2020.12.0
- dask-sql: 0.3.1
- Python version:
- Python 3.8.5
- Operating System:
- Ubuntu 20.04.1 LTS
- Install method (conda, pip, source):
- pip
- Application Framework
- Jupyter Notebook/Ipywidgets & Voila Server
Install steps
$ sudo apt install default-jre
$ sudo apt install default-jdk
$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
$ javac -version
javac 11.0.10
$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ pip install dask-sql
$ pip list | grep dask-sql
dask-sql 0.3.1
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 1
- Comments: 9
Top GitHub Comments
@nils-braun
THANK YOU!
It works.
I was worried at first about the overhead of importing dask-sql (spawning the JVM), setting up the context, and computing the results in a new thread each time a query is issued.
It turns out the overhead is minimal and not significant in my particular scenario (I am not expecting hundreds of users to connect to the application).
From a cold start (no previous queries), the timing is as follows on my system:
On the second execution, the duration of each of the steps above is greatly reduced (to a few microseconds).
Once the JVM is spawned, that same JVM is used for subsequent queries.
I think this usage of threads is worth mentioning in the documentation.
Again thank you!
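The thread-per-query pattern described above can be sketched as follows. This is an illustrative sketch, not dask-sql's API: `run_query` and `holder` are hypothetical names, and a stand-in import (`json` here, standing in for `dask_sql`, whose import spawns the JVM) keeps it runnable without Java installed.

```python
import threading

def run_query(sql, result_holder):
    # Import *inside* the thread, so that any module-level initialisation
    # (for dask-sql: spawning the JVM) happens in the same thread that
    # will use the library afterwards.
    import json as dask_sql  # placeholder for `import dask_sql`
    # With real dask-sql you would now build a Context, register tables,
    # and compute; here we just store the query string to keep the
    # sketch self-contained.
    result_holder["result"] = sql.strip()

holder = {}
t = threading.Thread(target=run_query, args=("select ID, Source from df", holder))
t.start()
t.join()
```

Since the JVM is reused after the first import, only the first query pays the start-up cost.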
Maybe I misunderstood your use case then. Let me try to write down what I understood, and you can tell me how wrong I was 😃 (but also see below for a possible solution without the futures, using the SQL server, and even further below one using your original proposal).
I imagine you have some sort of “frontend”/UI, where users can click a button or create some other type of input. In your original idea (which was quite reasonable, except for the Java problem), you then spin up a new thread to kick off the computation with dask-sql, which in turn triggers a Dask computation with a LocalScheduler. Now (that is what I understood), with the futures you should be able to live without this thread. If a user enters some input, you kick off the dask-sql computation directly from within your UI process. This will basically only send the computation to the (local) cluster. The cluster does not run dask-sql, but only Dask, so you should not face the same problems as before.

While writing this, it came to my mind that, depending on your use case, you might also be interested in using dask-sql in server mode. If you are only interested in sending SQL strings to dask-sql, you might be able to do this with a usual sqlalchemy connection (which will work in a separate thread). You would then run the dask-sql server in a separate process (unrelated to your main process).

When using the “single-machine distributed scheduler” I do not get any errors (that is my usual development environment).
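The submit-now/collect-later hand-off described above has the same shape as Dask's distributed `Client.submit`, which returns a Future immediately. A minimal sketch of that shape, using the stdlib's `concurrent.futures` so it runs without a cluster; `run_query` is a hypothetical placeholder for the real dask-sql computation:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real work; with dask-sql this would call
# Context.sql(...) and compute the resulting dataframe on the cluster.
def run_query(sql: str) -> str:
    return sql.strip().lower()

executor = ThreadPoolExecutor(max_workers=1)

# submit() returns immediately with a Future, so the UI thread is not
# blocked; the result is collected whenever it is ready.
future = executor.submit(run_query, "  SELECT 1  ")
print(future.result())  # blocks only at collection time
```

With `dask.distributed`, `client.submit(...)` (or `client.compute(...)` on a lazy dataframe) plays the role of `executor.submit(...)` here, sending the work to the cluster instead of a local thread.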
But thanks for asking this question, as it might have brought us a bit closer to the solution.
Compare those two code blocks:
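(The two code blocks referred to here are not preserved in this copy of the issue. What follows is a hedged reconstruction of the pattern being compared: the dask-sql lines are shown commented out, and a stand-in import keeps the sketch runnable without Java.)

```python
import threading

# Variant 1 (raises the "Unable to instantiate java compiler" error):
# the import, and therefore the JVM start-up, happens in the main thread,
# while the queries run in a worker thread.
#
#   import dask_sql                      # JVM starts in the main thread
#   def worker():
#       c = dask_sql.Context()           # JVM used from a *different* thread
#   threading.Thread(target=worker).start()

# Variant 2 (works): import inside the worker thread, so the JVM lives in
# the thread that actually uses it.
result = {}

def worker():
    import json as dask_sql  # placeholder for `import dask_sql`
    result["ok"] = True

t = threading.Thread(target=worker)
t.start()
t.join()
```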
The first one throws the error you described, whereas the second one works. The reason is that while importing dask_sql, the internal Java virtual machine is spun up. If you then switch the thread, the JVM is accessed from a different thread, which probably confuses it. If you are able to run dask-sql completely in the newly created thread, you would also be able to import it only after having created the thread, and this might already solve your issue (that is also new to me, so thanks for sharing your issue!).

That is unfortunately not the same exception as the one we see here. Also, the code that raises the exception is in the Calcite library, not something I have direct access to (but I think with the solutions above, we might not even need to “fix” it).