Error: Unable to instantiate java compiler
Hi! @nils-braun,
As you already know, I mistakenly opened this issue on the Dask-Docker repo, and you were kindly alerted by @jrbourbeau.
I will copy/paste my original post here, as well as your initial answer (thank you for your quick reply!).
Here is my original post:
####################################################################
What happened:
After installing Java and dask-sql using pip, whenever I try to run a SQL query from my python code I get the following error:
...
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
rel, select_names, _ = self._get_ral(sql)
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
...
...
File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException
What you expected to happen:
I should get a dataframe as a result.
Minimal Complete Verifiable Example:
# The cluster/client setup is done first, in another module not the one executing the SQL query
# Also tried other cluster/scheduler types with the same error
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    dashboard_address=':8787',
    asynchronous=False,
    memory_limit='1GB',
)
client = Client(cluster)
# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context
c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet')
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""") # This line fails with the error reported
Anything else we need to know?:
As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before the dask-sql context is created.
Environment:
- Dask version:
- dask: 2020.12.0
- dask-sql: 0.3.1
- Python version:
- Python 3.8.5
- Operating System:
- Ubuntu 20.04.1 LTS
- Install method (conda, pip, source):
- pip
- Application Framework
- Jupyter Notebook/Ipywidgets & Voila Server
Install steps
$ sudo apt install default-jre
$ sudo apt install default-jdk
$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
$ javac -version
javac 11.0.10
$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ pip install dask-sql
$ pip list | grep dask-sql
dask-sql 0.3.1
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 1
- Comments: 9
Top GitHub Comments
@nils-braun
THANK YOU!
It works.
I was worried at first about the overhead of importing dask-sql (spawning the JVM), setting up the context, and computing the results in a new thread each time a query is issued.
It turns out the overhead is minimal and not significant in my particular scenario (I am not expecting hundreds of users to connect to the application).
From a cold start (no previous queries), the timing is as follows on my system:
On the second execution, the duration of each of the steps above is greatly reduced (to a few microseconds).
Once the JVM is spawned, that same JVM is used for subsequent queries.
I think this usage of threads is worth mentioning in the documentation.
Again thank you!
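The thread-per-query pattern described above can be sketched as follows. This is an illustrative sketch, not dask-sql's API: `run_query` and `holder` are hypothetical names, and a stand-in import (`json` here, standing in for `dask_sql`, whose import spawns the JVM) keeps it runnable without Java installed.

```python
import threading

def run_query(sql, result_holder):
    # Import *inside* the thread, so that any module-level initialisation
    # (for dask-sql: spawning the JVM) happens in the same thread that
    # will use the library afterwards.
    import json as dask_sql  # placeholder for `import dask_sql`
    # With real dask-sql you would now build a Context, register tables,
    # and compute; here we just store the query string to keep the
    # sketch self-contained.
    result_holder["result"] = sql.strip()

holder = {}
t = threading.Thread(target=run_query, args=("select ID, Source from df", holder))
t.start()
t.join()
```

Since the JVM is reused after the first import, only the first query pays the start-up cost.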
Maybe I misunderstood your use case then. Let me try to write down what I understood, and you can tell me how wrong I was 😃 (but also see below for a possible solution without the futures, using the SQL server, and even further below one using your original proposal).
I imagine you have some sort of “frontend”/UI, where users can click a button or create some other type of input. In your original idea (which was quite reasonable, except for the Java problem), you then spin up a new thread to kick off the computation with dask-sql, which in turn triggers a Dask computation with a LocalScheduler. Now (that is what I understood), with the futures you should be able to live without this thread. If a user enters some input, you kick off the dask-sql computation directly from within your UI process. This will basically only send the computation to the (local) cluster. The cluster does not run dask-sql, but only Dask, so you should not face the same problems as before.

While writing this, it came to my mind that, depending on your use case, you might also be interested in using dask-sql in server mode. If you are only interested in sending SQL strings to dask-sql, you might be able to do this with a usual sqlalchemy connection (which will work in a separate thread). You would then run the dask-sql server in a separate process (unrelated to your main process).

When using the “single-machine distributed scheduler” I do not get any errors (that is my usual development environment).
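The submit-now/collect-later hand-off described above has the same shape as Dask's distributed `Client.submit`, which returns a Future immediately. A minimal sketch of that shape, using the stdlib's `concurrent.futures` so it runs without a cluster; `run_query` is a hypothetical placeholder for the real dask-sql computation:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real work; with dask-sql this would call
# Context.sql(...) and compute the resulting dataframe on the cluster.
def run_query(sql: str) -> str:
    return sql.strip().lower()

executor = ThreadPoolExecutor(max_workers=1)

# submit() returns immediately with a Future, so the UI thread is not
# blocked; the result is collected whenever it is ready.
future = executor.submit(run_query, "  SELECT 1  ")
print(future.result())  # blocks only at collection time
```

With `dask.distributed`, `client.submit(...)` (or `client.compute(...)` on a lazy dataframe) plays the role of `executor.submit(...)` here, sending the work to the cluster instead of a local thread.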
But thanks for asking this question, as it might have brought us a bit closer to the solution.
Compare those two code blocks:
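(The two code blocks referred to here are not preserved in this copy of the issue. What follows is a hedged reconstruction of the pattern being compared: the dask-sql lines are shown commented out, and a stand-in import keeps the sketch runnable without Java.)

```python
import threading

# Variant 1 (raises the "Unable to instantiate java compiler" error):
# the import, and therefore the JVM start-up, happens in the main thread,
# while the queries run in a worker thread.
#
#   import dask_sql                      # JVM starts in the main thread
#   def worker():
#       c = dask_sql.Context()           # JVM used from a *different* thread
#   threading.Thread(target=worker).start()

# Variant 2 (works): import inside the worker thread, so the JVM lives in
# the thread that actually uses it.
result = {}

def worker():
    import json as dask_sql  # placeholder for `import dask_sql`
    result["ok"] = True

t = threading.Thread(target=worker)
t.start()
t.join()
```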
The first one throws the error you described, whereas the second one works. The reason is that while importing dask_sql, the internal Java virtual machine is spun up. If you then switch the thread, the JVM is accessed from a different thread, which probably confuses it. If you are able to run dask-sql completely in the newly created thread, you would also be able to import it only after having created the thread, and this might already solve your issue (that is also new to me, so thanks for sharing your issue!).

That is unfortunately not the same exception as the one we see here. Also, the code that raises the exception is in the Calcite library, not something I have direct access to (but I think with the solutions above, we might not even need to “fix” it).