Support for NULL values on indexed columns
What went wrong?
We tried to index a dataset from the TPC-DS benchmark, and two of the columns chosen for indexing contained null values. We should add support for null values.
21/09/28 09:17:18 ERROR Executor: Exception in task 10.0 in stage 4.0 (TID 28)
org.apache.spark.sql.AnalysisException: Column to index contains null values. Please initialize them before indexing
How to reproduce?
- Code that triggered the bug, or steps to reproduce: indexing the public S3 dataset from the indexing notebook by “ss_customer_sk”, “ss_item_sk”, and “ss_sold_date_sk” (a workaround sketch follows the stack trace below):
parquet_df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "ss_customer_sk,ss_item_sk,ss_sold_date_sk").save(qbeast_table_path)
- Branch and commit id: Main at 15667c27bb2cc6d76cecd680d61e22fa8f571d49
- Spark version: 3.1.1
- Hadoop version: 3.2.2
- Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer? I am running the Spark shell on a local computer.
Stack trace:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.sql.AnalysisException: Column to index contains null values. Please initialize them before indexing
at org.apache.spark.sql.AnalysisExceptionFactory$.create(AnalysisExceptionFactory.scala:36)
at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$rowValuesToPoint$1(OTreeAlgorithm.scala:322)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at io.qbeast.spark.index.OTreeAlgorithmImpl.rowValuesToPoint(OTreeAlgorithm.scala:317)
at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$index$3(OTreeAlgorithm.scala:233)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$index$2(OTreeAlgorithm.scala:231)
at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:195)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
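As the error message suggests, a possible workaround until null support lands is to initialize the null values before indexing. A minimal sketch for the Spark shell, assuming the columns from the report; the fill value 0 is an assumption, so pick a placeholder outside the real domain if 0 is meaningful in your data:

// Replace nulls in the columns to index before writing with the qbeast format.
// The fill value (0) is an assumption, not part of the original report.
val filled_df = parquet_df.na.fill(0, Seq("ss_customer_sk", "ss_item_sk", "ss_sold_date_sk"))
filled_df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "ss_customer_sk,ss_item_sk,ss_sold_date_sk")
  .save(qbeast_table_path)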
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Merged in #67
First, indeed, we don’t care so much about those null values for schema evolution: since we will be working with different space revisions, the null values of the new columns do not affect older versions. As for normal datasets with null values, I think you are right; we can treat them as 0.
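A hypothetical sketch of what “treat them as 0” could look like when building a point from row values; the helper below is illustrative only, not the actual rowValuesToPoint implementation:

// Hypothetical coordinate extraction: map a null cell to 0.0 instead of
// throwing AnalysisException, so rows with nulls can still be indexed.
def coordinate(value: Any): Double = value match {
  case null      => 0.0               // assumed default for null values
  case n: Number => n.doubleValue()   // numeric columns, as in the TPC-DS example
  case other     => throw new IllegalArgumentException(s"Cannot index value: $other")
}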
Yes, on Spark, all elements retrieved from index blocks will be filtered again in memory to satisfy the WHERE clause.
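For illustration, a hypothetical read of the table written above: qbeast uses the index to prune blocks by the indexed column, and Spark then re-applies the full predicate in memory.

// Read through the qbeast format and filter on an indexed column; block
// pruning narrows the scan, and the WHERE clause is evaluated again in memory.
val qbeast_df = spark.read.format("qbeast").load(qbeast_table_path)
qbeast_df.where("ss_customer_sk = 42").show()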