Support for NULL values on indexed columns
What went wrong?
We tried to index a dataset from the TPC-DS benchmark, and two of the columns chosen for indexing contained null values. We should add support for null values.
21/09/28 09:17:18 ERROR Executor: Exception in task 10.0 in stage 4.0 (TID 28)
org.apache.spark.sql.AnalysisException: Column to index contains null values. Please initialize them before indexing
How to reproduce?
- Code that triggered the bug, or steps to reproduce: indexing the public S3 dataset from the indexing notebook by “ss_customer_sk”, “ss_item_sk”, and “ss_sold_date_sk” (a workaround sketch follows the stack trace below):
parquet_df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "ss_customer_sk,ss_item_sk,ss_sold_date_sk").save(qbeast_table_path)
- Branch and commit id: Main at 15667c27bb2cc6d76cecd680d61e22fa8f571d49
- Spark version: 3.1.1
- Hadoop version: 3.2.2
- Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer? I am running the Spark shell on a local computer.
Stack trace:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.sql.AnalysisException: Column to index contains null values. Please initialize them before indexing
at org.apache.spark.sql.AnalysisExceptionFactory$.create(AnalysisExceptionFactory.scala:36)
at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$rowValuesToPoint$1(OTreeAlgorithm.scala:322)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at io.qbeast.spark.index.OTreeAlgorithmImpl.rowValuesToPoint(OTreeAlgorithm.scala:317)
at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$index$3(OTreeAlgorithm.scala:233)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$index$2(OTreeAlgorithm.scala:231)
at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:195)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
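As the error message suggests, a possible workaround until null support lands is to initialize the null values before indexing. A minimal sketch for the Spark shell, assuming the columns from the report; the fill value 0 is an assumption, so pick a placeholder outside the real domain if 0 is meaningful in your data:

// Replace nulls in the columns to index before writing with the qbeast format.
// The fill value (0) is an assumption, not part of the original report.
val filled_df = parquet_df.na.fill(0, Seq("ss_customer_sk", "ss_item_sk", "ss_sold_date_sk"))
filled_df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "ss_customer_sk,ss_item_sk,ss_sold_date_sk")
  .save(qbeast_table_path)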
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Merged in #67
First, indeed, we don’t care so much about those null values for schema evolution: since we will be working with different space revisions, the null values of the new columns do not affect older versions. As for normal datasets with null values, I think you are right; we can treat them as 0.
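A hypothetical sketch of what “treat them as 0” could look like when building a point from row values; the helper below is illustrative only, not the actual rowValuesToPoint implementation:

// Hypothetical coordinate extraction: map a null cell to 0.0 instead of
// throwing AnalysisException, so rows with nulls can still be indexed.
def coordinate(value: Any): Double = value match {
  case null      => 0.0               // assumed default for null values
  case n: Number => n.doubleValue()   // numeric columns, as in the TPC-DS example
  case other     => throw new IllegalArgumentException(s"Cannot index value: $other")
}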
Yes, on Spark, all elements retrieved from index blocks will be filtered again in memory to satisfy the WHERE clause.
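For illustration, a hypothetical read of the table written above: qbeast uses the index to prune blocks by the indexed column, and Spark then re-applies the full predicate in memory.

// Read through the qbeast format and filter on an indexed column; block
// pruning narrows the scan, and the WHERE clause is evaluated again in memory.
val qbeast_df = spark.read.format("qbeast").load(qbeast_table_path)
qbeast_df.where("ss_customer_sk = 42").show()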