
NULL values on indexed columns support


What went wrong?

We tried to index a TPC-DS dataset, and two of the columns chosen for indexing contained null values. We should add support for null values.

21/09/28 09:17:18 ERROR Executor: Exception in task 10.0 in stage 4.0 (TID 28)
org.apache.spark.sql.AnalysisException: Column to index contains null values. Please initialize them before indexing

How to reproduce?

  1. Code that triggered the bug, or steps to reproduce: indexing the public S3 dataset from the Indexing notebook on “ss_customer_sk”, “ss_item_sk”, “ss_sold_date_sk”:
parquet_df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "ss_customer_sk,ss_item_sk,ss_sold_date_sk").save(qbeast_table_path)
  2. Branch and commit id: Main on 15667c27bb2cc6d76cecd680d61e22fa8f571d49

  3. Spark version: 3.1.1

  4. Hadoop version: 3.2.2

  5. Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer? I am running Spark shell on a local computer.

  6. Stack trace:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.sql.AnalysisException: Column to index contains null values. Please initialize them before indexing
  at org.apache.spark.sql.AnalysisExceptionFactory$.create(AnalysisExceptionFactory.scala:36)
  at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$rowValuesToPoint$1(OTreeAlgorithm.scala:322)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at io.qbeast.spark.index.OTreeAlgorithmImpl.rowValuesToPoint(OTreeAlgorithm.scala:317)
  at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$index$3(OTreeAlgorithm.scala:233)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at io.qbeast.spark.index.OTreeAlgorithmImpl.$anonfun$index$2(OTreeAlgorithm.scala:231)
  at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:195)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
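Until null support is added, one possible workaround is to replace NULLs in the indexed columns before writing. The sketch below is an assumption, not part of the qbeast-spark API: `null_defaults` and the choice of 0 as the default are hypothetical, and the PySpark usage is shown in comments because it needs a live Spark session.

```python
# Hypothetical helper (not part of qbeast-spark): build the mapping that
# DataFrame.na.fill expects, replacing NULLs in the indexed columns.
def null_defaults(columns, default=0):
    """Map each column name to the value used in place of NULL."""
    return {column: default for column in columns}

# Intended PySpark usage (requires an active SparkSession):
#   cols = ["ss_customer_sk", "ss_item_sk", "ss_sold_date_sk"]
#   filled_df = parquet_df.na.fill(null_defaults(cols))
#   filled_df.write.mode("overwrite").format("qbeast") \
#       .option("columnsToIndex", ",".join(cols)) \
#       .save(qbeast_table_path)
```

Whether 0 is a sensible stand-in depends on the column's value distribution; any sentinel outside the real range will skew the index's space partitioning toward that corner.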

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
osopardo1 commented, Mar 16, 2022

Merged in #67

0 reactions
osopardo1 commented, Aug 26, 2021

First, indeed, we don’t care much about those null values for schema evolution: since we will be working with different space revisions, null values in the new columns do not affect older revisions. As for normal datasets with null values, I think you are right; we can treat them as 0.
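The suggestion above to treat nulls as 0 can be sketched in plain Python. This is a hypothetical, simplified stand-in for the library's `rowValuesToPoint` (which today raises the AnalysisException), not the actual implementation:

```python
def row_values_to_point(row, columns, null_value=0.0):
    # Simplified sketch: instead of failing on a NULL cell, substitute a
    # default coordinate so the row can still be mapped into the space.
    return tuple(
        float(row[name]) if row[name] is not None else null_value
        for name in columns
    )
```

For example, `row_values_to_point({"ss_item_sk": 7, "ss_customer_sk": None}, ["ss_item_sk", "ss_customer_sk"])` yields `(7.0, 0.0)`.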

Yes, in Spark, all elements retrieved from index blocks are filtered again in memory to satisfy the WHERE clause.
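That two-phase read can be sketched as follows (hypothetical names, not the qbeast API): the index only prunes whole blocks coarsely, and the exact predicate is then re-applied per row in memory.

```python
def read_with_index(blocks, block_may_match, predicate):
    # Phase 1: coarse pruning -- skip blocks whose index metadata
    # rules them out entirely.
    candidates = (row for block in blocks
                  if block_may_match(block)
                  for row in block["rows"])
    # Phase 2: exact filtering -- re-apply the WHERE predicate row by row,
    # since a surviving block may still contain non-matching rows.
    return [row for row in candidates if predicate(row)]
```

The key point is that phase 1 may return false positives (rows in a matching block that fail the predicate) but never false negatives, so phase 2 restores exact semantics.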


