Empty partitions are left behind after `DELETE FROM`

CrateDB version

4.7.0

CrateDB setup information

Single node; docker

Steps to Reproduce

Create table:

create table test (
  ts timestamp, 
  ts_day generated always as date_trunc('day', ts), 
  value int)
partitioned by (ts_day);

Insert sample data:

insert into test (ts, value) values ('2022-02-21T00:00', 1), ('2022-02-22T00:00', 2), ('2022-02-23T00:00', 3);

Delete based on ts column:

delete from test where ts <= '2022-02-22T12:00';

Expected Result

1 Partition (ts_day=1645574400000) with 1 record

Actual Result

3 Partitions:

  • ts_day=1645401600000 with zero records
  • ts_day=1645488000000 with zero records
  • ts_day=1645574400000 with 1 record
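
The partition values above are epoch milliseconds. As a rough cross-check (plain Python, not CrateDB code), the sketch below reproduces what `date_trunc('day', ts)` would yield for the three inserted rows, assuming the timestamps are interpreted as UTC. The helper name `day_partition_ms` is made up for this illustration.

```python
from datetime import datetime, timezone

def day_partition_ms(ts: str) -> int:
    """Truncate an ISO timestamp (assumed UTC) to midnight, return epoch milliseconds."""
    dt = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)

rows = ["2022-02-21T00:00", "2022-02-22T00:00", "2022-02-23T00:00"]
partitions = [day_partition_ms(ts) for ts in rows]
print(partitions)
```

Each row lands in its own daily partition, which is why three partitions exist before the delete.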

Working query

delete from test where ts_day <= '2022-02-22T12:00';

This query drops partitions as well.

I would have expected the optimizer to also infer from the first query (with WHERE ts...) that the full partition can be dropped.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

mayraq10 commented, Jun 27, 2022

Thanks @mfussenegger. I tried the first suggestion but could not find the debug configuration Crate. That's OK, because the second option worked with one minor adjustment: we used export JAVA_OPTS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0.0.0.0:5005' instead, then attached. Making progress now.

matriv commented, Jun 28, 2022

When the WHERE clause contains a column that participates in a generated column which is part of the PARTITIONED BY expression, we “translate” the filter comparison using the generating expression and add it with an AND to the original WHERE filter, e.g.:

CREATE TABLE t (x int, xv generated always as x + 1) PARTITIONED by (xv);
INSERT INTO t (x) VALUES (1), (2), (3);
DELETE FROM t where x = 1;

the DELETE becomes:

DELETE FROM t where (x = 1) AND (xv AS (x + 1) = 2)

Then, in WhereClauseAnalyzer#resolvePartitions(), we check this new WHERE query against all partitions. In our example the table has 3 partitions with values 2, 3 and 4 (for xv):

  • For partition 2 the query normalizes to x = 1 AND true => x = 1
  • For partition 3 the query normalizes to x = 1 AND false => false
  • For partition 4 the query normalizes to x = 1 AND false => false

Therefore, we end up with a query running on partition 2 (but not deleting it, only the relevant doc) and the other 2 partitions are not visited.
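
The per-partition normalization described above can be mimicked with a small Python sketch. This is only an illustration of the reasoning, not the actual WhereClauseAnalyzer code: `normalize_and` is a hypothetical helper, and the document-level predicate is kept as an opaque string.

```python
from typing import Union

# Residual query for one partition: False means "skip the partition",
# a string is the predicate that still has to be checked per document.
Normalized = Union[bool, str]

def normalize_and(doc_predicate: str, partition_matches: bool) -> Normalized:
    """Simplify `doc_predicate AND partition_matches` for one partition."""
    return doc_predicate if partition_matches else False

partition_values = [2, 3, 4]  # values of xv, one partition each
translated_value = 2          # x = 1 translated through xv = x + 1

per_partition = {
    xv: normalize_and("x = 1", xv == translated_value)
    for xv in partition_values
}
print(per_partition)
```

Only partition 2 keeps a residual query, and that residual is a document filter rather than `true`, which matches the observation that the partition is visited but never dropped.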

  1. If we added the partition/generated condition with OR instead of AND:
DELETE FROM t where (x = 1) OR (xv AS (x + 1) = 2)

Then:

  • For partition 2 the query normalizes to x = 1 OR true => true
  • For partition 3 the query normalizes to x = 1 OR false => x = 1
  • For partition 4 the query normalizes to x = 1 OR false => x = 1

In turn, because of the WhereClauseAnalyzer#tieBreakPartitionQueries() code, we end up with a map with 2 entries, so we cannot optimize and we run the whole query (x = 1) OR (xv AS (x + 1) = 2) on all 3 partitions.
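
To see where the 2 map entries come from, here is a hedged Python sketch of the OR variant. The grouping step is only a stand-in for what tieBreakPartitionQueries() receives, not its real implementation, and `normalize_or` is a hypothetical helper.

```python
def normalize_or(doc_predicate, partition_matches):
    """Simplify `doc_predicate OR partition_matches` for one partition."""
    return True if partition_matches else doc_predicate

partition_values = [2, 3, 4]  # values of xv, one partition each
translated_value = 2          # x = 1 translated through xv = x + 1

per_partition = {
    xv: normalize_or("x = 1", xv == translated_value)
    for xv in partition_values
}

# Group partitions by their normalized query (illustrative only).
partitions_by_query = {}
for xv, query in per_partition.items():
    partitions_by_query.setdefault(query, []).append(xv)
print(partitions_by_query)
```

Two distinct normalized queries remain (`true` for partition 2, `x = 1` for partitions 3 and 4), so no single query applies to one unambiguous set of partitions.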

  2. If we replace the original condition:
DELETE FROM t where xv AS (x + 1) = 2

Then somehow it seems to work in this case, but it’s not the correct solution, because we have lost the original query and would instead need to run something more complex to match docs. Keep in mind that WhereClauseAnalyzer also works for SELECTs.

Solution:

  1. I think we need to preserve the original query plus the one based on the generated/partition columns. For DELETE: if a partition evaluates to TRUE, we can add it to the list of partitions to remove completely; if it results in canMatch, we add it to a list of partitions on which we run the original query and delete docs. For SELECT statements we can use the translated query, add all partitions with TRUE and canMatch to the list, and run the original query on those to select docs.
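
A minimal Python sketch of that three-way classification, applied to the daily partitions from the original report, might look like this. It is not CrateDB code. The `Match` enum, the `classify` helper, and the rule used to decide TRUE (the whole day lies at or before the cutoff) are assumptions made for this illustration, as is treating the timestamps as UTC.

```python
from enum import Enum

DAY_MS = 86_400_000

class Match(Enum):
    TRUE = "drop whole partition"
    CAN_MATCH = "run original query on partition"
    FALSE = "skip partition"

def classify(day_start_ms: int, cutoff_ms: int) -> Match:
    """Classify one daily partition against the filter `ts <= cutoff_ms`."""
    day_end_ms = day_start_ms + DAY_MS - 1  # last millisecond the partition can hold
    if day_end_ms <= cutoff_ms:
        return Match.TRUE       # every possible row matches
    if day_start_ms <= cutoff_ms:
        return Match.CAN_MATCH  # some rows may match, check per document
    return Match.FALSE          # no row can match

partitions = [1645401600000, 1645488000000, 1645574400000]
cutoff = 1645488000000 + 12 * 3_600_000  # 2022-02-22T12:00 UTC

plan = {p: classify(p, cutoff) for p in partitions}
for p, decision in plan.items():
    print(p, decision.value)
```

Under these assumptions the 2022-02-21 partition would be dropped outright, while the 2022-02-22 partition would only be filtered, since rows later than 12:00 could exist in it.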