SparseArray read order affects correctness
This is a weird one: it seems that in some cases, after opening a SparseArray in read-only mode, querying an empty subset of the domain causes all future queries (even of non-empty subsets) to appear empty. Queries of empty domain subsets do not affect correctness if they are preceded by a query of a non-empty domain subset.
My current workaround is just to perform a dummy query of the entire array (_ = arr[:, :]) every time I open one. This seems to fix the issue, but it shouldn’t be necessary.
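The workaround can be wrapped in a small helper so it is not forgotten at each open. This is only a minimal sketch: open_with_warmup and its open_array parameter are hypothetical names, not part of TileDB-Py, and the only array operation assumed is the arr[:, :] read described above. The dummy read fetches the entire array, so it may be costly for large arrays.

```python
def open_with_warmup(open_array):
    """Open an array and immediately issue a full-domain read.

    `open_array` is a hypothetical zero-argument callable that returns an
    opened array object, for example
    `lambda: tiledb.SparseArray(ctx, path, mode="r")`.
    """
    arr = open_array()
    # Dummy query of the entire array; the result is discarded on purpose.
    _ = arr[:, :]
    return arr
```

With the interface used in this report, a call might look like A = open_with_warmup(lambda: tiledb.SparseArray(ctx, path, mode="r")). Closing A remains the caller's responsibility.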
I’m filing the bug here because I noticed it using the TileDB Python interface, but it’s possible it affects other language bindings too; I haven’t checked.
I’ve put a full set of data files and example code to reproduce here: http://mitra.stanford.edu/kundaje/cprobert/tiledb_bug/
In particular you might want to check out the html version of the notebook (since it’s easier to view without downloading): http://mitra.stanford.edu/kundaje/cprobert/tiledb_bug/tiledb_bug.html
I’ve also copied the contents of the notebook and output below. These results were generated with the TileDB-Py v0.2.1 pre-release, but I’ve been able to reproduce with the v0.2.0 release too. I’m using Python 3.6.6, and NumPy 1.15.0, but have also seen this bug in a Python 3.5 environment and with different NumPy versions. The platform is an Intel Xeon server running Ubuntu 16.04.5 LTS.
import tempfile
import os
import numpy as np
import pandas as ps
import tiledb
n_idxs = np.load('x_coords.npy')
m_idxs = np.load('y_coords.npy')
values = np.load('values.npy')
# Check the (x, y) coordinate pairs are unique
df = ps.DataFrame({'x': n_idxs, 'y': m_idxs})
df = ps.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).drop_duplicates()
assert df.shape[0] == n_idxs.shape[0]
num_entries = n_idxs.shape[0]
values_sum = values.sum()
ctx = tiledb.Ctx()
n = 249250621
m = 400
n_tile_extent = 50000
d1 = tiledb.Dim(ctx, "ndom", domain=(0, n - 1), tile=n_tile_extent, dtype="uint32")
d2 = tiledb.Dim(ctx, "mdom", domain=(0, m - 1), tile=m, dtype="uint32")
domain = tiledb.Domain(ctx, d1, d2)
v = tiledb.Attr(ctx, "v", compressor=("lz4", -1), dtype="uint8")
schema = tiledb.ArraySchema(
    ctx,
    domain=domain,
    attrs=(v,),
    capacity=10000,
    cell_order="row-major",
    tile_order="row-major",
    sparse=True,
)
with tempfile.TemporaryDirectory() as tdir:
    path = os.path.join(tdir, "arr.tiledb")
    tiledb.SparseArray.create(path, schema)
    with tiledb.SparseArray(ctx, path, mode="w") as A:
        A[n_idxs, m_idxs] = values
    print('\n>> 1: Reading empty query first blocks subsequent queries of non-empty cells:\n')
    with tiledb.SparseArray(ctx, path, mode="r") as A:
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
    print('\n>> 2: Reading non-empty query first allows subsequent queries of non-empty cells:\n')
    with tiledb.SparseArray(ctx, path, mode="r") as A:
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
outputs:
>> 1: Reading empty query first blocks subsequent queries of non-empty cells:
reading empty cell: 0 entries (expected 0)
reading whole array: 0 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
reading whole array: 0 entries (expected 54696022)
>> 2: Reading non-empty query first allows subsequent queries of non-empty cells:
reading whole array: 54696022 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
reading whole array: 54696022 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
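The two runs above can be folded into a single order-independence check. The sketch below is an illustration and not part of the original notebook: count_entries and check_read_order are hypothetical helpers, open_array is assumed to be a zero-argument callable returning an array usable as a context manager, and each query is assumed to return a mapping with a 'v' entry that has a shape, as A[...]['v'] does in the code above.

```python
import itertools


def count_entries(arr, selection):
    """Number of cells returned for attribute 'v' by arr[selection]."""
    return arr[selection]['v'].shape[0]


def check_read_order(open_array, queries):
    """Run the queries in every possible order, each on a fresh handle.

    `queries` maps a label to a (selection, expected_count) pair.
    Returns (order, label, got, expected) for every count that is wrong.
    """
    mismatches = []
    for order in itertools.permutations(queries):
        # Reopen the array for each ordering so earlier reads cannot leak in.
        with open_array() as arr:
            for label in order:
                selection, expected = queries[label]
                got = count_entries(arr, selection)
                if got != expected:
                    mismatches.append((order, label, got, expected))
    return mismatches
```

For this report, queries could be {"empty cell": ((0, 0), 0), "whole array": ((slice(None), slice(None)), num_entries)}. An empty result list would mean the counts do not depend on read order; with the behavior shown above, the ordering that starts with the empty cell would be expected to appear in the list.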
Top GitHub Comments
@chrisprobert thanks for the detailed bug report (and even data to reproduce!?). We have identified the issue in TileDB that causes this behavior when used from TileDB-Py. It will show up in the TileDB 1.3.3 / TileDB-Py 0.2.2 bugfix release in the next couple of days.
Great, happy to see this get fixed so fast.