SparseArray read order affects correctness

This is a weird one: it seems that in some cases, after opening a SparseArray in read-only mode, querying an empty subset of the domain causes all subsequent queries (even of non-empty subsets) to come back empty. Queries of empty subsets do not affect correctness if a query of a non-empty subset comes first.

My current workaround is just to perform a dummy query of the entire array (_ = arr[:, :]) every time I open one, and this seems to fix the issue, but it shouldn’t be necessary.
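
For anyone who wants to apply the same workaround without repeating it at every call site, here is a minimal sketch. The helper name `open_primed` is hypothetical (it is not part of TileDB-Py), and it assumes only that the opened object is a context manager that supports `arr[:, :]`. Note that the priming read loads the whole array, which can be expensive for large arrays.

```python
from contextlib import contextmanager


@contextmanager
def open_primed(opener, *args, **kwargs):
    """Open an array with `opener`, run one full-domain read, then yield it.

    `opener` is any callable returning a context manager that supports
    two-dimensional slicing, for example tiledb.SparseArray (assumption).
    """
    with opener(*args, **kwargs) as arr:
        _ = arr[:, :]  # dummy full read, result deliberately discarded
        yield arr


# Intended usage (assumes ctx and path are defined as in the example below):
#
#     with open_primed(tiledb.SparseArray, ctx, path, mode="r") as A:
#         n_ent = A[0, 0]['v'].shape[0]
```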

I’m filing the bug here because I noticed it using the TileDB Python interface, but it may affect other language bindings too; I haven’t checked.

I’ve put a full set of data files and example code to reproduce here: http://mitra.stanford.edu/kundaje/cprobert/tiledb_bug/

In particular you might want to check out the html version of the notebook (since it’s easier to view without downloading): http://mitra.stanford.edu/kundaje/cprobert/tiledb_bug/tiledb_bug.html

I’ve also copied the contents of the notebook and its output below. These results were generated with the TileDB-Py v0.2.1 pre-release, but I’ve been able to reproduce them with the v0.2.0 release too. I’m using Python 3.6.6 and NumPy 1.15.0, but have also seen this bug in a Python 3.5 environment and with different NumPy versions. The platform is an Intel Xeon server running Ubuntu 16.04.5 LTS.

import tempfile
import os

import numpy as np
import pandas as ps

import tiledb


n_idxs = np.load('x_coords.npy')
m_idxs = np.load('y_coords.npy')
values = np.load('values.npy')

# Check that the (x, y) coordinate pairs are unique. Rows are compared as
# ordered pairs, so (a, b) and (b, a) count as distinct coordinates.
df = ps.DataFrame({'x': n_idxs, 'y': m_idxs}).drop_duplicates()
assert df.shape[0] == n_idxs.shape[0]

num_entries = n_idxs.shape[0]
values_sum = values.sum()


ctx = tiledb.Ctx()

n = 249250621
m = 400

n_tile_extent = 50000

d1 = tiledb.Dim(ctx, "ndom", domain=(0, n - 1), tile=n_tile_extent, dtype="uint32")
d2 = tiledb.Dim(ctx, "mdom", domain=(0, m - 1), tile=m, dtype="uint32")

domain = tiledb.Domain(ctx, d1, d2)

v = tiledb.Attr(ctx, "v", compressor=("lz4", -1), dtype="uint8")

schema = tiledb.ArraySchema(
    ctx,
    domain=domain,
    attrs=(v,),
    capacity=10000,
    cell_order="row-major",
    tile_order="row-major",
    sparse=True,
)

with tempfile.TemporaryDirectory() as tdir:

    path = os.path.join(tdir, "arr.tiledb")

    tiledb.SparseArray.create(path, schema)

    with tiledb.SparseArray(ctx, path, mode="w") as A:
        A[n_idxs, m_idxs] = values
        
    print('\n>> 1: Reading empty query first blocks subsequent queries of non-empty cells:\n')
    
    with tiledb.SparseArray(ctx, path, mode="r") as A:
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
    
    
    print('\n>> 2: Reading non-empty query first allows subsequent queries of non-empty cells:\n')
    
    with tiledb.SparseArray(ctx, path, mode="r") as A:
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))

outputs:

>> 1: Reading empty query first blocks subsequent queries of non-empty cells:

reading empty cell: 0 entries (expected 0)
reading whole array: 0 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
reading whole array: 0 entries (expected 54696022)

>> 2: Reading non-empty query first allows subsequent queries of non-empty cells:

reading whole array: 54696022 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
reading whole array: 54696022 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
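
The comparison above can be condensed into a small reusable check. This is only a sketch: `count_entries` and `check_read_order` are hypothetical helper names, and the code assumes that each read returns a mapping from attribute name to an object with a `.shape`, as the results in the example do.

```python
def count_entries(arr, selection, attr="v"):
    """Return how many entries a read of `selection` yields for `attr`."""
    return arr[selection][attr].shape[0]


def check_read_order(open_array, empty_sel, full_sel, attr="v"):
    """Compare a non-empty read done after an empty read with one done first.

    `open_array` is a zero-argument callable returning a fresh array handle
    that works as a context manager. Returns a tuple
    (consistent, count_after_empty_read, count_when_read_first).
    """
    with open_array() as arr:
        count_entries(arr, empty_sel, attr)  # empty query goes first
        after_empty = count_entries(arr, full_sel, attr)

    with open_array() as arr:
        read_first = count_entries(arr, full_sel, attr)

    return after_empty == read_first, after_empty, read_first


# Intended usage (assumes ctx and path are defined as in the example above):
#
#     full = (slice(None), slice(None))      # same selection as A[:, :]
#     ok, after_empty, read_first = check_read_order(
#         lambda: tiledb.SparseArray(ctx, path, mode="r"), (0, 0), full)
```

With the behaviour reported here, the first element of the result would be False, because the two counts differ.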

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

jakebolewski commented, Aug 9, 2018

@chrisprobert thanks for the detailed bug report (and even data to reproduce!?). We have identified the issue in TileDB that causes this behavior when used from TileDB-Py. It will show up in the TileDB 1.3.3 / TileDB-Py 0.2.2 bugfix release in the next couple of days.
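
Going by the release numbers in this comment, code that has to run on both older and newer installs could decide from the TileDB-Py version whether the dummy read is still needed. The sketch below is an assumption-laden illustration: `needs_workaround` and `parse_release` are hypothetical names, the threshold 0.2.2 is taken from the comment rather than verified against a release, and pre-release suffixes are ignored.

```python
import re

# TileDB-Py release named in the maintainer comment as carrying the fix.
FIXED_IN = (0, 2, 2)


def parse_release(version):
    """Return the leading numeric components of a version string as a tuple.

    Suffixes such as "rc1" or ".dev1" are dropped, so a pre-release of 0.2.2
    is treated the same as 0.2.2 itself.
    """
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    return tuple(int(part) for part in match.group(0).split("."))


def needs_workaround(version):
    """True if `version` is older than the release said to contain the fix."""
    return parse_release(version) < FIXED_IN


# Intended usage (assumes Python 3.8+ and that the package is installed
# under the distribution name "tiledb"):
#
#     from importlib.metadata import version
#     if needs_workaround(version("tiledb")):
#         _ = arr[:, :]
```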

chrisprobert commented, Aug 15, 2018

Great, happy to see this get fixed so fast.
