Efficient 'stream'-processing
Hi, I'm currently evaluating TileDB for our project and I'm trying to find out how to read data efficiently. So let's say we create a DenseArray like this:
```python
import tiledb

ctx = tiledb.Ctx()
array_name = "stream_test"  # placeholder URI

d_xy = tiledb.Dim(ctx, "xy", domain=(1, 256 * 256), tile=32, dtype="uint64")
d_u = tiledb.Dim(ctx, "u", domain=(1, 128), tile=128, dtype="uint64")
d_w = tiledb.Dim(ctx, "w", domain=(1, 128), tile=128, dtype="uint64")
domain = tiledb.Domain(ctx, d_xy, d_u, d_w)
a1 = tiledb.Attr(ctx, "a1", compressor=None, dtype="float64")
arr = tiledb.DenseArray(
    ctx,
    array_name,
    domain=domain,
    attrs=(a1,),
    cell_order='row-major',
    tile_order='row-major'
)
```
When reading from the array using the slicing interface, with slices that fit the `d_xy` tiling, for example `arr[1:33]`, I noticed that a lot of time is spent copying data (I can provide a flamegraph if you are interested). So I'm trying to understand what is happening behind the scenes: in the `Domain` I created, the cells (space tiles) have a shape of (32, 128, 128), right? And are they saved linearly to disk?
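For concreteness, here is what I mean by a tile-aligned read; the shape check reflects my assumption about how the tile extents above translate into blocks:

```python
import numpy as np

# Read one tile extent along "xy" (the domain starts at 1).
block = arr[1:33]

# Depending on the TileDB-Py version, slicing may return an ndarray
# directly or a dict keyed by attribute name; either way, the "a1"
# data should cover exactly one space tile.
data = block['a1'] if isinstance(block, dict) else block
assert data.shape == (32, 128, 128)
assert data.dtype == np.float64
```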
I found the `read_direct` method, which should not involve a copy, but since it reads the whole array it won't work for huge arrays, and it won't be cache-efficient. We would like to process the data in chunks that fit into the L3 cache, so we thought working cell-by-cell, i.e. one tile-shaped chunk at a time, would be optimal.
Maybe using an interface like this:

```python
it = arr.cell_iterator(attrs=['a1'], ...)
for cell in it:
    assert isinstance(cell['a1'], np.ndarray)
    # ...do some work on this cell...
```
This way, workloads that process the whole array can be implemented such that TileDB can make sure the reading is done efficiently. If the processing is distributed across many machines, `cell_iterator` would need some way to specify which partition to return cells from.
(The `cell` could also carry some information about its 'geometry', i.e. which part of the array it covers.)
As an alternative, maybe the `read_direct` interface could be augmented to allow reading only part of the array, erroring out if the read crosses a cell boundary. That way, TileDB users could build something like the above interface themselves.
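To illustrate, here is a minimal sketch of what we could already build ourselves on top of the existing slicing interface; the `tile_iterator` helper is hypothetical, and it assumes the 1-based domain and tile extents from the schema above:

```python
import numpy as np

def tile_iterator(arr, attr='a1', extent=32, lo=1, hi=256 * 256):
    # Hypothetical helper: yield tile-aligned blocks along "xy",
    # stepping by the tile extent so every slice lines up with a
    # space-tile boundary.
    for start in range(lo, hi + 1, extent):
        block = arr[start:start + extent]
        # Depending on the TileDB-Py version, slicing may return an
        # ndarray directly or a dict keyed by attribute name.
        yield block[attr] if isinstance(block, dict) else block

for tile in tile_iterator(arr):
    # Each block is 32 * 128 * 128 float64 values (~4 MiB), small
    # enough to fit in a typical L3 cache.
    assert isinstance(tile, np.ndarray)
    # ...do some work on this tile...
```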
I’m just brain dumping, so let me know if this kind of feedback is useful!
cc @uellue
Top GitHub Comments
I am reopening this because it would be nice to track progress on these issues. Also, we should expose TileDB's `GLOBAL_ORDER` layout on the Python side, which would be necessary for streaming operation with minimal copies. It would also be nice to expose a high-level iterator over the discrete tiles in the dense case.
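For illustration, once exposed, a global-order read might look roughly like this (a sketch, not a committed API; the `order='G'` query parameter is hypothetical here):

```python
# Sketch only: read in GLOBAL_ORDER, i.e. in the on-disk tile layout,
# which avoids the copy that reshuffles cells into row-major order.
data = arr.query(attrs=('a1',), order='G')[1:33]
```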
Thanks for the notes @sk1p.
Btw, we are aware of short-circuit reads for HDFS (cc @npapa). This is on our roadmap, and in fact we will address it along with POSIX/Windows `mmap`. Please stay tuned 😃.