Efficient 'stream'-processing
Hi, I'm currently evaluating TileDB for our project and I'm trying to find out how to read data efficiently. So let's say we create a DenseArray like this:
```python
import tiledb

ctx = tiledb.Ctx()
array_name = "stream_test"  # placeholder URI

d_xy = tiledb.Dim(ctx, "xy", domain=(1, 256 * 256), tile=32, dtype="uint64")
d_u = tiledb.Dim(ctx, "u", domain=(1, 128), tile=128, dtype="uint64")
d_w = tiledb.Dim(ctx, "w", domain=(1, 128), tile=128, dtype="uint64")
domain = tiledb.Domain(ctx, d_xy, d_u, d_w)
a1 = tiledb.Attr(ctx, "a1", compressor=None, dtype="float64")
arr = tiledb.DenseArray(
    ctx,
    array_name,
    domain=domain,
    attrs=(a1,),
    cell_order='row-major',
    tile_order='row-major'
)
```
When reading from the array using the slicing interface, with slices that fit the `d_xy` tiling, for example `arr[1:33]`, I noticed that a lot of time is spent copying data (I can provide a flamegraph if you are interested). So I'm trying to understand what is happening behind the scenes: in the `Domain` I created, the cells (space tiles) have a shape of (32, 128, 128), right? And are they saved linearly to disk?
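For concreteness, here is what I mean by a tile-aligned read; the shape check reflects my assumption about how the tile extents above translate into blocks:

```python
import numpy as np

# Read one tile extent along "xy" (the domain starts at 1).
block = arr[1:33]

# Depending on the TileDB-Py version, slicing may return an ndarray
# directly or a dict keyed by attribute name; either way, the "a1"
# data should cover exactly one space tile.
data = block['a1'] if isinstance(block, dict) else block
assert data.shape == (32, 128, 128)
assert data.dtype == np.float64
```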
I found the `read_direct` method, which should not involve a copy, but since it reads the whole array it won't work for huge arrays, and it won't be cache-efficient. We would like to process the data in chunks that fit into the L3 cache, so we thought working cell-by-cell, i.e. one tile-shaped chunk at a time, would be optimal.
Maybe using an interface like this:

```python
it = arr.cell_iterator(attrs=['a1'], ...)
for cell in it:
    assert isinstance(cell['a1'], np.ndarray)
    # ...do some work on this cell...
```
This way, workloads that process the whole array can be implemented such that TileDB can make sure the reading is done efficiently. If the processing is distributed across many machines, `cell_iterator` would need some way to specify which partition to return cells from.
(The `cell` could also carry some information about its 'geometry', i.e. which part of the array it covers.)
As an alternative, maybe the `read_direct` interface could be augmented to allow reading only part of the array, erroring out if the read crosses a cell boundary. That way, TileDB users could build something like the above interface themselves.
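To illustrate, here is a minimal sketch of what we could already build ourselves on top of the existing slicing interface; the `tile_iterator` helper is hypothetical, and it assumes the 1-based domain and tile extents from the schema above:

```python
import numpy as np

def tile_iterator(arr, attr='a1', extent=32, lo=1, hi=256 * 256):
    # Hypothetical helper: yield tile-aligned blocks along "xy",
    # stepping by the tile extent so every slice lines up with a
    # space-tile boundary.
    for start in range(lo, hi + 1, extent):
        block = arr[start:start + extent]
        # Depending on the TileDB-Py version, slicing may return an
        # ndarray directly or a dict keyed by attribute name.
        yield block[attr] if isinstance(block, dict) else block

for tile in tile_iterator(arr):
    # Each block is 32 * 128 * 128 float64 values (~4 MiB), small
    # enough to fit in a typical L3 cache.
    assert isinstance(tile, np.ndarray)
    # ...do some work on this tile...
```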
I’m just brain dumping, so let me know if this kind of feedback is useful!
cc @uellue
Top GitHub Comments
I am reopening this because it would be nice to track progress on these issues. Also, we should expose TileDB's `GLOBAL_ORDER` layout on the Python side, which would be necessary for streaming operation with minimal copies. It would also be nice to expose a high-level iterator over the discrete tiles in the dense case.
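For illustration, once exposed, a global-order read might look roughly like this (a sketch, not a committed API; the `order='G'` query parameter is hypothetical here):

```python
# Sketch only: read in GLOBAL_ORDER, i.e. in the on-disk tile layout,
# which avoids the copy that reshuffles cells into row-major order.
data = arr.query(attrs=('a1',), order='G')[1:33]
```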
Thanks for the notes @sk1p.
Btw, we are aware of short-circuit reads for HDFS (cc @npapa). This is on our roadmap, and in fact we will address it along with POSIX/Windows `mmap`. Please stay tuned 😃.