
Efficient 'stream'-processing


Hi, I’m currently evaluating TileDB for our project and I’m trying to find out how to efficiently read data. So let’s say we create a DenseArray like this:

    import tiledb

    # Setup assumed by the snippet (names illustrative); this uses the
    # tiledb-py API of the time, where a Ctx was passed to each object.
    ctx = tiledb.Ctx()
    array_name = "my_dense_array"

    d_xy = tiledb.Dim(ctx, "xy", domain=(1, 256 * 256), tile=32, dtype="uint64")
    d_u = tiledb.Dim(ctx, "u", domain=(1, 128), tile=128, dtype="uint64")
    d_w = tiledb.Dim(ctx, "w", domain=(1, 128), tile=128, dtype="uint64")
    domain = tiledb.Domain(ctx, d_xy, d_u, d_w)
    a1 = tiledb.Attr(ctx, "a1", compressor=None, dtype="float64")
    arr = tiledb.DenseArray(
        ctx,
        array_name,
        domain=domain,
        attrs=(a1,),
        cell_order='row-major',
        tile_order='row-major'
    )

When reading from the array using the slicing interface, with slices aligned to the d_xy tiling (for example arr[1:33]), I noticed that a lot of time is spent copying data (I can provide a flamegraph if you are interested). So I’m trying to understand what is happening behind the scenes: in the Domain I created, the tiles (what I call ‘cells’ below) have a shape of (32, 128, 128), right? And are they stored linearly on disk?
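For intuition, here is the tile geometry the schema above implies; tile_bounds is a hypothetical helper doing plain arithmetic, not a TileDB API:

    # Tile extents are (32, 128, 128) and every domain starts at 1, so the
    # array splits into (256 * 256) / 32 = 2048 space tiles along d_xy.
    def tile_bounds(domain_start, domain_end, tile_extent):
        """Yield (lo, hi) inclusive bounds of each tile along one dimension."""
        lo = domain_start
        while lo <= domain_end:
            yield (lo, min(lo + tile_extent - 1, domain_end))
            lo += tile_extent

    # The first tile along d_xy covers cells 1..32, so the half-open slice
    # arr[1:33] is exactly tile-aligned.
    assert next(tile_bounds(1, 256 * 256, 32)) == (1, 32)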

I found the read_direct method, which should not involve a copy, but since it reads the whole array it won’t work for huge arrays, and it won’t be cache-efficient. We would like to process the data in chunks that fit into the L3 cache, so we thought working cell-by-cell would be optimal.
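One way to approximate this today would be to read one tile-aligned slab at a time through the slicing interface. A minimal sketch, assuming arr[lo:hi] hands back the attribute data for that subarray; the cache arithmetic is only illustrative:

    # One slab of shape (32, 128, 128) float64 holds 32 * 128 * 128 cells,
    # i.e. 524288 * 8 bytes = 4 MiB, roughly L3-sized on many machines.
    TILE_XY = 32
    XY_END = 256 * 256

    for lo in range(1, XY_END + 1, TILE_XY):
        slab = arr[lo:lo + TILE_XY]   # half-open, tile-aligned slice
        # ...process slab while it is still hot in cache...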

Maybe using an interface like this:

    for cell in arr.cell_iterator(attrs=['a1'], ...):  # pseudocode; remaining args elided
        assert isinstance(cell['a1'], np.ndarray)
        # ...do some work on this cell...

This way, workloads that process the whole array can be implemented so that TileDB can ensure the reading is done efficiently. If the processing is distributed across many machines, cell_iterator would also need some way to specify which partition to return cells from (see the sketch below).

(Each cell could also carry some information about its ‘geometry’, i.e. which part of the array it covers.)
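A user-space version of this interface, including the partition idea, could be sketched on top of plain slicing. All names here, and the (k, n) round-robin partition scheme, are hypothetical, not an existing TileDB API:

    def cell_iterator(arr, tile_extent=32, domain_end=256 * 256, partition=(0, 1)):
        """Yield tile-aligned chunks of attribute a1, with their 'geometry'.

        partition=(k, n) returns every n-th tile starting at tile k, which is
        one simple way to split the work across n machines."""
        k, n = partition
        for i, lo in enumerate(range(1, domain_end + 1, tile_extent)):
            if i % n == k:
                yield {
                    'a1': arr[lo:lo + tile_extent],
                    'bounds': (lo, lo + tile_extent - 1),  # where the chunk sits
                }

With n workers, worker k would iterate cell_iterator(arr, partition=(k, n)) and touch a disjoint subset of the tiles.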

As an alternative, maybe the read_direct interface could be augmented to allow reading only part of the array, erring out if the requested range crosses a cell boundary. That way, TileDB users could build something like the above interface themselves.
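A guarded partial read along those lines might look like this sketch; read_partial and its alignment check are assumptions layered on top of slicing, not an actual extension of read_direct:

    def read_partial(arr, lo, hi, tile_extent=32, domain_start=1):
        """Read the inclusive range [lo, hi] along d_xy, erring out if it
        crosses a cell (tile) boundary."""
        misaligned = (lo - domain_start) % tile_extent != 0
        partial = (hi - lo + 1) % tile_extent != 0
        if misaligned or partial:
            raise ValueError(
                "range [%d, %d] crosses a cell boundary (tile extent %d)"
                % (lo, hi, tile_extent))
        return arr[lo:hi + 1]   # half-open slice covering the inclusive range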

I’m just brain dumping, so let me know if this kind of feedback is useful!

cc @uellue

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
jakebolewski commented, Apr 13, 2018

I am reopening this because it would be nice to track progress on these issues. Also, we should expose TileDB’s GLOBAL_ORDER layout on the Python side, which would be necessary for a streaming operation with minimal copies.

It would also be nice to expose a high-level iterator over the discrete tiles in the dense case.
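For illustration only, a conceptual sketch of such an iterator; global_tile_iterator is hypothetical, and the claim that this matches the on-disk order rests on the schema above (row-major tile order, with u and w each holding a single tile, so global order reduces to walking the 2048 tiles along d_xy in sequence):

    def global_tile_iterator(arr, tile_extent=32, domain_end=256 * 256):
        # Each yielded chunk corresponds to one (32, 128, 128) space tile,
        # which TileDB stores contiguously; a real GLOBAL_ORDER read could
        # hand these back with minimal copying.
        for lo in range(1, domain_end + 1, tile_extent):
            yield arr[lo:lo + tile_extent]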

1 reaction
stavrospapadopoulos commented, Apr 18, 2018

Thanks for the notes @sk1p.

Btw, we are aware of short-circuit reads for HDFS (cc @npapa). This is on our roadmap, and in fact we will address it along with POSIX/Windows mmap. Please stay tuned 😃.
