
Flow.chunked with specified time period

See original GitHub issue

Currently Flow supports only a buffer operator with a capacity. It would be useful to also buffer elements within a specified time range, e.g. flow.buffer(Duration.ofSeconds(5)).collect { ... }
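For comparison, the Kotlin standard library already offers size-based chunking for collections; what the issue asks for is the Flow analogue with a time dimension added:

```kotlin
fun main() {
    // Stdlib size-based chunking on a collection: groups of at most 3 elements.
    // The request here is an equivalent operator on Flow that can also cut
    // chunks on a time interval, not just on a count.
    val chunks = (1..7).toList().chunked(3)
    println(chunks) // [[1, 2, 3], [4, 5, 6], [7]]
}
```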

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 62
  • Comments: 35 (8 by maintainers)

Top GitHub Comments

9 reactions · circusmagnus commented, Sep 3, 2021

Half a year later… What about such a signature?

fun <T> Flow<T>.chunk(
    maxSize: Int,
    naturalBatching: Boolean = false,
    delayUntilNext: (suspend (previousChunk: List<T>?) -> Unit)? = null
): Flow<List<T>>

Example usage:

  • flow.chunk(10) — simple size-based chunking
  • flow.chunk(512) { delay(5.minutes) } — accumulate values for 5 minutes with a max size of 512
  • flow.chunk(512, naturalBatching = true) — natural batching variant with a max size
  • flow.chunk(512) { previousChunk -> if (previousChunk?.size == 512) Unit else delay(5.seconds) } — speed up if the buffer is getting full before chunk emission
  • flow.chunk(512) { semaphoreChannel.receive() } — emit a chunk after a signal from some external source

Pros:

  • one function covering a wide array of chunking possibilities
  • the chunk interval can be dynamically adjusted by the user in various ways
  • not much ceremony in usage

Cons:

  • no way of deciding whether to emit or suspend on buffer overflow. But is that really needed? Perhaps emitting when full is a good default, and it is not worth bothering people with thinking too much about it. If downstream cannot keep up, upstream will get suspended anyway; if downstream can process chunks fast enough to offload a full buffer, why not let it? Or maybe suspending when full would be the more intuitive behaviour?
  • still not as simple as flow.chunk(maxSize = 512, interval = 5.minutes)

9 reactions · circusmagnus commented, Dec 23, 2020

Ok, I will have another go at this issue. @elizarov mentioned that there is no obvious case for minSize and that it complicates things. So let's assume that the minimum chunk size is always one. That seems to fit all use cases:

  • Size-based chunking: no need to specify both min and max size; a single parameter is enough
  • Time-based chunking: no use cases for zero-sized chunks. While I would be happy to bump up the minSize of my analytics chunks, I can do that easily enough with another operator further downstream
  • Natural batching: a min size of 1 is the obvious choice here

No minSize param then.
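The "other operator further downstream" workaround mentioned above could look something like this sketch, shown over Sequence rather than Flow to keep it dependency-free (coalesceSmall is a hypothetical helper, not a library function):

```kotlin
// Hypothetical downstream helper: merge undersized chunks until they reach
// minSize, so the upstream chunking operator never needs a minSize parameter.
fun <T> Sequence<List<T>>.coalesceSmall(minSize: Int): Sequence<List<T>> = sequence {
    val pending = mutableListOf<T>()
    for (chunk in this@coalesceSmall) {
        pending += chunk
        if (pending.size >= minSize) {
            yield(pending.toList())
            pending.clear()
        }
    }
    if (pending.isNotEmpty()) yield(pending.toList()) // trailing partial chunk
}

fun main() {
    val merged = sequenceOf(listOf(1), listOf(2), listOf(3, 4)).coalesceSmall(2).toList()
    println(merged) // [[1, 2], [3, 4]]
}
```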

That leaves us with duration and size params: fun <T> Flow<T>.chunked(interval: Duration, size: Int): Flow<List<T>>

How to express our use cases with it…

  • Natural batching:
chunked(interval = Duration.ZERO /* emit as soon as possible */, size = someLargishValue /* how big a buffer you are willing to store */)
  • Time-based chunking:
chunked(interval = x /* your main consideration */, size = maxAcceptableBuffer)
  • Size-based chunking: this one gets trickier, as we do not need a duration at all:
chunked(interval = NO_INTERVAL /* technically Duration.INFINITE or such */, size = desiredBufferSize)

Since our ‘size’ parameter is either the maximum size for time-based chunking or simply the desired size, we should try to emit immediately when it is reached, suspending upstream if need be.

The interval param is a little different. We cannot guarantee that a chunk will be emitted exactly after the interval has passed. Since the interval does not relate to size, I think we can safely assume it is OK to keep buffering subsequent elements after the interval has passed, even if we cannot emit yet due to a busy downstream.

In other words: on reaching the size limit, we suspend upstream until emission happens, so a chunk cannot grow bigger than specified. On reaching the time limit, we do not suspend upstream, regardless of whether we emitted or are still waiting for downstream to become ready. The time limit may be breached due to a busy downstream; we cannot prevent that.
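These rules can be modelled offline, without coroutines, as a sketch over timestamped events (chunkEvents is a hypothetical illustration, not part of the proposal): a chunk closes when it reaches maxSize, or when its first element is older than the interval.

```kotlin
// Offline model of the proposed semantics: close a chunk when it reaches
// maxSize, or when `intervalMs` has elapsed since its first element arrived.
// A real Flow operator would additionally have to cope with a busy downstream.
fun <T> chunkEvents(
    events: List<Pair<Long, T>>, // (timestampMs, value), in arrival order
    intervalMs: Long,
    maxSize: Int
): List<List<T>> {
    val chunks = mutableListOf<List<T>>()
    var current = mutableListOf<T>()
    var chunkStart = 0L
    for ((ts, value) in events) {
        if (current.isNotEmpty() && ts - chunkStart >= intervalMs) {
            chunks += current            // time limit reached: close the chunk
            current = mutableListOf()
        }
        if (current.isEmpty()) chunkStart = ts // chunk opens with its first element
        current += value
        if (current.size == maxSize) {   // size limit reached: close the chunk
            chunks += current
            current = mutableListOf()
        }
    }
    if (current.isNotEmpty()) chunks += current
    return chunks
}

fun main() {
    val events = listOf(0L to "a", 1L to "b", 2L to "c", 10L to "d", 11L to "e")
    println(chunkEvents(events, intervalMs = 5, maxSize = 3)) // [[a, b, c], [d, e]]
}
```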

That shapes our design into chunked(intervalConstraint OR sizeConstraint) consistently across all use cases.

So the proposal boils down to fun <T> Flow<T>.chunked(interval: Duration, size: Int): Flow<List<T>>

Proposed impl (give or take - no sanity checks, etc):

// Note: scopedFlow is internal to kotlinx.coroutines (channelFlow would be the
// public building block today), and the original snippet used the since-
// deprecated offer/onReceiveOrClosed, replaced below with their current
// equivalents trySend/onReceiveCatching.
public fun <T> Flow<T>.chunked(interval: Duration, size: Int): Flow<List<T>> = scopedFlow { downstream ->
    val buffer = Channel<T>(size)
    val emitSemaphore = Channel<Unit>()
    val collectSemaphore = Channel<Unit>()

    launch {
        collect { value ->
            val hasCapacity = buffer.trySend(value).isSuccess
            if (!hasCapacity) {
                emitSemaphore.send(Unit)   // ask the emitter to drain the buffer
                collectSemaphore.receive() // wait until it has done so
                buffer.send(value)
            }
        }
        emitSemaphore.close()
        buffer.close()
    }

    whileSelect {

        emitSemaphore.onReceiveCatching { result ->
            buffer.drain().takeIf { it.isNotEmpty() }?.let { downstream.emit(it) }
            val shouldCollectNextChunk = !result.isClosed
            if (shouldCollectNextChunk) collectSemaphore.send(Unit)
            else collectSemaphore.close()
            shouldCollectNextChunk
        }

        onTimeout(interval) {
            downstream.emit(buffer.awaitFirstAndDrain())
            true
        }
    }
}

Helper functions:

private suspend fun <T> ReceiveChannel<T>.awaitFirstAndDrain(): List<T> {
    // receiveCatching() replaces the deprecated receiveOrClosed()
    val first = receiveCatching().getOrNull() ?: return emptyList()
    return drain(mutableListOf(first))
}

private tailrec fun <T> ReceiveChannel<T>.drain(acc: MutableList<T> = mutableListOf()): List<T> {
    val item = tryReceive().getOrNull() // replaces the deprecated poll()
    return if (item == null) acc
    else {
        acc.add(item)
        drain(acc)
    }
}

Plus an optimized, non-concurrent impl for purely size-based chunking (the original called an undefined MutableList.drain(); emitting a copy and clearing the buffer does the same job):

private fun <T> Flow<T>.chunkedSizeBased(maxSize: Int): Flow<List<T>> = flow {
    val buffer = mutableListOf<T>()
    collect { value ->
        buffer.add(value)
        if (buffer.size == maxSize) {
            emit(buffer.toList()) // emit a snapshot, then reuse the buffer
            buffer.clear()
        }
    }
    if (buffer.isNotEmpty()) emit(buffer.toList())
}