
Flow.chunked with specified time period

See original GitHub issue

Currently Flow supports only a buffer operator with a capacity. It would be useful to also buffer elements within a specified time range, e.g. flow.buffer(Duration.ofSeconds(5)).collect { ... }
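For comparison, the Kotlin standard library already offers size-based chunking for collections; what the issue asks for is the Flow analogue with a time dimension added:

```kotlin
fun main() {
    // Stdlib size-based chunking on a collection: groups of at most 3 elements.
    // The request here is an equivalent operator on Flow that can also cut
    // chunks on a time interval, not just on a count.
    val chunks = (1..7).toList().chunked(3)
    println(chunks) // [[1, 2, 3], [4, 5, 6], [7]]
}
```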

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 62
  • Comments: 35 (8 by maintainers)

Top GitHub Comments

9 reactions · circusmagnus commented, Sep 3, 2021

Half a year later… What about such a signature?

fun <T> Flow<T>.chunk(
    maxSize: Int,
    naturalBatching: Boolean = false,
    delayUntilNext: (suspend (previousChunk: List<T>?) -> Unit)? = null
): Flow<List<T>>

Example usage:

  • flow.chunk(10) — simple size-based chunking
  • flow.chunk(512) { delay(5.minutes) } — accumulate values for 5 minutes with a max size of 512
  • flow.chunk(512, naturalBatching = true) — natural batching variant with a max size
  • flow.chunk(512) { previousChunk -> if (previousChunk?.size == 512) Unit else delay(5.seconds) } — speed up if the buffer is getting full before chunk emission
  • flow.chunk(512) { semaphoreChannel.receive() } — emit a chunk after a signal from some external source

Pros:

  • one function covering a wide array of chunking possibilities
  • the chunk interval can be dynamically adjusted by the user in various ways
  • not much ceremony in usage

Cons:

  • no way of deciding whether to emit or suspend on buffer overflow. But is that really needed? Perhaps emitting when full is a good default, and it is not worth bothering people with thinking too much about it. If downstream cannot keep up, upstream will get suspended anyway; if downstream can process chunks fast enough to offload a full buffer, why not let it? Or maybe suspending when full would be the more intuitive behaviour?
  • still not as simple as flow.chunk(maxSize = 512, interval = 5.minutes)

9 reactions · circusmagnus commented, Dec 23, 2020

Ok, I will have another go at this issue. @elizarov mentioned that there is no obvious case for minSize and that it complicates things. So let's assume that the minimum chunk size is always one. That seems to fit all use cases:

  • Size-based chunking: no need to specify both min and max size; a single parameter is enough
  • Time-based chunking: no use cases for zero-sized chunks. While I would be happy to bump up the minSize of my analytics chunks, I can do that easily enough with another operator further downstream
  • Natural batching: a min size of 1 is the obvious choice here

No minSize param then.
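The "other operator further downstream" workaround mentioned above could look something like this sketch, shown over Sequence rather than Flow to keep it dependency-free (coalesceSmall is a hypothetical helper, not a library function):

```kotlin
// Hypothetical downstream helper: merge undersized chunks until they reach
// minSize, so the upstream chunking operator never needs a minSize parameter.
fun <T> Sequence<List<T>>.coalesceSmall(minSize: Int): Sequence<List<T>> = sequence {
    val pending = mutableListOf<T>()
    for (chunk in this@coalesceSmall) {
        pending += chunk
        if (pending.size >= minSize) {
            yield(pending.toList())
            pending.clear()
        }
    }
    if (pending.isNotEmpty()) yield(pending.toList()) // trailing partial chunk
}

fun main() {
    val merged = sequenceOf(listOf(1), listOf(2), listOf(3, 4)).coalesceSmall(2).toList()
    println(merged) // [[1, 2], [3, 4]]
}
```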

That leaves us with duration and size params: fun <T> Flow<T>.chunked(interval: Duration, size: Int): Flow<List<T>>

How to express our use cases with it…

  • Natural batching:
chunked(interval = Duration.ZERO /* emit as soon as possible */, size = someLargishValue /* how big a buffer you are willing to store */)
  • Time-based chunking:
chunked(interval = x /* your main consideration */, size = maxAcceptableBuffer)
  • Size-based chunking: this one gets trickier, as we do not need a duration at all:
chunked(interval = NO_INTERVAL /* technically Duration.INFINITE or such */, size = desiredBufferSize)

Since our ‘size’ parameter is either the maximum size for time-based chunking or simply the desired size, we should try to emit immediately when it is reached, suspending upstream if need be.

The interval param is a little different. We cannot guarantee that a chunk will be emitted exactly after the interval has passed. Since the interval does not relate to size, I think we can safely assume it is OK to keep buffering subsequent elements after the interval has passed, even if we cannot emit yet due to a busy downstream.

In other words: on reaching the size limit, we suspend upstream until emission happens, so a chunk cannot grow bigger than specified. On reaching the time limit, we do not suspend upstream, regardless of whether we emitted or are still waiting for downstream to become ready. The time limit may be breached due to a busy downstream; we cannot prevent that.
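These rules can be modelled offline, without coroutines, as a sketch over timestamped events (chunkEvents is a hypothetical illustration, not part of the proposal): a chunk closes when it reaches maxSize, or when its first element is older than the interval.

```kotlin
// Offline model of the proposed semantics: close a chunk when it reaches
// maxSize, or when `intervalMs` has elapsed since its first element arrived.
// A real Flow operator would additionally have to cope with a busy downstream.
fun <T> chunkEvents(
    events: List<Pair<Long, T>>, // (timestampMs, value), in arrival order
    intervalMs: Long,
    maxSize: Int
): List<List<T>> {
    val chunks = mutableListOf<List<T>>()
    var current = mutableListOf<T>()
    var chunkStart = 0L
    for ((ts, value) in events) {
        if (current.isNotEmpty() && ts - chunkStart >= intervalMs) {
            chunks += current            // time limit reached: close the chunk
            current = mutableListOf()
        }
        if (current.isEmpty()) chunkStart = ts // chunk opens with its first element
        current += value
        if (current.size == maxSize) {   // size limit reached: close the chunk
            chunks += current
            current = mutableListOf()
        }
    }
    if (current.isNotEmpty()) chunks += current
    return chunks
}

fun main() {
    val events = listOf(0L to "a", 1L to "b", 2L to "c", 10L to "d", 11L to "e")
    println(chunkEvents(events, intervalMs = 5, maxSize = 3)) // [[a, b, c], [d, e]]
}
```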

That shapes our design into chunked(intervalConstraint OR sizeConstraint) consistently across all use cases.

So the proposal boils down to fun <T> Flow<T>.chunked(interval: Duration, size: Int): Flow<List<T>>

Proposed impl (give or take - no sanity checks, etc):

// Note: scopedFlow is internal to kotlinx.coroutines (channelFlow would be the
// public building block today), and the original snippet used the since-
// deprecated offer/onReceiveOrClosed, replaced below with their current
// equivalents trySend/onReceiveCatching.
public fun <T> Flow<T>.chunked(interval: Duration, size: Int): Flow<List<T>> = scopedFlow { downstream ->
    val buffer = Channel<T>(size)
    val emitSemaphore = Channel<Unit>()
    val collectSemaphore = Channel<Unit>()

    launch {
        collect { value ->
            val hasCapacity = buffer.trySend(value).isSuccess
            if (!hasCapacity) {
                emitSemaphore.send(Unit)   // ask the emitter to drain the buffer
                collectSemaphore.receive() // wait until it has done so
                buffer.send(value)
            }
        }
        emitSemaphore.close()
        buffer.close()
    }

    whileSelect {

        emitSemaphore.onReceiveCatching { result ->
            buffer.drain().takeIf { it.isNotEmpty() }?.let { downstream.emit(it) }
            val shouldCollectNextChunk = !result.isClosed
            if (shouldCollectNextChunk) collectSemaphore.send(Unit)
            else collectSemaphore.close()
            shouldCollectNextChunk
        }

        onTimeout(interval) {
            downstream.emit(buffer.awaitFirstAndDrain())
            true
        }
    }
}

Helper functions:

private suspend fun <T> ReceiveChannel<T>.awaitFirstAndDrain(): List<T> {
    // receiveCatching() replaces the deprecated receiveOrClosed()
    val first = receiveCatching().getOrNull() ?: return emptyList()
    return drain(mutableListOf(first))
}

private tailrec fun <T> ReceiveChannel<T>.drain(acc: MutableList<T> = mutableListOf()): List<T> {
    val item = tryReceive().getOrNull() // replaces the deprecated poll()
    return if (item == null) acc
    else {
        acc.add(item)
        drain(acc)
    }
}

Plus an optimized, non-concurrent impl for purely size-based chunking (the original called an undefined MutableList.drain(); emitting a copy and clearing the buffer does the same job):

private fun <T> Flow<T>.chunkedSizeBased(maxSize: Int): Flow<List<T>> = flow {
    val buffer = mutableListOf<T>()
    collect { value ->
        buffer.add(value)
        if (buffer.size == maxSize) {
            emit(buffer.toList()) // emit a snapshot, then reuse the buffer
            buffer.clear()
        }
    }
    if (buffer.isNotEmpty()) emit(buffer.toList())
}