
Creating DAGs with channels - proposing transforms and pipes

See original GitHub issue

Directed Graphs

Think directed graphs of channels & coroutines with producers as message sources and actors as sinks. What’s missing in this picture is:

  • intermediate transform nodes
  • graph edges / interconnects or piping

Node.js Streams, for example, use similar concepts to achieve high throughput and performance.

The Proposal

I’m proposing adding a couple of coroutine builders (note that these are early prototypes and not up to par with produce et al.):

  • transform coroutine - like produce & actor, a transform is a combination of a coroutine, the state that is confined to and encapsulated by that coroutine, and two channels for communicating with upstream and downstream coroutines.
  • pipe coroutine - a stateless coroutine that consumes messages from a ReceiveChannel and sends them to a downstream SendChannel. When the downstream SendChannel is part of a Channel, it returns the downstream channel’s ReceiveChannel for further chaining (like a shell pipe sequence $ cmd1 | cmd2 | cmd3 ...).
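These builders exist only in the author's prototype. As a rough sketch of the intended shape (the signatures below are assumptions, expressed with today's kotlinx.coroutines `produce`):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce

// Hypothetical transform builder: consumes T from upstream, may hold
// coroutine-confined state, and emits R downstream.
@OptIn(ExperimentalCoroutinesApi::class)
fun <T, R> CoroutineScope.transform(
    upstream: ReceiveChannel<T>,
    block: suspend (T) -> R
): ReceiveChannel<R> = produce {
    for (msg in upstream) send(block(msg))
}

// Hypothetical pipe: feeds this channel into a transform and returns the
// downstream ReceiveChannel, enabling shell-style chaining.
fun <T, R> ReceiveChannel<T>.pipe(
    scope: CoroutineScope,
    block: suspend (T) -> R
): ReceiveChannel<R> = scope.transform(this, block)
```

Because `pipe` returns the downstream `ReceiveChannel`, calls compose left to right just like `cmd1 | cmd2 | cmd3`.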


Example 1

This example reads blocks (default 1024) as ByteBuffers from a file and decodes the blocks to UTF-8.

val data : String //the file's contents

FS.createReader(inputFile.toPath())
  .pipe(decodeUtf8())
  .pipe(contents {
    assertEquals(data, it.joinToString(""))
  })
  .drainAndJoin()

  • createReader returns a ReceiveChannel wrapper around aRead
  • decodeUtf8 receives ByteBuffers and emits Strings
  • contents is a transform that returns a list of all messages after the channel closes
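`FS.createReader`, `decodeUtf8`, and `contents` are all from the author's prototype. A minimal sketch of how a `decodeUtf8` stage could look as a channel coroutine (signature and naming are assumptions):

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce

// Sketch: decode each incoming ByteBuffer block to a UTF-8 String.
// Note: a production decoder must carry partial multi-byte sequences
// across block boundaries; this sketch decodes each block independently.
@OptIn(ExperimentalCoroutinesApi::class)
fun CoroutineScope.decodeUtf8(
    upstream: ReceiveChannel<ByteBuffer>
): ReceiveChannel<String> = produce {
    for (buf in upstream) send(StandardCharsets.UTF_8.decode(buf).toString())
}
```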

Example 2

Like the previous example, this one reads blocks and converts them to UTF-8 strings, then further splits the text into lines and counts them.

val data : String //the file's contents
val lines = data.split("\n")

val listener = Channel<String>()
val count = async(coroutineContext) {
  listener.count()
}
val teeListener = tee(listener, context = coroutineContext)

FS.createReader(inputFile.toPath())
  .pipe(decodeUtf8())
  .pipe(splitter)
  .pipe(teeListener)
  .pipe(contents {
    assertEquals(lines.size, it.size)
  })
  .drainAndJoin()

assertEquals(lines.size, count.await())
  • splitter splits incoming String blocks into individual lines and pushes each line as a message on its downstream channel.
  • tee is a passthrough transform that replicates each message to the provided channel
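`tee` is also a prototype; one plausible sketch (the signature is an assumption) forwards every message downstream while copying it to a side channel, closing the side channel on completion so listeners like `listener.count()` can finish:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.SendChannel
import kotlinx.coroutines.channels.produce

// Hypothetical tee: passes each message through unchanged while also
// sending a copy to the side channel; closes the side channel when the
// upstream completes.
@OptIn(ExperimentalCoroutinesApi::class)
fun <T> CoroutineScope.tee(
    side: SendChannel<T>,
    upstream: ReceiveChannel<T>
): ReceiveChannel<T> = produce {
    for (msg in upstream) {
        side.send(msg)
        send(msg)
    }
    side.close()
}
```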

Current Alternatives

As @jcornaz points out in the discussion below, transforms (with state) can be implemented as extensions of ReceiveChannel. The snippets (from this test) below contrast the two approaches (extensions are cleaner):

With Transforms/Pipes

 dataProducer()                 // emit chunks of 512 bytes
  .pipe(tee(verifyLength))      // verify that we're getting all of the data
  .pipe(splitter(NL))           // split into lines
  .pipe(counter(lineCounter))   // count lines
  .pipe(splitter(WS, true))     // split lines into words (words are aligned)
  .pipe(counter(wordCounter))   // count words
  .drainAndJoin()               // wait

With Extensions

dataProducer()                  // emit chunks of 512 bytes
  .tee(verifyLength)            // verify that we're getting all of the data
  .split(NL)                    // split into lines
  .countMessages(lineCounter)   // count lines
  .split(WS, true)              // split lines into words (words are aligned)
  .countMessages(wordCounter)   // count words
  .drain()                      // wait
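To make the extension style concrete, here is a sketch of how a stateful stage such as `countMessages` could be written directly as a `ReceiveChannel` extension (the signature and the `AtomicLong` counter are assumptions, not the author's code):

```kotlin
import java.util.concurrent.atomic.AtomicLong
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce

// Hypothetical extension-style stage: counts messages as they pass
// through, keeping the running count in a caller-supplied AtomicLong.
@OptIn(ExperimentalCoroutinesApi::class)
fun <T> ReceiveChannel<T>.countMessages(
    scope: CoroutineScope,
    counter: AtomicLong
): ReceiveChannel<T> = scope.produce {
    for (msg in this@countMessages) {
        counter.incrementAndGet()
        send(msg)
    }
}
```

Because the stage is an extension on `ReceiveChannel`, it chains with plain method calls and needs no separate `pipe` combinator.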

Question

Question: is this something the team considers worth pursuing?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

2 reactions
gildor commented, Apr 9, 2018

It’s not clear to me how this new API is better than standard map/flatMap. Could you show a working, self-contained example? I would like to try to rewrite it using existing channel operators and compare results.
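As an editorial aside: the channel operators referred to here were later deprecated in kotlinx.coroutines in favour of Flow, so a present-day comparison point for the line-splitting pipeline might use stable Flow operators instead (this is an illustrative sketch, not code from the thread):

```kotlin
import kotlinx.coroutines.flow.asFlow
import kotlinx.coroutines.flow.count
import kotlinx.coroutines.flow.transform
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // Split each chunk into lines and count them. This counts
    // chunk-local lines only; a real splitter must merge lines that
    // span chunk boundaries (the "alignment" concern above).
    val lineCount = listOf("a\nb", "c\nd").asFlow()
        .transform { chunk -> chunk.split("\n").forEach { emit(it) } }
        .count()
    check(lineCount == 4)
}
```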

1 reaction
SolomonSun2010 commented, Aug 20, 2018

Data-driven concurrency is cool! See also: https://dl.acm.org/citation.cfm?doid=3154814.3162014

Building Scalable, Highly Concurrent & Fault Tolerant Systems - Lessons Learned https://www.slideshare.net/jboner/building-scalable-highly-concurrent-fault-tolerant-systems-lessons-learned?from_action=save

Dataflow concurrency:

  • Deterministic
  • Declarative
  • Data-driven
  • Threads are suspended until data is available
  • Lazy & on-demand
  • No difference between concurrent code and sequential code
  • Examples: Akka & GPars
