
[RFC] Streaming Index API


Is your feature request related to a problem? Please describe.
The current _bulk indexing API places a high configuration burden on users to avoid RejectedExecutionException rejections due to TOO_MANY_REQUESTS. This forces users to “experiment” with bulk batch sizes, multi-threading, refresh intervals, etc.
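To make that tuning burden concrete, here is a sketch of what users do today with the opensearch-py bulk helpers; the host, index name, document shape, and every parameter value are illustrative assumptions, not recommendations:

```python
# Illustrative only: the tuning burden the RFC wants to remove.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Knob 1: disable refresh during the load, restore it afterwards.
client.indices.put_settings(
    index="my-index", body={"index": {"refresh_interval": "-1"}}
)

docs = ({"_index": "my-index", "_source": {"field": i}} for i in range(100_000))

# Knobs 2-3: chunk size and thread count must be tuned by hand to stay
# under the server's rejection thresholds.
for ok, item in helpers.parallel_bulk(
    client,
    docs,
    thread_count=4,       # too high -> TOO_MANY_REQUESTS rejections
    chunk_size=500,       # too large -> oversized bulk requests
    raise_on_error=False,
):
    if not ok:
        print("rejected:", item)  # caller must handle retries itself

client.indices.put_settings(
    index="my-index", body={"index": {"refresh_interval": "1s"}}
)
```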

Describe the solution you’d like
The _bulk configuration burden and workflow should be relocated from the user and handled by the server. The user experience should switch to an anxiety-free API that enables users to send a “stream” of index requests that the server load-balances through a Streaming Index mechanism.

This Streaming Index API mechanism should also handle the “durability” responsibility based on a user-defined Durability Policy that determines the following (a hypothetical policy sketch follows the list):

  1. What operations to persist in the TransLog (if any)
  2. What type of remote storage to use (e.g., long term vs short term)
  3. What documents / segments to replicate
  4. Where segments should be replicated
  5. Level of consistency (e.g., how often to ack)
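As a thought experiment, a durability policy covering the five controls above might look something like the following; every field name here is hypothetical and not part of any existing OpenSearch API:

```python
# Hypothetical durability policy -- none of these fields exist today.
# One possible shape for the five controls listed in the RFC.
durability_policy = {
    "translog": {"persist": "index_ops_only"},   # 1. which ops hit the translog
    "remote_store": {"tier": "short_term"},      # 2. remote storage type
    "replication": {
        "unit": "segments",                      # 3. documents vs. segments
        "targets": ["zone-a", "zone-b"],         # 4. where segments go
    },
    "consistency": {"ack": "on_translog_write"}, # 5. when to ack the client
}
```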

Describe alternatives you’ve considered
Continue with durability as it is today, with a document replication model.

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 4
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

2 reactions
nknize commented, Jul 5, 2022

> In the streaming mode, we would want to combine the action, metadata and the optional source into a single structure… However, it would add a breaking change to the bulk API and the alternative would be to introduce a new API.

I think the streaming index API should be a new API. Like segment replication, it should start as experimental behind a feature flag so we can benchmark default parameters and API semantics before promoting it to a first-class bulk ingest mechanism. As you touch on in the durability levels, we’re exploring durability under different configurations and looking at introducing new durability controls. For example, segrep w/o remote store needs the local translog to provide durability, and once operations are durable in the translog we can ack to the client; segrep w/ remote store will ack after a commit. But like UDP, a user may not be so concerned about durability and won’t care if an operation is lost, in which case no ack is necessary.
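To make that concrete, here is a hypothetical sketch of the combined streaming request from the quote and the three ack levels described above; every name is illustrative, and none of this exists in OpenSearch today:

```python
from enum import Enum

# Hypothetical ack levels distilled from the comment above.
class AckPolicy(Enum):
    NONE = "none"          # UDP-style fire-and-forget: no ack, loss tolerated
    TRANSLOG = "translog"  # segrep w/o remote store: ack once the op is
                           # durable in the local translog
    COMMIT = "commit"      # segrep w/ remote store: ack after a commit

# Hypothetical combined streaming request: action, metadata, and the
# optional source folded into one structure, instead of the bulk API's
# separate action-line / source-line pairs.
stream_request = {
    "action": "index",
    "index": "my-index",
    "id": "1",
    "source": {"field": "value"},
    "ack": AckPolicy.TRANSLOG.value,
}
```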

> Refresh policy: In the streaming mode, we may want to reconsider whether to provide this control to the client.

This defaults to false in the current bulk API, effectively decoupling refresh from bulk indexing. The high-penalty true value was originally introduced for cases where users wanted documents available for search immediately after each operation (e.g., some security use cases), and wait_for was intended to strike a balance. I think we’ll want to retain this control, but introducing streaming index as a separate API allows us to explore its necessity as we evolve segment replication.
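For reference, this is what the existing control looks like against the current bulk API (index name and document are placeholders):

```python
# The existing refresh control on _bulk: false (default), true, or "wait_for".
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

actions = [
    {"index": {"_index": "my-index", "_id": "1"}},
    {"field": "value"},
]

# refresh=false (default): cheapest; docs become visible at the next refresh.
# refresh=true: force a refresh; docs searchable immediately, high penalty.
# refresh="wait_for": block until the next scheduled refresh makes docs visible.
client.bulk(body=actions, refresh="wait_for")
```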

1 reaction
Bukhtawar commented, Oct 18, 2022

I concur with the idea of having a separate API to revisit our freshness and durability semantics and pack in optimizations as needed.

I guess the network infrastructure/firewall would potentially limit how long the connection can stay open. This should also factor in inevitable cases where connections have to be forcibly closed, like server maintenance.

Do we also plan on supporting a client library that ensures a persistent (keep-alive) connection, closes the connection at end of stream, provides a backup buffering mechanism if the server isn’t able to process as fast, closes the connection if the buffer hits a certain limit, and reconnects on connection drops? The server could apply back-pressure if it isn’t able to process the stream fast enough or sees resources too close to being exhausted. A sketch of such a client loop follows.
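A minimal sketch of that client-side behavior, assuming a hypothetical send_batch transport; nothing here is an existing OpenSearch client API:

```python
import queue
import time

# Hypothetical streaming-client sketch: persistent-connection semantics,
# bounded backup buffer, and reconnect-on-drop. send_batch() stands in
# for a transport that does not exist yet; everything here is illustrative.
BUFFER_LIMIT = 10_000
BATCH_SIZE = 500

def send_batch(batch):
    """Placeholder for the hypothetical streaming transport."""
    raise NotImplementedError

def send_with_reconnect(batch):
    # Reconnect on connection drops with a simple fixed backoff.
    while True:
        try:
            send_batch(batch)
            return
        except ConnectionError:
            time.sleep(1)

class StreamingIndexClient:
    def __init__(self):
        # Bounded backup buffer: absorbs bursts while the server is slow.
        self.buffer = queue.Queue(maxsize=BUFFER_LIMIT)

    def submit(self, doc):
        try:
            self.buffer.put_nowait(doc)
        except queue.Full:
            # Buffer limit hit: refuse input rather than grow unbounded.
            raise RuntimeError("stream buffer full; producer must slow down")

    def close(self):
        self.buffer.put(None)  # end-of-stream sentinel

    def run(self):
        batch = []
        while True:
            doc = self.buffer.get()
            if doc is None:               # end of stream: flush and close
                if batch:
                    send_with_reconnect(batch)
                return
            batch.append(doc)
            if len(batch) >= BATCH_SIZE:
                send_with_reconnect(batch)
                batch = []
```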

I would also think about how we could stream directly to the data node instead of having the coordinator node split the requests in between. This would probably require us to vend clients with some intelligence about routing?

@itiyama I think we could have the coordinator split the streams for parallel processing and fan them out to the respective shards as needed, or even consider having a single stream always write to a single shard if splitting them carries too much overhead. (The routing sketch below shows what a routing-aware client would need to compute.)
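For context, the default document routing that such a routing-aware client would have to reproduce looks roughly like this. This is simplified: the real implementation also accounts for routing_partition_size and shard splitting, and the mmh3 package here stands in for OpenSearch’s Murmur3 routing hash:

```python
import mmh3  # MurmurHash3; stand-in for OpenSearch's Murmur3HashFunction

# Simplified sketch of default document routing: hash the routing key
# (which defaults to the document _id) modulo the primary shard count.
def target_shard(doc_id: str, num_primary_shards: int) -> int:
    routing = doc_id  # _routing defaults to _id
    return mmh3.hash(routing, signed=False) % num_primary_shards
```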
