[RFC] Streaming Index API
Is your feature request related to a problem? Please describe.
The current _bulk indexing API places a high configuration burden on users today to avoid RejectedExecutionException due to TOO_MANY_REQUESTS. This forces users to "experiment" with bulk block sizes, multi-threading, refresh intervals, etc.
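To make that burden concrete, here is a minimal sketch of what a user typically hand-rolls today against the existing _bulk endpoint, assuming a cluster at localhost:9200 and an index named logs (both placeholders). Batch size, retry/backoff on HTTP 429 (TOO_MANY_REQUESTS), and the refresh setting are all left to the user:

```java
// Minimal sketch of today's client-side tuning burden with _bulk.
// Cluster URL and index name are placeholders; the batch size, the backoff
// schedule, and the refresh setting are all chosen by the user, not the server.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BulkSketch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // The caller picks the batch size; too large risks rejections, too small wastes round trips.
    static void sendBatch(List<String> docs) throws Exception {
        StringBuilder body = new StringBuilder();
        for (String doc : docs) {
            body.append("{\"index\":{}}\n").append(doc).append("\n"); // NDJSON action/source pairs
        }
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/logs/_bulk?refresh=false"))
            .header("Content-Type", "application/x-ndjson")
            .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
            .build();

        // Retry with backoff on 429 (TOO_MANY_REQUESTS) is also left to the user today.
        for (int attempt = 0; attempt < 5; attempt++) {
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                return; // success or a non-retriable error; real code would inspect per-item results
            }
            Thread.sleep((1L << attempt) * 100); // exponential backoff picked by guesswork
        }
        throw new IllegalStateException("bulk rejected after retries");
    }
}
```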
Describe the solution you'd like
The _bulk configuration burden and workflow should be moved off the user and handled by the server. The user experience should shift to an anxiety-free API that lets users send a "stream" of index requests which the server load balances through a Streaming Index mechanism.
The Streaming Index API should also handle the "durability" responsibility based on a user-defined Durability Policy (sketched after the list below) that determines the following:
- What operations to persist in the TransLog (if any)
- What type of remote storage to use (e.g., long term vs short term)
- What documents / segments to replicate
- Where segments should be replicated
- Level of consistency (e.g., how often to ack)
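To make the proposal concrete, here is one hypothetical shape such a policy and streaming surface could take, mirroring the dimensions listed above. Every type, field, and method name below is invented for illustration; nothing here is an existing OpenSearch API.

```java
// Hypothetical shape only: none of these types exist in OpenSearch today.
import java.time.Duration;

public final class StreamingSketch {

    enum TranslogPersistence { ALL_OPERATIONS, NONE }
    enum RemoteStorageTier { LONG_TERM, SHORT_TERM, NONE }
    enum ReplicationUnit { DOCUMENTS, SEGMENTS }
    enum AckLevel { NONE, PER_OPERATION, PER_TRANSLOG_SYNC, PER_COMMIT }

    // One policy object captures the durability dimensions from the list above.
    record DurabilityPolicy(
        TranslogPersistence translog,
        RemoteStorageTier remoteStorage,
        ReplicationUnit replicationUnit,
        String replicaPlacement,      // e.g. "same-zone" / "cross-zone" (placeholder)
        AckLevel ackLevel,
        Duration ackInterval
    ) {}

    // The user opens a stream and pushes documents; batching, sizing, and
    // load balancing become the server's responsibility.
    interface IndexStream extends AutoCloseable {
        void send(String jsonDocument);
        void flush();
        @Override void close();
    }

    interface StreamingIndexClient {
        IndexStream openStream(String index, DurabilityPolicy policy);
    }
}
```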
Describe alternatives you've considered
Continue w/ durability as it is today, w/ a document replication model.
Top GitHub Comments
I think the streaming index API should be a new API. Like segment replication, it should start as experimental behind a feature flag so we can benchmark default parameters and API semantics before promoting it to a first-class bulk ingest mechanism. As you touch on in the durability levels, we're exploring durability under different configurations and looking at introducing new durability controls. For example, segrep w/o remote store needs the local translog to provide durability; once operations are durable in the translog we can ack to the client. Segrep w/ remote store will ack after a commit. But, like UDP, a user may not be so concerned about durability and won't care if an operation is lost, in which case no ack is necessary.
The refresh parameter defaults to false in the current bulk API, effectively decoupling refresh from bulk indexing. The high-penalty true value was originally introduced for cases where users wanted documents available for search immediately after each operation (e.g., some security use cases), and wait_for was intended to strike a balance. I think we'll want to retain this control, but introducing streaming index as a separate API allows us to explore whether it is still necessary as we evolve segment replication.
I concur with the thought of having a separate API to revisit our freshness and durability semantics and pack in optimizations as needed.
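Read literally, this suggests a simple mapping from replication configuration to the point at which the server acks back to the client. The sketch below only illustrates that mapping as described in the comment (segrep w/o remote store acks once the local translog is durable, segrep w/ remote store acks after a commit, a fire-and-forget stream acks not at all); the types are hypothetical.

```java
// Hypothetical types; not an existing OpenSearch API.
public class AckSemanticsSketch {

    enum ReplicationMode { SEGMENT, SEGMENT_WITH_REMOTE_STORE }
    enum AckPoint { NONE, ON_TRANSLOG_DURABLE, ON_COMMIT }

    // Mirrors the comment above: segrep without a remote store can ack once the
    // local translog is durable; segrep with a remote store acks after a commit;
    // a fire-and-forget (UDP-like) stream may skip acks entirely.
    static AckPoint ackPointFor(ReplicationMode mode, boolean fireAndForget) {
        if (fireAndForget) {
            return AckPoint.NONE;
        }
        return switch (mode) {
            case SEGMENT -> AckPoint.ON_TRANSLOG_DURABLE;
            case SEGMENT_WITH_REMOTE_STORE -> AckPoint.ON_COMMIT;
        };
    }
}
```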
I guess the network infrastructure/firewall would potentially limit how long the connection can stay open. This should also factor in the inevitable cases where connections have to be forcibly closed, such as server maintenance.
Do we also plan on supporting a client library that ensures a persistent connection (keep-alive), closes the connection at the end of the stream, buffers as a backup if the server can't process as fast, closes the connection if that buffer hits a certain limit, and reconnects on connection drops? The server could apply back-pressure if it isn't able to process the stream fast enough or sees resources getting too close to exhaustion.
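One way to picture the client-side half of this is a bounded buffer in front of the stream: accept documents while there is room, signal back-pressure (or close) when the limit is hit, and let a sender loop drain onto the wire. The sketch below is purely illustrative; no such client library exists today, and the actual transport and reconnect logic are left out.

```java
// Hypothetical client-side buffering sketch: a bounded queue absorbs bursts,
// offer() signals back-pressure when the limit is hit, and a sender loop
// drains documents onto the wire. All names are invented.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class BufferedStreamSketch {
    private final BlockingQueue<String> buffer;
    private volatile boolean closed = false;

    BufferedStreamSketch(int maxBufferedDocs) {
        this.buffer = new ArrayBlockingQueue<>(maxBufferedDocs);
    }

    // Returns false when the buffer is full, telling the caller to back off
    // (or, per the comment above, to close the connection at a hard limit).
    boolean offer(String jsonDocument) {
        return !closed && buffer.offer(jsonDocument);
    }

    // Marks the end of the stream; the drain loop exits once the buffer empties.
    void close() {
        closed = true;
    }

    // "send" stands in for whatever writes to the open connection; on a drop,
    // a real client would re-open the stream and resume from unsent documents.
    void drainLoop(Consumer<String> send) throws InterruptedException {
        while (!closed || !buffer.isEmpty()) {
            String doc = buffer.poll(1, TimeUnit.SECONDS); // timeout keeps the loop responsive to close()
            if (doc != null) {
                send.accept(doc);
            }
        }
    }
}
```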
@itiyama I think we could have the coordinator split the stream for parallel processing and fan it out to the respective shards as needed, or even consider having a single stream always write to a single shard if splitting introduces too much overhead.
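A hypothetical sketch of that coordinator-side fan-out: split one incoming stream into per-shard groups so each shard-level sub-stream can be written independently. The hashCode-based routing here is a simplification of real shard routing (which hashes the routing value modulo the shard count).

```java
// Hypothetical coordinator-side fan-out; a simplification for illustration only.
import java.util.ArrayList;
import java.util.List;

public class FanOutSketch {

    // Route each document to a shard-local group; each group would then be
    // flushed to its shard independently, so a slow shard does not stall the
    // whole stream.
    static List<List<String>> splitByShard(List<String> docIds, List<String> docs, int shardCount) {
        List<List<String>> perShard = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            perShard.add(new ArrayList<>());
        }
        for (int i = 0; i < docs.size(); i++) {
            int shard = Math.floorMod(docIds.get(i).hashCode(), shardCount); // simplified routing
            perShard.get(shard).add(docs.get(i));
        }
        return perShard;
    }
}
```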