[RFC] Streaming Index API
Is your feature request related to a problem? Please describe.
The current _bulk indexing API places a high configuration burden on users today to avoid RejectedExecutionException due to TOO_MANY_REQUESTS. This forces users to "experiment" with bulk block sizes, multi-threading, refresh intervals, etc.
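To make that burden concrete, here is a minimal sketch of what a user typically hand-rolls today against the existing _bulk endpoint, assuming a cluster at localhost:9200 and an index named logs (both placeholders). Batch size, retry/backoff on HTTP 429 (TOO_MANY_REQUESTS), and the refresh setting are all left to the user:

```java
// Minimal sketch of today's client-side tuning burden with _bulk.
// Cluster URL and index name are placeholders; the batch size, the backoff
// schedule, and the refresh setting are all chosen by the user, not the server.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BulkSketch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // The caller picks the batch size; too large risks rejections, too small wastes round trips.
    static void sendBatch(List<String> docs) throws Exception {
        StringBuilder body = new StringBuilder();
        for (String doc : docs) {
            body.append("{\"index\":{}}\n").append(doc).append("\n"); // NDJSON action/source pairs
        }
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/logs/_bulk?refresh=false"))
            .header("Content-Type", "application/x-ndjson")
            .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
            .build();

        // Retry with backoff on 429 (TOO_MANY_REQUESTS) is also left to the user today.
        for (int attempt = 0; attempt < 5; attempt++) {
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                return; // success or a non-retriable error; real code would inspect per-item results
            }
            Thread.sleep((1L << attempt) * 100); // exponential backoff picked by guesswork
        }
        throw new IllegalStateException("bulk rejected after retries");
    }
}
```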
Describe the solution you'd like
The _bulk configuration burden and workflow should be moved off the user and handled by the server. The user experience should shift to an anxiety-free API that lets users send a "stream" of index requests which the server load balances through a Streaming Index mechanism.
The Streaming Index API should also handle the "durability" responsibility based on a user-defined Durability Policy (sketched after the list below) that determines the following:
- What operations to persist in the TransLog (if any)
- What type of remote storage to use (e.g., long term vs short term)
- What documents / segments to replicate
- Where segments should be replicated
- Level of consistency (e.g., how often to ack)
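To make the proposal concrete, here is one hypothetical shape such a policy and streaming surface could take, mirroring the dimensions listed above. Every type, field, and method name below is invented for illustration; nothing here is an existing OpenSearch API.

```java
// Hypothetical shape only: none of these types exist in OpenSearch today.
import java.time.Duration;

public final class StreamingSketch {

    enum TranslogPersistence { ALL_OPERATIONS, NONE }
    enum RemoteStorageTier { LONG_TERM, SHORT_TERM, NONE }
    enum ReplicationUnit { DOCUMENTS, SEGMENTS }
    enum AckLevel { NONE, PER_OPERATION, PER_TRANSLOG_SYNC, PER_COMMIT }

    // One policy object captures the durability dimensions from the list above.
    record DurabilityPolicy(
        TranslogPersistence translog,
        RemoteStorageTier remoteStorage,
        ReplicationUnit replicationUnit,
        String replicaPlacement,      // e.g. "same-zone" / "cross-zone" (placeholder)
        AckLevel ackLevel,
        Duration ackInterval
    ) {}

    // The user opens a stream and pushes documents; batching, sizing, and
    // load balancing become the server's responsibility.
    interface IndexStream extends AutoCloseable {
        void send(String jsonDocument);
        void flush();
        @Override void close();
    }

    interface StreamingIndexClient {
        IndexStream openStream(String index, DurabilityPolicy policy);
    }
}
```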
Describe alternatives you've considered
Continue w/ durability as it is today, w/ a document replication model.
Top GitHub Comments
I think the streaming index API should be a new API. Like segment replication, it should start as experimental behind a feature flag so we can benchmark default parameters and API semantics before promoting it to a first-class bulk ingest mechanism. As you touch on in the durability levels, we're exploring durability under different configurations and looking at introducing new durability controls. For example, segrep w/o remote store needs the local translog to provide durability; once operations are durable in the translog we can ack to the client. Segrep w/ remote store will ack after a commit. But, like UDP, a user may not be so concerned about durability and won't care if an operation is lost, in which case no ack is necessary.
The refresh parameter defaults to false in the current bulk API, effectively decoupling refresh from bulk indexing. The high-penalty true value was originally introduced for cases where users wanted documents available for search immediately after each operation (e.g., some security use cases), and wait_for was intended to strike a balance. I think we'll want to retain this control, but introducing streaming index as a separate API allows us to explore whether it is still necessary as we evolve segment replication.
I concur with the thought of having a separate API to revisit our freshness and durability semantics and pack in optimizations as needed.
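Read literally, this suggests a simple mapping from replication configuration to the point at which the server acks back to the client. The sketch below only illustrates that mapping as described in the comment (segrep w/o remote store acks once the local translog is durable, segrep w/ remote store acks after a commit, a fire-and-forget stream acks not at all); the types are hypothetical.

```java
// Hypothetical types; not an existing OpenSearch API.
public class AckSemanticsSketch {

    enum ReplicationMode { SEGMENT, SEGMENT_WITH_REMOTE_STORE }
    enum AckPoint { NONE, ON_TRANSLOG_DURABLE, ON_COMMIT }

    // Mirrors the comment above: segrep without a remote store can ack once the
    // local translog is durable; segrep with a remote store acks after a commit;
    // a fire-and-forget (UDP-like) stream may skip acks entirely.
    static AckPoint ackPointFor(ReplicationMode mode, boolean fireAndForget) {
        if (fireAndForget) {
            return AckPoint.NONE;
        }
        return switch (mode) {
            case SEGMENT -> AckPoint.ON_TRANSLOG_DURABLE;
            case SEGMENT_WITH_REMOTE_STORE -> AckPoint.ON_COMMIT;
        };
    }
}
```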
I guess the network infrastructure/firewall would potentially limit how long the connection can stay open. This should also factor in the inevitable cases where connections have to be forcibly closed, such as server maintenance.
Do we also plan on supporting a client library that ensures a persistent connection (keep-alive), closes the connection at the end of the stream, buffers as a backup if the server can't process as fast, closes the connection if that buffer hits a certain limit, and reconnects on connection drops? The server could apply back-pressure if it isn't able to process the stream fast enough or sees resources getting too close to exhaustion.
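One way to picture the client-side half of this is a bounded buffer in front of the stream: accept documents while there is room, signal back-pressure (or close) when the limit is hit, and let a sender loop drain onto the wire. The sketch below is purely illustrative; no such client library exists today, and the actual transport and reconnect logic are left out.

```java
// Hypothetical client-side buffering sketch: a bounded queue absorbs bursts,
// offer() signals back-pressure when the limit is hit, and a sender loop
// drains documents onto the wire. All names are invented.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class BufferedStreamSketch {
    private final BlockingQueue<String> buffer;
    private volatile boolean closed = false;

    BufferedStreamSketch(int maxBufferedDocs) {
        this.buffer = new ArrayBlockingQueue<>(maxBufferedDocs);
    }

    // Returns false when the buffer is full, telling the caller to back off
    // (or, per the comment above, to close the connection at a hard limit).
    boolean offer(String jsonDocument) {
        return !closed && buffer.offer(jsonDocument);
    }

    // Marks the end of the stream; the drain loop exits once the buffer empties.
    void close() {
        closed = true;
    }

    // "send" stands in for whatever writes to the open connection; on a drop,
    // a real client would re-open the stream and resume from unsent documents.
    void drainLoop(Consumer<String> send) throws InterruptedException {
        while (!closed || !buffer.isEmpty()) {
            String doc = buffer.poll(1, TimeUnit.SECONDS); // timeout keeps the loop responsive to close()
            if (doc != null) {
                send.accept(doc);
            }
        }
    }
}
```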
@itiyama I think we could have the coordinator split the stream for parallel processing and fan it out to the respective shards as needed, or even consider having a single stream always write to a single shard if splitting introduces too much overhead.
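A hypothetical sketch of that coordinator-side fan-out: split one incoming stream into per-shard groups so each shard-level sub-stream can be written independently. The hashCode-based routing here is a simplification of real shard routing (which hashes the routing value modulo the shard count).

```java
// Hypothetical coordinator-side fan-out; a simplification for illustration only.
import java.util.ArrayList;
import java.util.List;

public class FanOutSketch {

    // Route each document to a shard-local group; each group would then be
    // flushed to its shard independently, so a slow shard does not stall the
    // whole stream.
    static List<List<String>> splitByShard(List<String> docIds, List<String> docs, int shardCount) {
        List<List<String>> perShard = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            perShard.add(new ArrayList<>());
        }
        for (int i = 0; i < docs.size(); i++) {
            int shard = Math.floorMod(docIds.get(i).hashCode(), shardCount); // simplified routing
            perShard.get(shard).add(docs.get(i));
        }
        return perShard;
    }
}
```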