Promote druid-kubernetes-extensions out of experimental status

Currently, druid-kubernetes-extensions is in experimental status. The documentation at https://druid.apache.org/docs/latest/development/extensions-core/kubernetes.html says:

Consider this an EXPERIMENTAL feature mostly because it has not been tested yet on a wide variety of long running Druid clusters.

The functionality is quite useful, since it allows people to run Druid on k8s without reliance on ZooKeeper. So, we’d like to promote it out of experimental status. To do that, we need:

Robust experience in production scenarios.
Volunteers to maintain the extension.

Let’s use this issue as a place people can chime in about this stuff.

Notes on testing. I checked and found:

Unit tests with about 55% coverage. The uncovered code is mostly the prod implementations of certain interfaces where we have test-fixture implementations in unit tests. So, the coverage is about as good as it can be. The prod implementations interface directly with k8s, so they need to be tested in integration tests.
An integration test, “(Compile=openjdk8, Run=openjdk8, Cluster Build On K8s) ITNestedQueryPushDownTest integration test” added in #10669 by @zhangyue19921010 (thank you 🙌). It runs one test case, ITNestedQueryPushDownTest, which exercises aspects of ingestion and query.

I didn’t see integration tests for cases like servers going online and offline, or for leader failover. That’d be a great direction in which to extend the tests. Note that we have a project going on right now to create a simpler and easier-to-use integration test framework, in #12359. It may be prudent to implement new tests on top of that new framework when it’s available.

Issue Analytics

Created a year ago
Reactions: 1
Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions

gianm commented, Aug 31, 2022

Slack thread mentioning an issue: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1661944109893699

Hello, we’re trying to start using integrated K8S controller (no ZK) with k8s 1.24.3. Our middlemanagers are dying after some time, it seems all is due to this sequence of events (logs from one of the middlemanagers):

2022-08-31T10:53:17,716 ERROR [[index_kafka_netflows_fc3a72329ded59f_bebhjghl]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Incremental persist failed: {class=org.apache.druid.segment.realtime.appenderator.StreamAppenderator, segment=netflows_2022-08-31T08:00:00.000Z_2022-08-31T09:00:00.000Z_2022-08-31T08:47:41.323Z_247, dataSource=netflows, count=12}
2022-08-31T10:53:17,718 INFO [task-runner-0-priority-0] org.apache.druid.k8s.discovery.K8sDruidNodeAnnouncer - Unannouncing DiscoveryDruidNode[DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middleManager', host='10.2.28.27', bindOnHost=false, port=-1, plaintextPort=8105, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeRole='PEON', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=3900000000000, serverType=indexer-executor, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}]
2022-08-31T10:53:17,801 WARN [task-runner-0-priority-0] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 1,079ms.
org.apache.druid.java.util.common.RE: Failed to patch pod[default/druid-druid-cluster-middlemanagers-0], code[422], error[{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "the server rejected our request due to an error in our request",
  "reason": "Invalid",
  "details": {},
  "code": 422
}]
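
When triaging logs like the one above, it can help to pull the embedded Kubernetes Status object out of the exception text and look at its fields directly. The following is a minimal Python sketch, not part of Druid or the extension; the function name extract_k8s_status and the SAMPLE_LOG constant are hypothetical, and the regular expression assumes the "error[{...}]" layout shown in this log. Here the status carries code 422 and reason "Invalid" with an empty "details" object, so the response by itself does not say which part of the patch was rejected.

```python
import json
import re
from typing import Optional

# Sample copied from the log excerpt above (the exception line plus its JSON body).
SAMPLE_LOG = """org.apache.druid.java.util.common.RE: Failed to patch pod[default/druid-druid-cluster-middlemanagers-0], code[422], error[{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "the server rejected our request due to an error in our request",
  "reason": "Invalid",
  "details": {},
  "code": 422
}]"""


def extract_k8s_status(log_text: str) -> Optional[dict]:
    """Return the Status object found inside an 'error[{...}]' fragment, or None.

    Assumes the log layout shown above; other layouts will simply yield None.
    """
    match = re.search(r"error\[(\{.*\})\]", log_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        status = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Only accept objects that identify themselves as a Kubernetes Status.
    return status if status.get("kind") == "Status" else None


if __name__ == "__main__":
    status = extract_k8s_status(SAMPLE_LOG)
    if status is not None:
        print(status["code"], status["reason"], status["details"])
```

An empty "details" value, as in this case, suggests that the next place to look is the patch request the node sent rather than the response.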

1 reaction

wiegandf commented, Sep 2, 2022

Same issue on k8s 1.23.8
