Promote druid-kubernetes-extensions out of experimental status

Currently, druid-kubernetes-extensions is in experimental status. From https://druid.apache.org/docs/latest/development/extensions-core/kubernetes.html:

Consider this an EXPERIMENTAL feature mostly because it has not been tested yet on a wide variety of long running Druid clusters.

The functionality is quite useful, since it allows people to run Druid on k8s without reliance on ZooKeeper. So, we’d like to promote it out of experimental status. To do that, we need:

  1. Robust experience in production scenarios.
  2. Volunteers to maintain the extension.

Let’s use this issue as a place people can chime in about this stuff.


Notes on testing. I checked and found:

  • Unit tests with about 55% coverage. The uncovered code is mostly the prod implementations of certain interfaces where we have test-fixture implementations in unit tests. So, the coverage is about as good as it can be. The prod implementations interface directly with k8s, so they need to be tested in integration tests.
  • An integration test, “(Compile=openjdk8, Run=openjdk8, Cluster Build On K8s) ITNestedQueryPushDownTest integration test” added in #10669 by @zhangyue19921010 (thank you 🙌). It runs one test case, ITNestedQueryPushDownTest, which exercises aspects of ingestion and query.

I didn’t see integration tests for cases like servers going online and offline, or for leader failover. That’d be a great direction to extend the tests in. Note that we have a project going on right now to create a simpler and easier-to-use integration test framework, in #12359. It may be prudent to implement new tests on top of that new framework when it’s available.

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
gianm commented, Aug 31, 2022

Slack thread mentioning an issue: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1661944109893699

Hello, we’re trying to start using integrated K8S controller (no ZK) with k8s 1.24.3. Our middlemanagers are dying after some time, it seems all is due to this sequence of events (logs from one of the middlemanagers):

2022-08-31T10:53:17,716 ERROR [[index_kafka_netflows_fc3a72329ded59f_bebhjghl]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Incremental persist failed: {class=org.apache.druid.segment.realtime.appenderator.StreamAppenderator, segment=netflows_2022-08-31T08:00:00.000Z_2022-08-31T09:00:00.000Z_2022-08-31T08:47:41.323Z_247, dataSource=netflows, count=12}
2022-08-31T10:53:17,718 INFO [task-runner-0-priority-0] org.apache.druid.k8s.discovery.K8sDruidNodeAnnouncer - Unannouncing DiscoveryDruidNode[DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middleManager', host='10.2.28.27', bindOnHost=false, port=-1, plaintextPort=8105, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeRole='PEON', services={dataNodeService=DataNodeService{tier='_default_tier', maxSize=3900000000000, serverType=indexer-executor, priority=0}, lookupNodeService=LookupNodeService{lookupTier='__default'}}}]
2022-08-31T10:53:17,801 WARN [task-runner-0-priority-0] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 1,079ms.
org.apache.druid.java.util.common.RE: Failed to patch pod[default/druid-druid-cluster-middlemanagers-0], code[422], error[{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "the server rejected our request due to an error in our request",
  "reason": "Invalid",
  "details": {},
  "code": 422
}]
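
For triage, the "Failed to patch pod[...]" message in the log above can be pulled apart programmatically. The following is a minimal Python sketch, not part of Druid or the Kubernetes client: the function name `parse_patch_failures` and the keys of the returned dictionaries are hypothetical, and it assumes the message format exactly as it appears in this log (a `namespace/name` pod reference, an HTTP code, and a JSON Status body).

```python
import json
import re

# Matches the prefix of the message as it appears in the log above:
#   Failed to patch pod[<namespace>/<name>], code[<http code>], error[<JSON body>]
# The JSON body itself is decoded separately so nested braces are handled.
PATCH_FAILURE_PREFIX = re.compile(
    r"Failed to patch pod\[(?P<pod>[^\]]+)\], code\[(?P<code>\d+)\], error\["
)


def parse_patch_failures(log_text):
    """Return one dict per 'Failed to patch pod' message found in log_text."""
    decoder = json.JSONDecoder()
    failures = []
    for match in PATCH_FAILURE_PREFIX.finditer(log_text):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (here, the closing "]").
            status, _end = decoder.raw_decode(log_text, match.end())
        except json.JSONDecodeError:
            status = {}  # body missing or truncated; keep the header fields
        namespace, _sep, name = match.group("pod").partition("/")
        failures.append(
            {
                "namespace": namespace,
                "pod": name,
                "http_code": int(match.group("code")),
                "reason": status.get("reason"),
                "message": status.get("message"),
            }
        )
    return failures


# The error text from the log above, used here as sample input.
SAMPLE_LOG = """org.apache.druid.java.util.common.RE: Failed to patch pod[default/druid-druid-cluster-middlemanagers-0], code[422], error[{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "the server rejected our request due to an error in our request",
  "reason": "Invalid",
  "details": {},
  "code": 422
}]"""

for failure in parse_patch_failures(SAMPLE_LOG):
    print(failure)
```

Grouping the parsed entries by pod and reason is one way to check whether every middlemanager fails with the same 422 "Invalid" response or only some of them do.
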
1 reaction
wiegandf commented, Sep 2, 2022

Same issue on k8s 1.23.8
