question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[RFC] Support multiple endpoints for load balancing and failover

See original GitHub issue

Background

ClickHouse supports clustering and often runs in a cluster. Typical approaches for load balancing and failover are:

  • distributed table
  • proxy/gateway(e.g. chproxy, traefik etc.) sitting between client and server
  • or modern DNS like consul

Besides, message queues(federated or not) and customization on client-side may require to further optimize reads and writes across servers.

It would be really nice to enhance both Java client and JDBC driver to support multiple endpoints, which provides more options for people to choose. Yes, we have BalancedClickhouseDataSource in JDBC driver, and maybe ClickHouseCluster in Java client, but unfortunately none of them is good enough and the latter was not even tested on a clustered environment.

Concept

  • Endpoint - essentially a combination of host, port and protocol. It may contain additional information like status, role, tags, weight, server revision/version/timezone, and credentials for authentication. (localhost, 8123, http) and (127.0.0.1, 8123, http) are different endpoints.

  • Server - an instance of ClickHouse server, which exposes multiple endpoints for client to connect to. clickhouse-local is a special type of server, which only exists when needed.

  • Cluster - a group of ClickHouse servers under same name, as described in system.clusters table.

User Scenario

# Operation Target Remark
1.1 Connect an endpoint memorize which exact server was connected to, based on X-ClickHouse-Server in response
1.2 Connect multiple endpoints either one of them based on load balancing policy, or multiple based on property matching(e.g. tags/status/distance)
1.3 Connect a server either one of discovered endpoints, or multiple
1.4 Connect a cluster either one of discovered endpoints, or multiple
2.1 Read an endpoint memorize which exact server was queried against, based on X-ClickHouse-Server in response
2.2 Read multiple endpoints either one of them based on load balancing policy, or multiple based on property matching and the query(prefer to query multiple nodes in parallel)
2.3 Read a server one of discovered endpoints, perhaps the best protocol available
2.4 Read a cluster either one of discovered endpoints, or multiple with consideration of sharding key and replicas
3.1 Write an endpoint memorize which exact server was queried against, based on X-ClickHouse-Server in response
3.2 Write multiple endpoints either one of them based on load balancing policy, or multiple based on property matching and the target table(prefer to write into local table)
3.3 Write a server one of discovered endpoints, perhaps the best protocol available
3.4 Write a cluster either one of discovered endpoints, or multiple with consideration of sharding key and replicas

Proposed Solution

In order to support above user scenarios, we need to implement below features:

# Feature Remark
1 Auto discovery The ability to automatically discover: 1) exposed ports of a ClickHouse server; 2) servers of a cluster; 3) connections between configured endpoint and actually connected server(X-ClickHouse-Server in response)
2 Load balancing Pick endpoint from a sorted and filtered list using one of two policies: 1) first alive; and 2) round robin
3 Failover Retry only when there’s network issue(happens before sending query) or failure of select query
4 Protocol detection Probe given host + port to figure out the protocol - example
5 Health check Besides liveness detection like ping, we need to understand “distance” between client and each endpoint
6 Parallel query Query partitioned data stored on multiple servers and return consolidated response(unordered) to the caller
7 Dual write Write same or distributed data into multiple endpoints. Might need to introduce configurable client-side consistency-level

Potential changes to existing code:

  • ClickHouseCluster - represents a cluster, which contains list of endpoints. Needs to rewrite the class accordingly and add background thread for auto-discovery.
  • ClickHouseEndpoint - represents an endpoint. Replacement of ClickHouseNode.
  • ClickHouseEndpoints - represents list of endpoints. It contains methods to add ClickHouseEndpoint and ClickHouseCluster, and a background thread for health check.
  • LoadBalancingPolicy - load balancing policy can be used in ClickHouseEndpoints for picking an endpoint from a sorted(based on status, weight and maybe distance) and filtered(based on tags etc.) list
  • JDBC URL - jdbc:ch://server1,(grpc://server2?tags=dc2),(tcp://server3:9000),(http://user4:passwd@server4:8124/db4?tags=dc1)/db?cluster=mycluster&tags=dc1&lbPolicy=firstAlive&autoDiscovery=true
  • JDBC Connection Properties
    Property Description Example
    cluster name of the cluster cluster=mycluster
    tags comma separated tags for categorization tags=dc1,readonly,r1s1
    lbPolicy either firstAlive(default) or roundRobin lbPolicy=roundRobin
    failover either true or false(default), only retry on network issue or failure of select query failover=true
    autoDiscovery either true or false(default) autoDiscovery=true
    healthCheck either true(default) or false healthCheck=false

Issue Analytics

  • State:open
  • Created a year ago
  • Reactions:3
  • Comments:10

github_iconTop GitHub Comments

1reaction
zhicwucommented, Jun 3, 2022

Any update @dynaxis? Just trying to avoid duplicated efforts here. I’m thinking to add an alternative of BalancedClickhouseDataSource in the coming weekend. Of course, it’s not going to be a complete implementation of all above, but just bare-bone with single load-balancing policy and no auto-discovery and fail-over.

A few things to start with:

  • 1. the ability to probe secure and insecure ports(to figure out protocol) - feature 4
  • 2. new classes: ClickHouseEndpoint and ClickHouseEndpoints(or ClickHouseNode, ClickHouseNodes instead for minimum changes)
  • 3. ClickHouseEndpoint can carry ClickHouseConfig like timeout and buffer size etc., which will override the ones in ClickHouseClient when creating ClickHouseRequest
  • 4. the ability to connect to a URL in Java client
    // host=localhost, protocol=tcp, port=9440, ssl=true, sslmode=NONE
    client.connect(ClickHouseEndpoint.of("tcps://localhost/system")); // tcps is short for TCP secure
    // host=localhost, protocol=http, port=8123, ssl=false
    client.connect("http://localhost/system"); // port is NOT 80 as we're connecting to ClickHouse, not a web server
    // host=localhost, protocol=grpc, port=9100, ssl=false
    client.connect("localhost:9100/system"); // probe port 9100 to figure out protocol
    // host=localhost, protocol=http, port=8123, ssl=false, database=default
    client.connect("localhost"); // probe default port 8123 to figure out protocol, try more ports only when autoDiscovery is enabled
    
    // may cover below case later - cached ClickHouseEndpoints will be shared among clients
    // client.connect("localhost, https://localhost:443,grpc://localhost,tcp://localhost");
    
  • 5. the ability to define nested URLs in JDBC driver
    // the outer-most protocol(defaults to ANY) and parameters are defaults, which can be override by inner URL
    Connection conn = DriverManager.getConnection(
      "jdbc:ch://server1?connection_timeout=3000,grpc://server2,(tcp://server3:9000/default)/?connection_timeout=5000", "default", "default");
     
    // change default protocol to TCP(instead of ANY), so we'll probe port 9000 for all 3 servers  
    conn = DriverManager.getConnection("jdbc:ch:tcp://server1,server2,server3/?connection_timeout=5000", "default", "");
    
  • 6. background thread for health check when connecting to multiple endpoints - and maybe options to set check intervals and threads(defaults to single-thread)

Please feel free to submit PR or review/merge changes later.

0reactions
tisonkuncommented, Dec 7, 2022

@zhicwu I made a patch to upgrade the dependency for Pulsar ClickHouse Connector https://github.com/apache/pulsar/pull/18774.

I’ll appreciate if you can also check if the upgrade is correct and transparent to users.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Network Traffic Distribution – Elastic Load Balancing FAQs
Elastic Load Balancing automatically distributes incoming application traffic across multiple targets in one or more Availability Zones (AZs).
Read more >
RFC 3074: DHC Load Balancing Algorithm
Abstract This document proposes a method of algorithmic load balancing. It enables multiple, cooperating servers to decide which one should service a client ......
Read more >
draft-ietf-dhc-failover-12
Abstract DHCP [RFC 2131] allows for multiple servers to be operating on a single ... Support load balancing between the primary and secondary...
Read more >
External HTTP(S) Load Balancing overview - Google Cloud
In the Premium Network Service Tier, this load balancer offers multi-region load balancing, directing traffic to the closest healthy backend that has ...
Read more >
Load Balancing and Failover Routing
After the requests have been forwarded to all the services in the pool, the first service is selected for the next loop of...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found