etcdClient.watch leads to memory usage increasing all the time
Env
etcd server: v3.4.0
Below is one of the etcd cluster nodes, running as a Docker container:
b24afab33c18 quay.io/coreos/etcd:v3.4.0 "/usr/local/bin/etcd…" 15 hours ago Up 15 hours 0.0.0.0:2381->2381/tcp, 0.0.0.0:2482->2482/tcp, 2379-2380/tcp etcd2
etcd client: 0.4.1
compile ("io.etcd:jetcd-core:0.4.1")
Problem phenomenon
The watch/close API seems to have a memory leak, so I used the Scala code below to demonstrate it.
(I deployed an etcd cluster of three nodes; the 3rd node's memory kept increasing the whole time the code below was running, while the other nodes' memory stayed stable.)
import java.net.URI
import java.nio.charset.StandardCharsets

import scala.collection.JavaConverters._
import scala.collection.immutable.Queue

import io.etcd.jetcd.{ByteSequence, Client, Watch}
import io.etcd.jetcd.Watch.Watcher
import io.etcd.jetcd.options.WatchOption
import io.etcd.jetcd.watch.WatchResponse

object WatchLeakTest {

  // Placeholder event handler; the real application logic is irrelevant to the leak.
  def onKeyChange(res: WatchResponse): Unit = ()

  def main(args: Array[String]): Unit = {
    val hostAndPorts = "xxx.xxx.xxx.xxx:2379,xxx.xxx.xxx.xxx:2380,xxx.xxx.xxx.xxx:2381"
    val addresses: List[URI] = hostAndPorts
      .split(",")
      .toList
      .map(hp => {
        val host :: port :: Nil = hp.split(":").toList
        URI.create(s"http://$host:$port")
      })
    val client = Client.builder().endpoints(addresses.asJava).build()
    val watchClient = client.getWatchClient

    for (count1 <- 1 to 100) {
      var watcherQueue = Queue.empty[Watcher]
      for (count2 <- 1 to 5000) {
        val key = ByteSequence.from(s"namespace/$count1/$count2", StandardCharsets.UTF_8)
        val option = WatchOption
          .newBuilder()
          .withPrevKV(true)
          .withPrefix(key)
          .build()
        val watcher = watchClient.watch(key, option, Watch.listener((res: WatchResponse) => onKeyChange(res)))
        watcherQueue = watcherQueue.enqueue(watcher)
      }
      Thread.sleep(1000 * 10)
      // Close all watchers created in this iteration
      for (watcher <- watcherQueue) {
        watcher.close()
      }
    }
    client.close()
  }
}
The code above creates 5000 watchers per outer-loop iteration, sleeps 10 seconds, then closes those 5000 watchers; the outer loop runs 100 times, so 500,000 watchers are created and closed in total.
During testing, despite the watchers being closed as shown above, the memory was not released: docker stats showed the memory usage increasing the whole time, until eventually docker ps and docker stats could no longer be executed (it seems etcd2 crashed at that point). Checking with free -m confirmed that the memory had been used up.
During testing I also used pprof to check the memory:
go tool pprof "http://xxx.xxx.xxx.xxx:2381/debug/pprof/heap?debug=1&seconds=10"
and found that the memory usage of go.etcd.io/etcd/mvcc.(*watchableStore).NewWatchStream was increasing all the time.
So we can draw a preliminary conclusion here: the watch API leads to the memory leak; perhaps watcher.close() does not release the memory on the server.
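The suspicion boils down to the lifecycle below (a minimal sketch reusing the watchClient and imports from the reproduction above; the key name is made up for illustration):

// Minimal close-behavior check (sketch; same jetcd 0.4.1 API as above, key name is hypothetical)
val key = ByteSequence.from("namespace/leak-check", StandardCharsets.UTF_8)
val watcher = watchClient.watch(key, Watch.listener((_: WatchResponse) => ()))
Thread.sleep(1000)  // give the watch stream time to be established
watcher.close()     // expected: the server frees the corresponding watch stream
// Observed via pprof instead: allocations attributed to
// mvcc.(*watchableStore).NewWatchStream keep accumulating.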
PS: I also ran two other tests to confirm the problem from another angle:
- I removed all etcdClient.watch/close logic from our own application and tested it again, monitoring etcd's memory usage; the memory usage stayed stable.
- I tested again with the Go client API (code below); all three etcd nodes' memory usage stayed stable.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, _ := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	defer cli.Close()

	for j := 1; j <= 1000; j++ {
		var watchers []clientv3.Watcher
		// Create 5000 prefix watchers per iteration, mirroring the Scala test.
		for i := 1; i <= 5000; i++ {
			println("starting watcher: ", i)
			watcher := clientv3.NewWatcher(cli)
			key := fmt.Sprintf("foo-%d-%d", j, i)
			_ = watcher.Watch(context.Background(), key, clientv3.WithPrefix())
			watchers = append(watchers, watcher)
		}
		time.Sleep(10 * time.Second)
		// Close all watchers created in this iteration.
		for i, watcher := range watchers {
			println("closing watcher: ", i)
			watcher.Close()
		}
		println("done: ", j)
	}
}
etcd0's memory usage stayed stable (< 200M), etcd1's memory usage stayed stable (< 200M), and etcd2's memory usage also stayed stable (around 700M).
So obviously the etcd server itself and the Go client API are fine as well; the problem is in jetcd.
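Until the leak is fixed, one possible mitigation (a sketch of mine, not something verified against this issue) is to replace the 5000 per-key watchers with a single prefix watcher per iteration, since all keys in one iteration share the prefix namespace/<count1>/; this keeps the number of server-side watch streams small. It reuses the same jetcd 0.4.1 API, count1 variable, and onKeyChange handler from the Scala reproduction above:

// Sketch: one prefix watcher per iteration instead of 5000 per-key watchers (untested mitigation)
val prefix = ByteSequence.from(s"namespace/$count1/", StandardCharsets.UTF_8)
val option = WatchOption
  .newBuilder()
  .withPrevKV(true)
  .withPrefix(prefix)
  .build()
val watcher = watchClient.watch(prefix, option, Watch.listener((res: WatchResponse) => onKeyChange(res)))
// ... later, a single close per iteration:
watcher.close()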
Top GitHub Comments
This should eventually increase the memory on the client but not on the server side, right?
@lburgazzoli Could you take a look at this issue: https://github.com/etcd-io/jetcd/issues/659
Thank you~~