Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Health / liveness check for Worker

See original GitHub issue

Is your feature request related to a problem? Please describe. Since shifting to the standalone NestJS architecture for the worker in v1.0, we’ve lost the ability to use the HealthCheckRegistryService to monitor the health of worker processes.

There is also no easy way to do a liveness check for things like kubernetes deployments of the worker process.

Since the worker no longer has a networking layer, we cannot just add an HTTP endpoint, which is the usual way of handling it.

Describe the solution you’d like Some way to support both the internal HealthCheckRegistryService checks (so it appears in the Admin UI) and also support for a generic liveness check.

Describe alternatives you’ve considered Some ideas:

Add a http layer just for the liveness check. This might imply using a Nest Microservice rather than a Standalone app. What are the perf / other implications of this?
Investigate a non-http way of doing a liveness check. Honestly no idea what this might look like, I’m asking on the Nest discord to see if anyone has any idea about this.

Issue Analytics

State:
Created 2 years ago
Comments:6 (5 by maintainers)

Top GitHub Comments

2reactions

michaelbromleycommented, Aug 25, 2021

@artcoded-dev This change will allow the worker to listen for HTTP requests like this:

bootstrapWorker(config)
  .then(worker => worker.startJobQueue())
  .then(worker => worker.startHealthCheckServer({ port: 3020 }))
  .catch(err => {
    console.log(err);
  });

GET http://<host>:3020

{ status: 'ok' }

If your use-case requires some other kind of http endpoint, please provide more details.

2reactions

michaelbromleycommented, Jul 29, 2021

Ran a few performance tests comparing the difference between bootstrapping the worker as a standalone app (the way we currently do it) vs as a full-blown Nest server including http layer. I also created a new WorkerModule which is just the same as the AppModule but omits the ApiModule import, since we would not need that for the worker.

The code looks like this (irrelevant parts elided):

export async function bootstrapWorker(
    userConfig: Partial<VendureConfig>,
): Promise<{ app: INestApplicationContext; startJobQueue: () => Promise<void> }> {
    const mode: 'standalone' | 'server' = 'standalone';
    const module: 'app' | 'worker' = 'worker';
    console.time('bootstrap');

    // ...

    let workerApp: INestApplication | INestApplicationContext;
    let moduleClass: any;
    if (module === 'app') {
        const appModule = await import('./app.module');
        moduleClass = appModule.AppModule;
    } else {
        const workerModule = await import('./worker.module');
        moduleClass = workerModule.WorkerModule;
    }
    if (mode === 'standalone') {
        const appModule = await import('./app.module');
        workerApp = await NestFactory.createApplicationContext(moduleClass, {
            logger: new Logger(),
        });
    } else {
        const workerModule = await import('./worker.module');
        workerApp = await NestFactory.create(moduleClass, {
            logger: new Logger(),
        });
        await (workerApp as any).listen(config.apiOptions.port + 1, config.apiOptions.hostname);
    }
    // ...
    console.log(
        `Mode: ${mode}, module: ${moduleClass.name} -  The script uses approximately ${
            Math.round(used * 100) / 100
        } MB`,
    );
    console.timeEnd('bootstrap');
    return { app: workerApp, startJobQueue };
}

Here are the results (3 runs of each):

standalone with AppModule (status quo)

The script uses approximately 267.68 MB
bootstrap: 2.174s
The script uses approximately 274.61 MB
bootstrap: 1.882s
The script uses approximately 267.47 MB
bootstrap: 1.964s

server with AppModule

The script uses approximately 292.27 MB
bootstrap: 2.466s
The script uses approximately 290.86 MB
bootstrap: 2.474s
The script uses approximately 290.69 MB
bootstrap: 2.486s

standalone with WorkerModule

The script uses approximately 225.57 MB
bootstrap: 552.972ms
The script uses approximately 224.37 MB
bootstrap: 578.833ms
The script uses approximately 224.32 MB
bootstrap: 561.531ms

server with WorkerModule

The script uses approximately 224.01 MB
bootstrap: 418.242ms
The script uses approximately 224.44 MB
bootstrap: 410.18ms
The script uses approximately 224.87 MB
bootstrap: 431.568ms

Conclusion

My intuition that bootstrapping a full server is much heavier than a standalone app is not correct. The main overhead is the ApiModule which, when omitted by using the WorkerModule, has a drastic impact on memory usage and especially startup time.

So it looks perfectly viable to bootstrap the worker as a full NestJS server app, which will give us an HTTP layer we can use for the health check and at the same time reduce startup time by a factor of ~4-5x, (2s down to 400ms) which is also very helpful for the serverless use-case!

API proposal

When bootstrapping the worker as a Nest server, we need to provide a post & hostname. I propose adding an optional 2nd argument to the bootstrapWorker() function which allows these to be provided. If provided, a server is created. If not provided, a standalone app is created. Both server and standalone app return an object implementing INestApplicationContext, so nothing needs to change regarding the return type of the function.

Health Checks

Still to be solved is how to enable a health check indicator for the worker in the Admin UI using the existing HealthCheckRegistryService. Problems:

How do we tell the server at which address to ping the worker?
There can be multiple workers. They can come on and offline during the lifecycle of the server. How do we communicate this to the server so that it can add/remove checks for each worker?

Perhaps we can have an endpoint on the server than can be called to register a worker? Need to think more about this one.