[RFC] Catalog Event Stream Endpoint

Status: Open for comments and community contribution

Need

The Backstage catalog is designed to be a central hub of information within an organization. It typically contains information about an organization’s software, resources, and the structure of the organization itself. It achieves this by collecting data from external services and then presenting that data using a unified data model.

While the catalog might collect quite a lot of information from these external services, it is best not to ingest all of this data into the catalog in too much detail. Attempting to do so will typically lead to a bloated data model that forces in data from too many domains, which in turn makes it harder to work with and evolve the catalog. It is also likely to cause reliability issues because of the increasing number of services that the catalog depends on, and risks the catalog simply assuming too much responsibility within an organization.

A great pattern to use instead is to have external services store data that is associated with entities in the catalog. A service would typically use catalog entity names as keys of its data store, and possibly synchronize its data store with the catalog. One example of a service that uses this pattern is the TechDocs backend, which does not store its content within the catalog, but instead uses entity names as keys for its internal or external documentation storage. Other examples of open source plugins that use this pattern are the search, todo, and tech-insights backends, and there is also a use-case with external service catalogs described in #8162. We’ve also learned of instances of this pattern being used within adopting organizations, and it is commonly used in services related to our catalog at Spotify as well.
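
As a rough sketch of this pattern (the table name, column names, and saveDocsLocation function below are illustrative, not an existing API), a plugin backend might key its own storage on entity refs rather than copying its data into the catalog:

import { Knex } from 'knex';

// Hypothetical plugin-side storage keyed on catalog entity refs.
// The entity ref string ("component:default/foo") identifies the catalog entity;
// the table and columns are purely illustrative, and a unique index on
// entity_ref is assumed for the upsert to work.
async function saveDocsLocation(db: Knex, entityRef: string, storageUrl: string) {
  await db('techdocs_locations')
    .insert({ entity_ref: entityRef, storage_url: storageUrl })
    .onConflict('entity_ref')
    .merge();
}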

Implementing a service that synchronizes itself with the catalog in a quick and dirty way is quite simple: you fetch all the current entities in the catalog, and then update your database accordingly. This is however wasteful on both the catalog and consuming service ends, and if the client needs to compute a delta of the updates there’s a fair amount of complexity involved in that as well. It also won’t get anywhere close to realtime update latencies; it is more realistic for updates to happen once a minute or much more rarely.
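
As a minimal sketch of that quick-and-dirty approach (the LocalStore interface below is hypothetical; only the catalog client call is an existing API), the consuming service just re-reads everything and rebuilds its own view:

import { CatalogClient } from '@backstage/catalog-client';
import { stringifyEntityRef } from '@backstage/catalog-model';

// Hypothetical local store owned by the consuming service.
interface LocalStore {
  replaceAll(rows: { entityRef: string; entity: unknown }[]): Promise<void>;
}

// Naive full synchronization: fetch every entity and rebuild the local store from
// scratch. Computing a delta against the previous state would have to happen in
// here as well, which is where much of the complexity lives.
async function fullSync(catalogClient: CatalogClient, store: LocalStore) {
  const { items } = await catalogClient.getEntities();
  await store.replaceAll(
    items.map(entity => ({ entityRef: stringifyEntityRef(entity), entity })),
  );
}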

I think there is space for providing a much better primitive for building services that integrate with the catalog. One that makes it much simpler to keep up to date with the catalog, and enables services to react to updates in a much more efficient way and with lower delay.

Proposal

I propose that we extend the catalog REST API with an /events endpoint. This endpoint would expose a single linearized stream of events that would function much like a Kafka stream. The consumption would be based on offsets, and it would be up to each consuming service to keep track of its own offset as it consumes the stream. There would be no server-side state except for the list of events, meaning any number of services can consume this API in parallel.

The following is a high level view of what this API might look like:

GET /api/catalog/events

200 OK
{
  "lastEventOffset": 74,
  "events": [{
    "type": "added",
    "offset": 72,
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    }
  }, {
    "type": "updated",
    "offset": 73,
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    }
  }, {
    "type": "removed",
    "offset": 74,
    "entityRef": "component:default/foo"
  }]
}

GET /api/catalog/events?offset=74

200 OK
{
  "lastEventOffset": 75,
  "events": [{
    "type": "removed",
    "offset": 75,
    "entityRef": "component:default/bar"
  }]
}
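
To make the shape concrete, the responses above could be described with TypeScript types roughly like the following (the type names are illustrative and not part of any existing package):

import { Entity } from '@backstage/catalog-model';

// Illustrative types mirroring the example responses above; the exact
// format is intentionally left open for discussion.
type CatalogEvent =
  | { type: 'added' | 'updated'; offset: number; entityRef: string; entity: Entity }
  | { type: 'removed'; offset: number; entityRef: string };

interface CatalogEventsResponse {
  // Highest offset known to the catalog at the time of the response
  lastEventOffset: number;
  // Events with offsets greater than the `offset` query parameter, in order
  events: CatalogEvent[];
}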

I kept the response format minimal for now and the exact format can be discussed later, let’s not spend time there x). The important thing is that consuming events from the catalog is a simple REST API call with an offset. We also have the option to use a cursor instead, as long as each event receives its own cursor so that it’s possible to consume events one by one on the caller side. An optimization that could be added on top of that is the ability to filter the stream by, for example, entity kind, which would result in simply leaving gaps in the offset sequence (see the hypothetical request below). Furthermore, the endpoint would likely be a long-polling endpoint, meaning incoming requests would be left open for a while if there are no events to return.
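
As a sketch of how kind filtering might leave gaps in the offset sequence (the filter query parameter below is hypothetical and not a proposed final syntax):

GET /api/catalog/events?offset=74&filter=kind=component

200 OK
{
  "lastEventOffset": 79,
  "events": [{
    "type": "updated",
    "offset": 76,
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    }
  }, {
    "type": "removed",
    "offset": 79,
    "entityRef": "component:default/bar"
  }]
}

The missing offsets (75, 77, and 78 in this made-up example) would simply belong to events for entities of other kinds.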

An important aspect is also how to bootstrap the consuming service. For this I propose that we add the current event stream offset to the response from the /entities endpoint, probably by making it return an object at the root, or otherwise via a header.

Bootstrapping a service would then look something like this:

GET /api/catalog/entities

200 OK
{
  "pageInfo": { ... },
  "lastEventOffset": 105,
  "items": [ ... entities ... ]
}

GET /api/catalog/events?offset=105

200 OK
{
  "lastEventOffset": 106,
  "events": [{
    "type": "removed",
    "offset": 106,
    "entityRef": "component:default/foo"
  }]
}

By providing the offset in the /entities response, we can ensure that we don’t miss any events that happen between the two calls. The service can initialize its data store based on the initial entities call, and then start consuming the event stream in a reliable way.

Consumer implementation

This is a mock implementation of a consumer of the events endpoint, just to get an idea of what it could look like.

import { LockManager } from '@backstage/backend-tasks';

// In case multiple instances are running in parallel we use a common locking utility
// to make sure that only one instance is consuming the stream at a time.
await lockManager.withLock('catalog-events', async () => {
  // We start by fetching the offset from our store that we should start consuming at
  let currentOffset = await store.getEventsOffset();

  // If we don't have an offset stored we trigger a call to the /entities
  // endpoint and initialize our data store with the entities returned.
  if (!currentOffset) {
    currentOffset = await store.initialize();
  }

  while (running) {
    // We continuously fetch events from the catalog
    const { events } = await catalogClient.events({ offset: currentOffset });

    // Events are consumed and committed one by one in this example
    for (const event of events) {
      // This applies modifications and commits the offset of the consumed event
      // If our processing gets interrupted a different instance will pick up where
      // we left off by starting at the most recently committed offset.
      await store.transaction(async tx => {
        currentOffset = event.offset;
        await tx.setEventOffset(event.offset);

        // Any other business logic. This doesn't necessarily have to be done with
        // transactions, we could for example also make sure our event consumption
        // is idempotent and only store the offset once each event is fully consumed.
        await tx.consumeEvent(event);
      });
    }
  }
});

There are a couple of assumptions in this implementation, please scrutinize thoroughly 😁

Implementation Proposal

The events would likely be persisted in the catalog database, which would have to be done in a way that works with multiple catalog instances sharing the same database. We would likely not persist events forever, and could either simply expire old events, run compactions to remove redundant events, or both.
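
As a rough sketch of what such persistence might look like (the table and column names are hypothetical, not an existing part of the catalog schema), a Knex-style migration could define a monotonically increasing offset together with enough data to serve the /events endpoint and to expire or compact old rows:

import { Knex } from 'knex';

// Hypothetical events table; all names and columns are illustrative only.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.createTable('catalog_events', table => {
    // Auto-incrementing offset provides the single linearized stream order
    table.bigIncrements('event_offset');
    table.string('event_type').notNullable(); // 'added' | 'updated' | 'removed'
    table.string('entity_ref').notNullable().index();
    // Full entity body for added/updated events; null for removals
    table.text('entity_json').nullable();
    // Used when expiring old events and/or running compactions
    table.timestamp('created_at').defaultTo(knex.fn.now()).notNullable();
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.dropTable('catalog_events');
}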

Some good news is that we already have a synchronization point built into the catalog processing where all entity updates happen, namely within the catalog Stitcher. My hope is that as we write rows to the final_entities table, we can simultaneously write to the new events table in a safe way.
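
A minimal sketch of that hope, assuming the stitcher’s database writes happen inside a Knex transaction (the writeStitchedEntity function and all column names are illustrative and do not match the real final_entities schema):

import { Knex } from 'knex';

// Hypothetical: append an event in the same transaction that updates the stitched
// entity, so the event stream cannot drift out of sync with what the catalog serves.
async function writeStitchedEntity(tx: Knex.Transaction, entityRef: string, finalEntity: string) {
  // Illustrative update of the stitched entity; the real final_entities schema differs.
  await tx('final_entities')
    .where({ entity_ref: entityRef })
    .update({ final_entity: finalEntity });

  // The event offset is assigned by the auto-incrementing column sketched above.
  await tx('catalog_events').insert({
    event_type: 'updated',
    entity_ref: entityRef,
    entity_json: finalEntity,
  });
}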

Some bad news is that I expect the implementation of deletions to be a bit more complex. They are currently done through a cascading delete via the refresh_state table in the mother of all queries. A way around this is perhaps to introduce a few more steps in the database access during deletions, but we would need to be careful not to make access to the events table a bottleneck.

If this RFC is accepted, the implementation of it is open to community contributions, as it will not be a focus for the maintainers for some time. Please respond here or reach out if you are interested in working on this, and we can assist in navigating the catalog backend and any other challenges that show up. And of course this entire RFC is a proposed solution, and we are always open to other ideas as well!

Alternatives

One alternative is to provide the event stream as a TypeScript interface. I worry both that this could easily cause a performance hit if too much work is done in the callback, and that it does not provide nearly the same guarantees as an implementation that is more directly tied into the catalog database code.
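
For reference, such an in-process interface might look roughly like the following (the names are hypothetical; this is not an existing Backstage API):

import { Entity } from '@backstage/catalog-model';

// Hypothetical in-process subscription API, as an alternative to the REST endpoint.
// Slow subscribers could hold up catalog processing, and delivery is only as
// durable as the running process, which is where the weaker guarantees come from.
interface CatalogEventSubscriber {
  onEntityChanged(event: {
    type: 'added' | 'updated' | 'removed';
    entityRef: string;
    entity?: Entity;
  }): Promise<void>;
}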

Another option could be that we don’t package this in the form of a REST API endpoint, but rather as a connector for various popular reliable or even unreliable messaging systems. The internal implementation could still be quite similar to the one proposed in this RFC, but it lets us lean more heavily on established systems for the event stream implementation. My worry with this is that it will make open source plugins suffer, because we either need to provide generalized client APIs for these event stream systems, or backends need to implement support for several different systems. It also adds the burden of having to manage the deployment of these systems, even for very small scale Backstage deployments. I’m also hoping that even though the REST API might be the main way of consuming the event stream, we might still be able to provide connectors that publish these events to one’s favorite flavor of messaging system.

We could also explore an option where the catalog uses either reliable or unreliable webhooks to signal the external services. I’m not quite sure how that would compare to the proposed solution from the consumer’s point of view, but I think especially a reliable webhook delivery implementation in the catalog would become quite complex.

Yet another alternative here is to not pursue a solution that gives us higher guarantees of correctness, but rather simply post events in a more best-effort way and make sure that consuming services occasionally do a full synchronization with the catalog. The event stream is then treated more as an optimization and something that provides more timely updates, rather than as a complete solution for integrating with external services. For some use-cases this might work well, but it might also cause issues for those that want to rely on more correct data and strict event ordering.

Risks

The solution is only allowed to have a minimal impact on catalog performance, and there is definitely a risk of the catalog taking a performance hit. This is something to consider as part of the design, and likely something to benchmark to ensure that the impact is acceptable.

Another risk is that the proposed consumption pattern is actually not that easy to implement for the external services, especially when services are scaled horizontally and you need to make sure events are only consumed once. It’s possible there could be some additions to the API that help provide some utility here, like for example having consumer groups where the catalog only delivers events to a single consumer from each group at a time (sketched below). Either way it is an area where I’d love to hear from the community and people that are interested in this problem space.
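
Purely as an illustration of the consumer group idea (the group parameter and its delivery semantics are hypothetical and not part of the proposal above):

GET /api/catalog/events?offset=105&group=my-service

200 OK
{
  "lastEventOffset": 106,
  "events": [{
    "type": "removed",
    "offset": 106,
    "entityRef": "component:default/foo"
  }]
}

While one instance of my-service is handed event 106, other instances polling with the same group would not receive it, avoiding duplicate consumption.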

Top GitHub Comments

Xantier commented, Nov 27, 2021 (2 reactions)

This is excellent 👍

Some things @freben mentioned came to mind around this for me as well.

From the original RFC:

The events would likely be persisted in the catalog database, which would have to be done in a way that works with multiple catalog instances sharing the same database. We would likely not persist events forever, and could either straight up expire old events, and/or run compactions to remove redundant events.

How about if we created a separate plugin/package with its own backing database that we can use as this “makeshift queue/pubsub”? Within that plugin the initial inbound implementation could handle catalog events, and the initial outbound implementation could be an endpoint exposing those events. That way, adding more downstream implementations (SQS FIFO(s), SNS, Kafka etc.) as well as upstream producers (exposing Backstage webhooks, listening to the GitHub event stream, Jenkins triggering updates etc.) would remain open by adding extension modules to this plugin.

Granted, the data model with that one will get a bit more complicated, and the inability to use the same DB with triggers (or binlog capture) as the source would make reliability a bit more problematic to implement. We would gain scalability and the ability for integrators to swap the event storage for their own implementation this way, though. I foresee, for example, a good use case to dump these into Kafka directly, or alternatively change the backing DB to AWS DynamoDB and use its native event stream directly, still keeping the same known data model that Backstage would expose.

zhammer commented, May 12, 2022 (1 reaction)

Going to follow work on this! We’re curious about pushing change events from the backstage catalog through our notification system the same way we do from other internal/platform tools.

For us, this doesn’t have to be a Kafka-like endpoint. We’re open to alternatives, like a lightweight hook system in our backstage application along the lines of:

const builder = await CatalogBuilder.create(env);
builder.onEvent = (event: Event) => { ... }; 

Could be a nice first step towards a persisted event stream? I’m looking at the “life of an entity” docs and imagine an event emitter would come after the Stitching phase.
