Decouple topics from tables and generalize SchemaRetriever
Intro
We have had a number of PRs and issues recently that are attempting to do two things:
- Support more serialization formats
- Support multiple schemas in the same topic
These include:
- https://github.com/wepay/kafka-connect-bigquery/pull/238
- https://github.com/wepay/kafka-connect-bigquery/issues/216
- https://github.com/wepay/kafka-connect-bigquery/issues/175
- https://github.com/wepay/kafka-connect-bigquery/issues/206
- https://github.com/wepay/kafka-connect-bigquery/issues/178
And more.
Changes
I believe that we can address these issues with the following changes:
Add TableRouter
- Adding a pluggable `TableRouter` that takes a `SinkRecord` and returns which table it should be written to. (Default: `RegexTableRouter`)
```java
public interface TableRouter {
  void configure(Map<String, String> properties);
  TableId getTable(SinkRecord sinkRecord);
}
```
Generalize SchemaRetriever
- Changing the `SchemaRetriever` interface to have two methods: `Schema getKeySchema(SinkRecord)` and `Schema getValueSchema(SinkRecord)`. (Default: `IdentitySchemaRetriever`, which just returns `sinkRecord.keySchema()` and `sinkRecord.valueSchema()`)
```java
public interface SchemaRetriever {
  void configure(Map<String, String> properties);
  Schema retrieveKeySchema(SinkRecord sinkRecord);
  Schema retrieveValueSchema(SinkRecord sinkRecord);
}
```
Change *QueryWriter schema update logic
- Changing the way that schema updates are handled in the `AdaptiveBigQueryWriter`.

This change will be the largest. If we remove `SchemaRetriever.updateSchema`, we need a way for `AdaptiveBigQueryWriter` to update BQ schemas when a batch insert fails. Given these rules:
https://cloud.google.com/bigquery/docs/managing-table-schemas
> This document describes how to modify the schema definitions for existing BigQuery tables. BigQuery natively supports the following schema modifications:
> - Adding columns to a schema definition
> - Relaxing a column’s mode from REQUIRED to NULLABLE
> - It is valid to create a table without defining an initial schema and to add a schema definition to the table at a later time.
>
> All other schema modifications are unsupported and require manual workarounds, including:
> - Changing a column’s name
> - Changing a column’s data type
> - Changing a column’s mode (aside from relaxing REQUIRED columns to NULLABLE)
> - Deleting a column
The correct behavior when a schema failure occurs is to have the adaptive writer union all fields from the insert batch with all fields in the existing BigQuery table. An example illustrates:
Existing BQ Schema
a INTEGER REQUIRED
b INTEGER NULLABLE
Batch Insert Schemas
Row 1
a INTEGER NULLABLE
Row 2
c INTEGER NULLABLE
Row 3
a INTEGER REQUIRED
Final Schema
a INTEGER NULLABLE
b INTEGER NULLABLE
c INTEGER NULLABLE
This will require that we have access to the insert batch’s `SinkRecord` for each row (not just the `RowToInsert`). It will also require that `AdaptiveBigQueryWriter` has the `SchemaRetriever` wired in as well.
I think the most straightforward way to handle this is to have `TableWriterBuilder.addRow(SinkRecord record)` instead of `RowToInsert`. The `Builder` can then keep a `SortedMap<SinkRecord, RowToInsert>`, and pass that down the stack through to `AdaptiveBigQueryWriter.performWriteRequest`. `AdaptiveBigQueryWriter.attemptSchemaUpdate` can then be changed to implement the field-union logic that I described above.
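A rough sketch of that builder change, assuming we order the map by topic, partition, and offset since `SinkRecord` is not `Comparable` (the record-to-row conversion is elided):

```java
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;
import org.apache.kafka.connect.sink.SinkRecord;

// Sketch: the builder accepts SinkRecords and keeps each one paired with its
// converted row, so the writer can still see the original records on failure.
public class TableWriterBuilder {
  // SinkRecord is not Comparable, so order by position in the log (an assumption).
  private static final Comparator<SinkRecord> BY_POSITION =
      Comparator.comparing(SinkRecord::topic)
          .thenComparing(SinkRecord::kafkaPartition)
          .thenComparingLong(SinkRecord::kafkaOffset);

  private final SortedMap<SinkRecord, RowToInsert> rows = new TreeMap<>(BY_POSITION);

  public TableWriterBuilder addRow(SinkRecord record) {
    rows.put(record, convertRecord(record));
    return this;
  }

  private RowToInsert convertRecord(SinkRecord record) {
    // The connector's existing record-to-row conversion would be invoked here.
    throw new UnsupportedOperationException("conversion elided in this sketch");
  }
}
```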
One area that we’ll have to be particularly careful about is dealing with repeated records and nested structures. They need to be properly unioned as well.
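To make the union rule concrete, here is a minimal sketch using the BigQuery client’s `Schema` and `Field` types. Something like `AdaptiveBigQueryWriter.attemptSchemaUpdate` could fold this over the schema of every record in the failed batch; the class and helper names are illustrative only:

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Field.Mode;
import com.google.cloud.bigquery.Schema;

// Sketch of the field-union step: every field present in either schema survives,
// new columns are added as NULLABLE, and REQUIRED columns that a record relaxes
// or omits become NULLABLE -- the only modifications BigQuery allows.
public class SchemaUnionizer {

  public static Schema union(Schema existing, Schema incoming) {
    Map<String, Field> merged = new LinkedHashMap<>();
    for (Field field : existing.getFields()) {
      merged.put(field.getName(), field);
    }

    Set<String> incomingNames = new HashSet<>();
    for (Field field : incoming.getFields()) {
      incomingNames.add(field.getName());
      Field current = merged.get(field.getName());
      if (current == null) {
        // New column: must be NULLABLE, since existing rows carry no value for it.
        merged.put(field.getName(), relaxToNullable(field));
      } else if (current.getMode() == Mode.REQUIRED && field.getMode() != Mode.REQUIRED) {
        // REQUIRED -> NULLABLE is an allowed modification.
        merged.put(field.getName(), relaxToNullable(current));
      }
      // Nested RECORD fields would need to be unioned recursively here, and
      // REPEATED fields handled with care -- both are elided in this sketch.
    }

    // A REQUIRED column that the incoming schema omits entirely must also be
    // relaxed (this is what makes `a` NULLABLE in the example above).
    for (Field field : existing.getFields()) {
      if (field.getMode() == Mode.REQUIRED && !incomingNames.contains(field.getName())) {
        merged.put(field.getName(), relaxToNullable(field));
      }
    }

    return Schema.of(merged.values().toArray(new Field[0]));
  }

  private static Field relaxToNullable(Field field) {
    return field.toBuilder().setMode(Mode.NULLABLE).build();
  }
}
```

Folding this over the three row schemas in the example above reproduces the final schema shown: `a` is relaxed to NULLABLE, `b` is untouched, and `c` is added as NULLABLE.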
Benefits
This approach should give us a ton of flexibility, including:
- It will allow you to pick which table to route each individual message to, based on all of the information in the `SinkRecord` (topic, key schema, value schema, message payload, etc.)
- It will allow us to fix some known schema-evolution bugs in cases where one field is added and another is dropped.
- It will allow us to use `SinkRecord`'s `.keySchema()` and `.valueSchema()` rather than talking to the schema registry for schemas.
- It will make it easy to support JSON messages, even those that return null for the `.keySchema()` and `.valueSchema()` methods; you can implement a custom retriever for this case.
Top GitHub Comments
(We are using SMT approach now. PR is forthcoming 😃 )
@criccomini Nice!
On our side, we tried to use Simple Message Transformations to deal with multiple schemas instead of a TableRouter implementation. But SMTs don’t work because of the way the schema is retrieved in `SchemaRegistrySchemaRetriever#getSubject`. If we change that, it will open the door for all SMTs, with no need to implement a TableRouter.

We have started developing schema retriever logic, but we tried to decouple it from the schema update logic. With the current implementation, the BQ table schema updates are entirely based on the latest version available in the schema registry, which doesn’t do well with messages carrying an older, incompatible schema. With the schema retriever, it’s still straightforward to handle homogeneous batches (as far as the schema is concerned): you can simply use any message from the batch to try to update the BQ table schema. To handle heterogeneous batches, we could rely on retry logic that incrementally updates the BQ table schema until it gets stable (using the first failed message in the batch to drive each update). That would be a simple alternative until the field-union logic is implemented.
What do you think of this approach (SMT and schema update logic)? Can we contribute too?
Update: You can find our quick-n-dirty dev here: sebco59/kafka-connect-bigquery -> branch allow-smt