Decouple topics from tables and generalize SchemaRetriever
Intro
We have had a number of PRs and issues recently that are attempting to do two things:
- Support more serialization formats
- Support multiple schemas in the same topic
These include:
- https://github.com/wepay/kafka-connect-bigquery/pull/238
- https://github.com/wepay/kafka-connect-bigquery/issues/216
- https://github.com/wepay/kafka-connect-bigquery/issues/175
- https://github.com/wepay/kafka-connect-bigquery/issues/206
- https://github.com/wepay/kafka-connect-bigquery/issues/178
And more.
Changes
I believe that we can address these issues with the following changes:
Add TableRouter
- Adding a pluggable `TableRouter` that takes a `SinkRecord` and returns which table it should be written to. (Default: `RegexTableRouter`)
```java
public interface TableRouter {
  void configure(Map<String, String> properties);
  TableId getTable(SinkRecord sinkRecord);
}
```
Generalize SchemaRetriever
- Changing the `SchemaRetriever` interface to have two methods: `Schema getKeySchema(SinkRecord)` and `Schema getValueSchema(SinkRecord)`. (Default: `IdentitySchemaRetriever`, which just returns `sinkRecord.keySchema()` and `sinkRecord.valueSchema()`)
```java
public interface SchemaRetriever {
  void configure(Map<String, String> properties);
  Schema retrieveKeySchema(SinkRecord sinkRecord);
  Schema retrieveValueSchema(SinkRecord sinkRecord);
}
```
Change *QueryWriter schema update logic
- Changing the way that schema updates are handled in the `AdaptiveBigQueryWriter`.

This change will be the largest. If we remove `SchemaRetriever.updateSchema`, we need a way for `AdaptiveBigQueryWriter` to update BQ schemas when a batch insert fails. Given these rules:
https://cloud.google.com/bigquery/docs/managing-table-schemas
> This document describes how to modify the schema definitions for existing BigQuery tables. BigQuery natively supports the following schema modifications:
> - Adding columns to a schema definition
> - Relaxing a column’s mode from REQUIRED to NULLABLE
> - It is valid to create a table without defining an initial schema and to add a schema definition to the table at a later time.
>
> All other schema modifications are unsupported and require manual workarounds, including:
> - Changing a column’s name
> - Changing a column’s data type
> - Changing a column’s mode (aside from relaxing REQUIRED columns to NULLABLE)
> - Deleting a column
The correct behavior when a schema failure occurs is to have the adaptive writer union all fields from the insert batch with all fields in the existing BigQuery table. An example illustrates:
Existing BQ Schema
a INTEGER REQUIRED
b INTEGER NULLABLE
Batch Insert Schemas
Row 1
a INTEGER NULLABLE
Row 2
c INTEGER NULLABLE
Row 3
a INTEGER REQUIRED
Final Schema
a INTEGER NULLABLE
b INTEGER NULLABLE
c INTEGER NULLABLE
This will require that we have access to the insert batch’s `SinkRecord` for each row (not just the `RowToInsert`). It will also require that `AdaptiveBigQueryWriter` has the `SchemaRetriever` wired in as well.
I think the most straightforward way to handle this is to have `TableWriterBuilder.addRow(SinkRecord record)` instead of `RowToInsert`. The `Builder` can then keep a `SortedMap<SinkRecord, RowToInsert>`, and pass that down the stack through to `AdaptiveBigQueryWriter.performWriteRequest`. `AdaptiveBigQueryWriter.attemptSchemaUpdate` can then be changed to implement the field-union logic that I described above.
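A rough sketch of that builder change, assuming we order the map by topic, partition, and offset since `SinkRecord` is not `Comparable` (the record-to-row conversion is elided):

```java
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;
import org.apache.kafka.connect.sink.SinkRecord;

// Sketch: the builder accepts SinkRecords and keeps each one paired with its
// converted row, so the writer can still see the original records on failure.
public class TableWriterBuilder {
  // SinkRecord is not Comparable, so order by position in the log (an assumption).
  private static final Comparator<SinkRecord> BY_POSITION =
      Comparator.comparing(SinkRecord::topic)
          .thenComparing(SinkRecord::kafkaPartition)
          .thenComparingLong(SinkRecord::kafkaOffset);

  private final SortedMap<SinkRecord, RowToInsert> rows = new TreeMap<>(BY_POSITION);

  public TableWriterBuilder addRow(SinkRecord record) {
    rows.put(record, convertRecord(record));
    return this;
  }

  private RowToInsert convertRecord(SinkRecord record) {
    // The connector's existing record-to-row conversion would be invoked here.
    throw new UnsupportedOperationException("conversion elided in this sketch");
  }
}
```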
One area that we’ll have to be particularly careful about is dealing with repeated records and nested structures. They need to be properly unioned as well.
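To make the union rule concrete, here is a minimal sketch using the BigQuery client’s `Schema` and `Field` types. Something like `AdaptiveBigQueryWriter.attemptSchemaUpdate` could fold this over the schema of every record in the failed batch; the class and helper names are illustrative only:

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Field.Mode;
import com.google.cloud.bigquery.Schema;

// Sketch of the field-union step: every field present in either schema survives,
// new columns are added as NULLABLE, and REQUIRED columns that a record relaxes
// or omits become NULLABLE -- the only modifications BigQuery allows.
public class SchemaUnionizer {

  public static Schema union(Schema existing, Schema incoming) {
    Map<String, Field> merged = new LinkedHashMap<>();
    for (Field field : existing.getFields()) {
      merged.put(field.getName(), field);
    }

    Set<String> incomingNames = new HashSet<>();
    for (Field field : incoming.getFields()) {
      incomingNames.add(field.getName());
      Field current = merged.get(field.getName());
      if (current == null) {
        // New column: must be NULLABLE, since existing rows carry no value for it.
        merged.put(field.getName(), relaxToNullable(field));
      } else if (current.getMode() == Mode.REQUIRED && field.getMode() != Mode.REQUIRED) {
        // REQUIRED -> NULLABLE is an allowed modification.
        merged.put(field.getName(), relaxToNullable(current));
      }
      // Nested RECORD fields would need to be unioned recursively here, and
      // REPEATED fields handled with care -- both are elided in this sketch.
    }

    // A REQUIRED column that the incoming schema omits entirely must also be
    // relaxed (this is what makes `a` NULLABLE in the example above).
    for (Field field : existing.getFields()) {
      if (field.getMode() == Mode.REQUIRED && !incomingNames.contains(field.getName())) {
        merged.put(field.getName(), relaxToNullable(field));
      }
    }

    return Schema.of(merged.values().toArray(new Field[0]));
  }

  private static Field relaxToNullable(Field field) {
    return field.toBuilder().setMode(Mode.NULLABLE).build();
  }
}
```

Folding this over the three row schemas in the example above reproduces the final schema shown: `a` is relaxed to NULLABLE, `b` is untouched, and `c` is added as NULLABLE.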
Benefits
This approach should give us a ton of flexibility, including:
- It will allow you to pick which table to route each individual message to, based on all of the information in the `SinkRecord` (topic, key schema, value schema, message payload, etc.)
- It will allow us to fix some known schema-evolution bugs in cases where one field is added and another is dropped.
- It will allow us to use `SinkRecord`'s `.keySchema()` and `.valueSchema()` rather than talking to the schema registry for schemas.
- It will make it easy to support JSON messages, even those that return null for the `.keySchema()` and `.valueSchema()` methods; you can implement a custom retriever for this case.
Top GitHub Comments
(We are using SMT approach now. PR is forthcoming 😃 )
@criccomini Nice!
On our side, we tried to use Simple Message Transformations to deal with multiple schemas instead of a TableRouter implementation. But SMTs don’t work because of the way the schema is retrieved in `SchemaRegistrySchemaRetriever#getSubject`. If we change that, it will open the door for all SMTs, with no need to implement a TableRouter.

We have started developing schema retriever logic, but we tried to decouple it from the schema update logic. With the current implementation, the BQ table schema updates are entirely based on the latest version available in the schema registry, which doesn’t do well with messages carrying an older, incompatible schema. With the schema retriever, it’s still straightforward to handle homogeneous batches (as far as the schema is concerned): you can simply use any message from the batch to try to update the BQ table schema. To handle heterogeneous batches, we could rely on retry logic that incrementally updates the BQ table schema until it gets stable (using the first failed message in the batch to drive each update). That would be a simple alternative until the field-union logic is implemented.
What do you think of this approach (SMT and schema update logic)? Can we contribute too?
Update: You can find our quick-n-dirty dev here: sebco59/kafka-connect-bigquery -> branch allow-smt