
Decouple topics from tables and generalize SchemaRetriever


Intro

We have had a number of PRs and issues recently that are attempting to do two things:

  1. Support more serialization formats
  2. Support multiple schemas in the same topic

These include:

  • https://github.com/wepay/kafka-connect-bigquery/pull/238
  • https://github.com/wepay/kafka-connect-bigquery/issues/216
  • https://github.com/wepay/kafka-connect-bigquery/issues/175
  • https://github.com/wepay/kafka-connect-bigquery/issues/206
  • https://github.com/wepay/kafka-connect-bigquery/issues/178

And more.

Changes

I believe that we can address these issues with the following changes:

Add TableRouter

  1. Adding a pluggable TableRouter that takes a SinkRecord and returns which table it should be written to. (Default: RegexTableRouter)
public interface TableRouter {
  void configure(Map<String, String> properties);
  TableId getTable(SinkRecord sinkRecord);
}
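
As a rough illustration, the regex-based default could look something like the sketch below. The config keys (router.regex, router.replacement, router.dataset) and the class body are placeholders for this proposal, not an existing API:

import java.util.Map;
import java.util.regex.Pattern;

import com.google.cloud.bigquery.TableId;
import org.apache.kafka.connect.sink.SinkRecord;

// Sketch only: routes each record by rewriting its topic name with a configured
// regex and replacement, then targeting that table in a configured dataset.
public class RegexTableRouter implements TableRouter {
  private Pattern pattern;
  private String replacement;
  private String dataset;

  @Override
  public void configure(Map<String, String> properties) {
    pattern = Pattern.compile(properties.get("router.regex"));
    replacement = properties.get("router.replacement");
    dataset = properties.get("router.dataset");
  }

  @Override
  public TableId getTable(SinkRecord sinkRecord) {
    // If the regex does not match, the topic name is used unchanged.
    String table = pattern.matcher(sinkRecord.topic()).replaceAll(replacement);
    return TableId.of(dataset, table);
  }
}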

Generalize SchemaRetriever

  1. Changing the SchemaRetriever interface to have two methods: Schema retrieveKeySchema(SinkRecord) and Schema retrieveValueSchema(SinkRecord). (Default: IdentitySchemaRetriever, which just returns sinkRecord.keySchema() and sinkRecord.valueSchema())
public interface SchemaRetriever {
  void configure(Map<String, String> properties);
  Schema retrieveKeySchema(SinkRecord sinkRecord);
  Schema retrieveValueSchema(SinkRecord sinkRecord);
}
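
The default here is deliberately trivial; a sketch of what IdentitySchemaRetriever could look like under this proposal:

import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.sink.SinkRecord;

// Sketch only: no external lookups, just the schemas already carried on the record.
public class IdentitySchemaRetriever implements SchemaRetriever {
  @Override
  public void configure(Map<String, String> properties) {
    // No configuration needed.
  }

  @Override
  public Schema retrieveKeySchema(SinkRecord sinkRecord) {
    return sinkRecord.keySchema();
  }

  @Override
  public Schema retrieveValueSchema(SinkRecord sinkRecord) {
    return sinkRecord.valueSchema();
  }
}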

Change *QueryWriter schema update logic

  1. Changing the way that schema updates are handled in the AdaptiveBigQueryWriter.

This change will be the largest. If we remove SchemaRetriever.updateSchema, we need a way for AdaptiveBigQueryWriter to update BQ schemas when a batch insert fails. Given these rules:

https://cloud.google.com/bigquery/docs/managing-table-schemas

This document describes how to modify the schema definitions for existing BigQuery tables. BigQuery natively supports the following schema modifications:

  • Adding columns to a schema definition
  • Relaxing a column’s mode from REQUIRED to NULLABLE
  • It is valid to create a table without defining an initial schema and to add a schema definition to the table at a later time.

All other schema modifications are unsupported and require manual workarounds, including:

  • Changing a column’s name
  • Changing a column’s data type
  • Changing a column’s mode (aside from relaxing REQUIRED columns to NULLABLE)
  • Deleting a column

The correct behavior when a schema failure occurs is to have the adaptive writer union all fields from the insert batch with all fields in the existing BigQuery table. An example illustrates this:

Existing BQ Schema

  a    INTEGER    REQUIRED
  b    INTEGER    NULLABLE

Batch Insert Schemas

  Row 1:  a    INTEGER    NULLABLE
  Row 2:  c    INTEGER    NULLABLE
  Row 3:  a    INTEGER    REQUIRED

Final Schema

  a    INTEGER    NULLABLE
  b    INTEGER    NULLABLE
  c    INTEGER    NULLABLE

This will require that we have access to the insert batch's SinkRecord for each row (not just the RowToInsert). It will also require that AdaptiveBigQueryWriter has the SchemaRetriever wired in as well.

I think the most straightforward way to handle this is to have TableWriterBuilder.addRow(SinkRecord record) instead of RowToInsert. The Builder can then keep a SortedMap<SinkRecord, RowToInsert>, and pass that down the stack through to AdaptiveBigQueryWriter.performWriteRequest. AdaptiveBigQueryWriter.attemptSchemaUpdate can then be changed to implement the field-union logic that I described above.

One area that we’ll have to be particularly careful about is dealing with repeated records and nested structures. They need to be properly unioned as well.
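
To make the union rule concrete, here is a rough sketch operating on BigQuery field lists. It reproduces the example above when applied record by record, but it is limited to flat schemas (nested and repeated fields would need the same union applied recursively), the treatment of fields that are simply absent from one side is an assumption, and the class and method names are illustrative only:

import java.util.LinkedHashMap;
import java.util.Map;

import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FieldList;

public final class FieldUnion {

  // Union the table's current fields with one record schema's fields; applying
  // this repeatedly across the batch yields the final schema.
  public static FieldList union(FieldList tableFields, FieldList recordFields) {
    Map<String, Field> merged = new LinkedHashMap<>();
    for (Field field : tableFields) {
      merged.put(field.getName(), field);
    }
    for (Field field : recordFields) {
      Field existing = merged.get(field.getName());
      if (existing == null) {
        // New field: existing rows have no value for it, so it must be NULLABLE.
        merged.put(field.getName(), asNullable(field));
      } else if (existing.getMode() == Field.Mode.REQUIRED
          && field.getMode() == Field.Mode.REQUIRED) {
        // REQUIRED on both sides: keep it REQUIRED.
        merged.put(field.getName(), existing);
      } else {
        // NULLABLE (or unset) on either side: relax to NULLABLE.
        merged.put(field.getName(), asNullable(existing));
      }
    }
    // Open question, not handled here: whether a REQUIRED table field that a
    // record schema omits entirely should also be relaxed.
    return FieldList.of(merged.values());
  }

  private static Field asNullable(Field field) {
    return field.toBuilder().setMode(Field.Mode.NULLABLE).build();
  }
}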

Benefits

This approach should give us a ton of flexibility including:

  • It will allow you to pick which table each individual message is routed to, based on all of the information in the SinkRecord (topic, key schema, value schema, message payload, etc.)
  • It will allow us to fix some known schema-evolution bugs in cases where one field is added and another is dropped.
  • It will allow us to use SinkRecord's .keySchema and .valueSchema rather than talking to the schema registry for schemas.
  • It will make it easy to support JSON messages, even those that return null for the .keySchema() and .valueSchema() methods; you can implement a custom retriever for this case (see the sketch below).
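
As an illustration of that last point, a hypothetical retriever for schemaless JSON (e.g. JsonConverter with schemas.enable=false, where the value arrives as a Map and valueSchema() is null) could infer a Connect schema from the value itself. The class name and the flat, type-sniffing inference below are assumptions, not part of this proposal:

import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.sink.SinkRecord;

// Sketch only: builds a flat Connect schema from a schemaless Map value;
// nested maps and arrays are ignored for brevity.
public class JsonMapSchemaRetriever implements SchemaRetriever {
  @Override
  public void configure(Map<String, String> properties) { }

  @Override
  public Schema retrieveKeySchema(SinkRecord sinkRecord) {
    return sinkRecord.keySchema(); // keys are left untouched in this sketch
  }

  @Override
  public Schema retrieveValueSchema(SinkRecord sinkRecord) {
    if (sinkRecord.valueSchema() != null) {
      return sinkRecord.valueSchema();
    }
    Map<?, ?> value = (Map<?, ?>) sinkRecord.value();
    SchemaBuilder builder = SchemaBuilder.struct();
    for (Map.Entry<?, ?> entry : value.entrySet()) {
      builder.field(entry.getKey().toString(), inferFieldSchema(entry.getValue()));
    }
    return builder.build();
  }

  private Schema inferFieldSchema(Object fieldValue) {
    if (fieldValue instanceof Long || fieldValue instanceof Integer) {
      return Schema.OPTIONAL_INT64_SCHEMA;
    } else if (fieldValue instanceof Double || fieldValue instanceof Float) {
      return Schema.OPTIONAL_FLOAT64_SCHEMA;
    } else if (fieldValue instanceof Boolean) {
      return Schema.OPTIONAL_BOOLEAN_SCHEMA;
    }
    return Schema.OPTIONAL_STRING_SCHEMA;
  }
}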


Top GitHub Comments

criccomini commented, Jul 16, 2020 (4 reactions)

(We are using the SMT approach now. PR is forthcoming 😃)

sebco59 commented, Jun 3, 2020 (2 reactions)

@criccomini Nice!

On our side, we tried to use Single Message Transforms (SMTs) to deal with multiple schemas instead of a TableRouter implementation.

But SMTs don't currently work because of the way the schema is retrieved in SchemaRegistrySchemaRetriever#getSubject. If we change that, it opens the door to all SMTs, with no need to implement a TableRouter.

  • With SMTs, we could move a lot of complexity out of the connector (see the config sketch after this list):
    • we want to deal with multiple subjects -> an SMT to change the topic name on the fly
    • we want to add Kafka system information (offset, partition, …) -> an SMT to add fields with these values on the fly
    • we want to route a topic to a specific BQ table depending on a field -> an SMT to change the topic name with a regex
    • we want to add the record's headers to the BQ record -> an SMT to add fields… and more
  • With SMTs, we get Kafka Connect's default error handling.
  • With SMTs, we could reuse them in other connectors.
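
For reference, wiring an SMT into the connector config could look like the snippet below, using Kafka Connect's built-in RegexRouter as an example (it rewrites the topic name from a regex; routing on a record field would need a custom SMT). The topic pattern and replacement here are made up:

# Rewrites topics like kcbq-orders to orders before the connector maps them to tables.
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=kcbq-(.*)
transforms.route.replacement=$1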

We have started developing the schema retriever logic, but we tried to decouple it from the schema update logic. With the current implementation, the BQ table schema updates are based entirely on the latest version available in the schema registry, which doesn't handle messages with an older, incompatible schema well. With the schema retriever, it's still straightforward to handle homogeneous batches (as far as the schema is concerned): you can simply use any message from the batch to update the BQ table schema. To handle heterogeneous batches, we could rely on retry logic that incrementally updates the BQ table schema until it stabilizes (using the first failed message in the batch to try to update the BQ table schema). That would be a simple alternative until the field-union logic is implemented.

What do you think of this approach (SMT and schema update logic)? Can we contribute too?

Update: You can find our quick-and-dirty dev work at sebco59/kafka-connect-bigquery -> branch allow-smt
