Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

Hi Team,

I have a use case where I need to ingest data with Deltastreamer from a Kafka topic that is populated by the Debezium connector. The topic's schema contains fields such as before, after, ts_ms, op, and source. I am providing the record key as after.id and the precombine key as after.timestamp, but the entire Debezium output is still being ingested.

Please find my properties below:

hoodie.upsert.shuffle.parallelism=2
hoodie.insert.shuffle.parallelism=2
hoodie.delete.shuffle.parallelism=2
hoodie.bulkinsert.shuffle.parallelism=2
hoodie.embed.timeline.server=true
hoodie.filesystem.view.type=EMBEDDED_KV_STORE
hoodie.compact.inline=false
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=after.inc_id
hoodie.datasource.write.partitionpath.field=date
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
# Schema provider props (change to absolute path based on your installation)
#hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
#hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=airflow.public.motor_crash_violation_incidents
#Kafka props
bootstrap.servers=http://xxxxx:29092
auto.offset.reset=earliest
hoodie.deltastreamer.schemaprovider.registry.url=http://xxxxx:8081/subjects/airflow.public.motor_crash_violation_incidents-value/versions/latest
#hoodie.deltastreamer.schemaprovider.registry.targetUrl=http://xxxxx:8081/subjects/airflow.public.motor_crash_violation_incidents-value/versions/latest
schema.registry.url=http://xxxxx:8081
validate.non.null = false

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 50 (21 by maintainers)

Top GitHub Comments

2 reactions
bvaradar commented, Oct 6, 2020

@ashishmgofficial : You need to plug in a transformer class to select only the columns you need, and a record payload to handle deletions. We are currently in the process of adding the transformer to OSS Hudi, but broadly here is how it will look (thanks to @joshk-kang).

Gist:

package org.apache.hudi.utilities.transform;

import org.apache.hudi.common.config.TypedProperties;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Flattens Debezium change events into the row image plus "_op" and "_ts_ms" metadata columns.
public class DebeziumTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset,
      TypedProperties properties) {

    // Inserts and updates: the new row image lives in "after".
    // Filter on "op" before it is renamed so the column reference stays unambiguous.
    Dataset<Row> insertedOrUpdatedData = rowDataset
        .filter(rowDataset.col("op").notEqual("d"))
        .select("op", "ts_ms", "after.*")
        .withColumnRenamed("op", "_op")
        .withColumnRenamed("ts_ms", "_ts_ms");

    // Deletes: "after" is null, so take the last known row image from "before".
    Dataset<Row> deletedData = rowDataset
        .filter(rowDataset.col("op").equalTo("d"))
        .select("op", "ts_ms", "before.*")
        .withColumnRenamed("op", "_op")
        .withColumnRenamed("ts_ms", "_ts_ms");

    // union matches columns by position, so "before" and "after" must share the same schema.
    return insertedOrUpdatedData.union(deletedData);
  }
}
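
To make the effect of the transformer concrete, here is a minimal plain-Java sketch (no Spark or Hudi involved) of the same flattening applied to a single change event. The class name `DebeziumEnvelopeSketch`, the `flatten` helper, and the sample values are hypothetical and exist only for illustration; the field names `before`, `after`, `op`, and `ts_ms` come from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration (hypothetical, no Spark) of flattening one Debezium event. */
public class DebeziumEnvelopeSketch {

  /**
   * Returns a single flat row: "_op" and "_ts_ms" metadata followed by the
   * columns of "after", or of "before" when the event is a delete.
   */
  @SuppressWarnings("unchecked")
  public static Map<String, Object> flatten(Map<String, Object> event) {
    String op = (String) event.get("op");
    // Deletes carry the row image in "before"; every other op uses "after".
    String imageField = "d".equals(op) ? "before" : "after";
    Map<String, Object> image = (Map<String, Object>) event.get(imageField);

    Map<String, Object> row = new LinkedHashMap<>();
    row.put("_op", op);
    row.put("_ts_ms", event.get("ts_ms"));
    if (image != null) {
      row.putAll(image);
    }
    return row;
  }

  public static void main(String[] args) {
    // Sample values are made up for the illustration.
    Map<String, Object> after = new LinkedHashMap<>();
    after.put("inc_id", 7);
    after.put("date", "2020-10-06");

    Map<String, Object> event = new LinkedHashMap<>();
    event.put("before", null);
    event.put("after", after);
    event.put("op", "c");
    event.put("ts_ms", 1601942400000L);

    System.out.println(flatten(event));
  }
}
```

After flattening, the row no longer has an `after` struct, so a record key configured as `after.inc_id` would presumably need to become the top-level `inc_id`. Setting the record key field on its own only chooses the key; it does not project columns, which is consistent with the whole envelope being ingested in the original setup.
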
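// DebeziumAvroPayload.java (separate file)
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

public class DebeziumAvroPayload extends OverwriteWithLatestAvroPayload {

  // Field is prefixed with an underscore by the transformer to indicate a metadata field
  public static final String OP_FIELD = "_op";
  public static final String DELETE_OP = "d";

  public DebeziumAvroPayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  public DebeziumAvroPayload(Option<GenericRecord> record) {
    // Guard against an empty Option instead of calling get() unconditionally.
    this(record.isPresent() ? record.get() : null, (record1) -> 0); // natural order
  }

  // A record is treated as a delete when the transformer-provided "_op" field equals "d".
  @Override
  protected boolean isDeleteRecord(GenericRecord genericRecord) {
    return genericRecord.get(OP_FIELD) != null && genericRecord.get(OP_FIELD).toString().equalsIgnoreCase(
        DELETE_OP);
  }
}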
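
As a rough illustration of what the payload's delete check achieves, the following plain-Java sketch (no Hudi involved) applies flattened rows to an in-memory table and removes a key when `_op` is `d`. The class `DeleteAwareMergeSketch`, its `apply` and `row` helpers, and the sample rows are hypothetical. It assumes rows arrive already ordered, so the precombine step that picks the latest record per key is not modeled.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration (hypothetical, no Hudi) of upsert-or-delete driven by "_op". */
public class DeleteAwareMergeSketch {

  static final String OP_FIELD = "_op";
  static final String DELETE_OP = "d";

  /** Same condition as the payload's delete check: "_op" is present and equals "d". */
  public static boolean isDelete(Map<String, Object> row) {
    Object op = row.get(OP_FIELD);
    return op != null && op.toString().equalsIgnoreCase(DELETE_OP);
  }

  /** Applies one flattened row to the table: deletes remove the key, anything else overwrites. */
  public static void apply(Map<Object, Map<String, Object>> table, String keyField,
      Map<String, Object> row) {
    Object key = row.get(keyField);
    if (isDelete(row)) {
      table.remove(key);
    } else {
      table.put(key, row);
    }
  }

  /** Builds a flattened row with made-up values for the illustration. */
  public static Map<String, Object> row(int id, String op, long tsMs) {
    Map<String, Object> r = new LinkedHashMap<>();
    r.put(OP_FIELD, op);
    r.put("_ts_ms", tsMs);
    r.put("inc_id", id);
    return r;
  }

  public static void main(String[] args) {
    Map<Object, Map<String, Object>> table = new HashMap<>();
    apply(table, "inc_id", row(1, "c", 100L)); // insert
    apply(table, "inc_id", row(1, "u", 200L)); // update overwrites the earlier row
    apply(table, "inc_id", row(2, "c", 300L)); // second key
    apply(table, "inc_id", row(1, "d", 400L)); // delete removes key 1
    System.out.println(table.keySet());
  }
}
```

In Hudi itself the equivalent decision is made inside the payload class during merge, so the transformer has to keep the `_op` column in the output for the check to have something to read.
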
1 reaction
toninis commented, Feb 25, 2021

@vinothchandar Sorry I took so long to respond. It worked and compiled successfully. I had probably missed something at the time.

Thanks for your response.
