
AWS S3 Sink connector does not store Kafka message key


What version of the Stream Reactor are you reporting this issue for?

Latest master build

What is the expected behaviour?

The S3 sink connector should store the Kafka message along with its key in S3.

What was observed?

The Kafka message key is ignored and only the value is stored.

What is your Connect cluster configuration (connect-avro-distributed.properties)?

bootstrap.servers=kafka:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=/home/src/kafka/jars
group.id=kafka-connect
storage.topic=_connect-configs
offset.storage.topic=_connect-offsets
status.storage.topic=_connect-status

What is your connector properties configuration (my-connector.properties)?

name=S3SinkConnectorS3 # this can be anything
connector.class=io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnector
tasks.max=1
aws.auth.mode=Default
topics=test_yordan_kafka_connect
connect.s3.kcql="insert into `test:test-bucket` select * from test_yordan_kafka_connect STOREAS `JSON` WITH_FLUSH_COUNT = 5000"
connect.s3.aws.client=AWS
connect.s3.aws.region=eu-central-1
timezone=UTC
errors.log.enable=true

I did some debugging of the codebase and suspect the issue is that the key is being ignored here: https://github.com/lensesio/stream-reactor/blob/master/kafka-connect-aws-s3/src/main/scala/io/lenses/streamreactor/connect/aws/s3/formats/JsonFormatWriter.scala#L41
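
Not the connector’s code, but to make the expectation concrete: what the report asks for amounts to each line written to S3 carrying the record key next to the value, e.g. as a small JSON envelope. A minimal illustrative sketch using Jackson (the object and method names here are invented for illustration only):

import com.fasterxml.jackson.databind.ObjectMapper

// Illustration only - not part of stream-reactor. Renders one record as a JSON line
// that keeps both the Kafka key and the value, instead of the value alone.
object KeyedJsonLine {
  private val mapper = new ObjectMapper()

  def render(key: AnyRef, value: AnyRef): String = {
    val envelope = mapper.createObjectNode()
    envelope.putPOJO("key", key)     // the part the sink currently drops
    envelope.putPOJO("value", value) // what the sink already stores today
    mapper.writeValueAsString(envelope)
  }
}

// e.g. render(java.util.Map.of("id", "42"), java.util.Map.of("field", "data"))
// returns {"key":{"id":"42"},"value":{"field":"data"}}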

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
iosifnicolae2 commented, Sep 1, 2022

We’ve solved this problem by following these steps:

  1. Implement JsonFormatStreamReader - https://github.com/iosifnicolae2/stream-reactor
  2. Implement Archive & UnArchive transformers - https://github.com/iosifnicolae2/kafka-connect-transform-archive
    • you can use this Dockerfile to build the plugin and include it in your Kafka Connect image
    • a rough sketch of what the Archive transform does is shown after these steps
  3. Configure your Sink connector using:
     "connect.s3.kcql" : "insert into <bucket_name>:<path> select * from <topic> STOREAS `JSON` WITH_FLUSH_INTERVAL=60 WITH_FLUSH_COUNT=1000"
     "transforms" : "archiveRowForS3",
     "transforms.archiveRowForS3.type" : "com.github.jcustenborder.kafka.connect.archive.Archive",
  4. Configure your Source connector using:
     "connect.s3.kcql" : "insert into <topic> select * from <bucket_name>:<path> BATCH = 5 STOREAS `JSON` LIMIT 1000"
     "transforms" : "unArchiveRowFromS3",
     "transforms.unArchiveRowFromS3.type" : "com.github.jcustenborder.kafka.connect.archive.UnArchive",
0 reactions
iosifnicolae2 commented, Aug 31, 2022

Also, I’ve tried idea 1.ii by adding

"transforms" : "archiveRowForS3",
"transforms.archiveRowForS3.type" : "com.github.jcustenborder.kafka.connect.archive.Archive",

but I’m getting:

java.lang.UnsupportedOperationException: empty.maxBy
	at scala.collection.TraversableOnce.maxBy(TraversableOnce.scala:282)
	at scala.collection.TraversableOnce.maxBy$(TraversableOnce.scala:280)
	at scala.collection.AbstractTraversable.maxBy(Traversable.scala:108)
	at io.lenses.streamreactor.connect.aws.s3.sink.S3WriterManager.io$lenses$streamreactor$connect$aws$s3$sink$S3WriterManager$writerForTopicPartitionWithMaxOffset(S3WriterManager.scala:68)
	at io.lenses.streamreactor.connect.aws.s3.sink.S3WriterManager$anonfun$preCommit$1.applyOrElse(S3WriterManager.scala:244)
	at io.lenses.streamreactor.connect.aws.s3.sink.S3WriterManager$anonfun$preCommit$1.applyOrElse(S3WriterManager.scala:242)
	at scala.PartialFunction.$anonfun$runWith$1$adapted(PartialFunction.scala:145)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:400)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:728)
	at scala.collection.TraversableLike.collect(TraversableLike.scala:407)
	at scala.collection.TraversableLike.collect$(TraversableLike.scala:405)
	at scala.collection.AbstractTraversable.collect(Traversable.scala:108)
	at io.lenses.streamreactor.connect.aws.s3.sink.S3WriterManager.preCommit(S3WriterManager.scala:242)
	at io.lenses.streamreactor.connect.aws.s3.sink.S3SinkTask.preCommit(S3SinkTask.scala:181)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:387)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:639)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:202)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:186)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:241)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2022-08-31 18:32:35,891 ERROR WorkerSinkTask{id=main-kafka-raw-packets-to-s3-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask) [task-thread-main-kafka-raw-packets-to-s3-0]

EDIT: Building and using the master branch resolves the above error - https://github.com/lensesio/stream-reactor/issues/865#issuecomment-1165647864.


Top Results From Across the Web

Amazon S3 Sink Connector for Confluent Platform
The Amazon S3 Sink connector exports data from Apache Kafka® topics to S3 objects in either Avro, JSON, or Bytes formats. Depending on...

Kafka connect S3 source connector ignores keys
I ignore how to make sure that .keys.json is taken into account to construct the Kafka keys when reading the data from S3...

Connect Kafka to S3: 6 Easy Steps - Hevo Data
This blog teaches you how to set up a Kafka to S3 integration. It provides a step-by-step guide to help you connect them...

Kafka to AWS S3 | S3 open source Kafka connector
A Kafka Connect sink connector for writing records from Kafka to AWS S3 ... Where the Kafka message key is not a primitive...

Amazon S3 sink connector
This example shows how to set up the Confluent Amazon S3 sink connector for ... DefaultPartitioner", "storage.class": "io.confluent.connect.s3.storage.
