
Flink Iceberg Usage

See original GitHub issue

We use Avro schemas as our unified ETL schema management solution. When I try to write data into Iceberg using Flink, I find there are so many terms in Flink that represent data types, such as TypeInformation, LogicalType, RowType, TableSchema, DataType … I can’t figure out the relationship between them or how to convert from one to another.

Specifically, my question is: how can I write a DataStream<GenericRecord> to an Iceberg table using the Flink Iceberg API? I think the Avro Schema should have enough information to describe the record schema.

Should I use the APIs below? If so, how can I adapt them to a DataStream<GenericRecord>?

public static <T> Builder builderFor(DataStream<T> input,
                                     MapFunction<T, RowData> mapper,
                                     TypeInformation<RowData> outputType)

public static Builder forRow(DataStream<Row> input, TableSchema tableSchema)

PS: It’s GenericRecord in Avro, not Iceberg.
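
For reference, a rough sketch of how these type representations relate and convert, assuming Iceberg’s Flink module is on the classpath; AvroSchemaUtil and FlinkSchemaUtil are the helpers used in the answer below, while the TypeInformation step is a version-dependent assumption:

// org.apache.avro.Schema      - Avro's schema (the ETL-side definition)
// org.apache.iceberg.Schema   - Iceberg's table schema
// RowType (a LogicalType)     - Flink SQL's internal description of a row
// TableSchema / DataType      - Flink Table API's user-facing schema
// TypeInformation<RowData>    - DataStream API serialization descriptor
//
// Iceberg shades Avro, so the Avro schema is re-parsed with the shaded parser first (see below).
Schema icebergSchema = AvroSchemaUtil.toIceberg(shadedAvroSchema);   // Avro -> Iceberg
RowType rowType = FlinkSchemaUtil.convert(icebergSchema);            // Iceberg -> Flink RowType
TableSchema tableSchema = FlinkSchemaUtil.toSchema(rowType);         // RowType -> TableSchema
// A TypeInformation<RowData> for builderFor(...) can be derived from the RowType,
// e.g. InternalTypeInfo.of(rowType) on newer Flink versions (assumption, version-dependent).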

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
pan3793 commented, Dec 7, 2020

I think I found the solution:

@Override
protected void output(DataStream<GenericRecord> outputStream, org.apache.avro.Schema avroSchema) {
    // Convert each Avro GenericRecord into a Flink Row by copying fields positionally.
    DataStream<Row> rowDataStream = outputStream.map(genericRecord -> {
        int columnNum = genericRecord.getSchema().getFields().size();
        Object[] rowData = new Object[columnNum];
        for (int i = 0; i < columnNum; i++) {
            rowData[i] = genericRecord.get(i);
        }
        return Row.of(rowData);
    });
    // Iceberg shades Avro, so re-parse the schema with the shaded parser before converting.
    org.apache.iceberg.shaded.org.apache.avro.Schema shadeAvroSchema =
            new org.apache.iceberg.shaded.org.apache.avro.Schema.Parser().parse(avroSchema.toString());
    // Avro schema -> Iceberg schema -> Flink RowType -> Flink TableSchema.
    Schema icebergSchema = AvroSchemaUtil.toIceberg(shadeAvroSchema);
    RowType rowType = FlinkSchemaUtil.convert(icebergSchema);
    TableSchema tableSchema = FlinkSchemaUtil.toSchema(rowType);
    // Build the Iceberg sink from the Row stream and the derived table schema.
    FlinkSink.forRow(rowDataStream, tableSchema)
            .table(table)
            .tableLoader(tableLoader)
            .tableSchema(tableSchema)
            .writeParallelism(parallelism)
            .build();
}

0 reactions
pan3793 commented, Dec 9, 2020

@openinx Thanks for following up on this issue. I haven’t tested nested fields yet, but I found another issue with logical types. Our ETL is built on CDH 6.3.1 with Avro 1.8.2, which cannot generate Java 8 time types because of AVRO-2079, and without a converter Flink can’t handle Joda time properly.

I think we may need a converter, similar to Flink’s DataFormatConverters.RowConverter, to convert an Avro GenericRecord to RowData.

We really need that converter. All of our ETL jobs’ input and output data structures are represented by Avro GenericRecord, because Avro lets us define the schema in JSON (.avsc files) and provides avro-maven-plugin to generate entity classes automatically.
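
A minimal sketch of what such a converter could look like, assuming flat records with only primitive and string fields; the class name is hypothetical, not an existing Flink or Iceberg API, and nested fields and Avro logical types (like the Joda-time values above) would still need dedicated handling:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;

// Hypothetical converter: flat Avro GenericRecord -> Flink internal RowData.
public class GenericRecordToRowData implements MapFunction<GenericRecord, RowData> {
    @Override
    public RowData map(GenericRecord genericRecord) {
        int arity = genericRecord.getSchema().getFields().size();
        GenericRowData row = new GenericRowData(arity);
        for (int i = 0; i < arity; i++) {
            Object value = genericRecord.get(i);
            if (value instanceof CharSequence) {
                // Avro strings usually arrive as Utf8; string columns in RowData expect StringData.
                row.setField(i, StringData.fromString(value.toString()));
            } else {
                // Assumes the remaining fields are primitives that RowData stores as-is.
                row.setField(i, value);
            }
        }
        return row;
    }
}

Such a mapper could then be plugged into FlinkSink.builderFor(stream, mapper, outputType), with the TypeInformation<RowData> derived from the table’s RowType (version-dependent, e.g. InternalTypeInfo.of(rowType) on recent Flink releases).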

PS: We only use Avro to manage schemas, but we store the data in ORC, and we are trying to migrate the storage format to Iceberg. If Iceberg provided schema management tools like Avro’s, maybe we could also manage our schemas with Iceberg.

Read more comments on GitHub

Top Results From Across the Web

Enabling Iceberg in Flink
Iceberg provides API to rewrite small files into large files by submitting flink batch job. The behavior of this flink action is the...

Flink + Iceberg: How to Construct a Whole-scenario Real-time ...
Flink real-time tasks often run in clusters on a long-term basis. Usually, the Iceberg commit is set to perform a commit operation every...

Flink via Iceberg - Project Nessie
Detailed steps on how to set up Pyspark + Iceberg + Flink + Nessie with Python is available on Binder. In order to...

how to consume historical iceberg data with flink? · Issue #3905
The incremental consumption of Flink can meet your requirements. The consumption of overwrite snapshot is not supported now. Detailed guidance ...

Real-Time Data Lake Based on Apache Flink ... - Alibaba Cloud
The following briefly explains the design principle of the Flink Iceberg Sink. Iceberg uses the optimistic lock method to commit a transaction.
