
Flink Iceberg Usage

See original GitHub issue

We use Avro schemas as our unified ETL schema management solution. When I try to write data into Iceberg using Flink, I find there are so many terms in Flink that represent data types, such as TypeInformation, LogicalType, RowType, TableSchema, DataType … I can’t figure out the relationship between them or how to convert from one to another.

Specifically, my question is: how can I write a DataStream<GenericRecord> to an Iceberg table using the Flink Iceberg API? I think the Avro Schema should have enough information to describe the record schema.

Should I use the APIs below? If so, how can I adapt them to a DataStream<GenericRecord>?

public static <T> Builder builderFor(DataStream<T> input,
                                     MapFunction<T, RowData> mapper,
                                     TypeInformation<RowData> outputType)

public static Builder forRow(DataStream<Row> input, TableSchema tableSchema)

PS: It’s GenericRecord in Avro, not Iceberg.
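
For reference, a rough sketch of how these type representations relate and convert, assuming Iceberg’s Flink module is on the classpath; AvroSchemaUtil and FlinkSchemaUtil are the helpers used in the answer below, while the TypeInformation step is a version-dependent assumption:

// org.apache.avro.Schema      - Avro's schema (the ETL-side definition)
// org.apache.iceberg.Schema   - Iceberg's table schema
// RowType (a LogicalType)     - Flink SQL's internal description of a row
// TableSchema / DataType      - Flink Table API's user-facing schema
// TypeInformation<RowData>    - DataStream API serialization descriptor
//
// Iceberg shades Avro, so the Avro schema is re-parsed with the shaded parser first (see below).
Schema icebergSchema = AvroSchemaUtil.toIceberg(shadedAvroSchema);   // Avro -> Iceberg
RowType rowType = FlinkSchemaUtil.convert(icebergSchema);            // Iceberg -> Flink RowType
TableSchema tableSchema = FlinkSchemaUtil.toSchema(rowType);         // RowType -> TableSchema
// A TypeInformation<RowData> for builderFor(...) can be derived from the RowType,
// e.g. InternalTypeInfo.of(rowType) on newer Flink versions (assumption, version-dependent).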

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
pan3793 commented, Dec 7, 2020

I think I found the solution:

@Override
protected void output(DataStream<GenericRecord> outputStream, org.apache.avro.Schema avroSchema) {
    // Convert each Avro GenericRecord into a Flink Row by copying fields positionally.
    DataStream<Row> rowDataStream = outputStream.map(genericRecord -> {
        int columnNum = genericRecord.getSchema().getFields().size();
        Object[] rowData = new Object[columnNum];
        for (int i = 0; i < columnNum; i++) {
            rowData[i] = genericRecord.get(i);
        }
        return Row.of(rowData);
    });
    // Iceberg shades Avro, so re-parse the schema with the shaded parser before converting.
    org.apache.iceberg.shaded.org.apache.avro.Schema shadeAvroSchema =
            new org.apache.iceberg.shaded.org.apache.avro.Schema.Parser().parse(avroSchema.toString());
    // Avro schema -> Iceberg schema -> Flink RowType -> Flink TableSchema.
    Schema icebergSchema = AvroSchemaUtil.toIceberg(shadeAvroSchema);
    RowType rowType = FlinkSchemaUtil.convert(icebergSchema);
    TableSchema tableSchema = FlinkSchemaUtil.toSchema(rowType);
    // Build the Iceberg sink from the Row stream and the derived table schema.
    FlinkSink.forRow(rowDataStream, tableSchema)
            .table(table)
            .tableLoader(tableLoader)
            .tableSchema(tableSchema)
            .writeParallelism(parallelism)
            .build();
}

0 reactions
pan3793 commented, Dec 9, 2020

@openinx Thanks for following up on this issue. I haven’t tested nested fields yet, but I found another issue with logical types. Our ETL is built on CDH 6.3.1 with Avro 1.8.2, which cannot generate Java 8 time types because of AVRO-2079, and without a converter Flink can’t handle Joda time properly.

I think we may need a converter, similar to Flink’s DataFormatConverters.RowConverter, to convert an Avro GenericRecord to RowData.

We really need that converter. All of our ETL jobs’ input and output data structures are represented by Avro GenericRecord, because Avro lets us define the schema in JSON (.avsc files) and provides avro-maven-plugin to generate entity classes automatically.
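
A minimal sketch of what such a converter could look like, assuming flat records with only primitive and string fields; the class name is hypothetical, not an existing Flink or Iceberg API, and nested fields and Avro logical types (like the Joda-time values above) would still need dedicated handling:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;

// Hypothetical converter: flat Avro GenericRecord -> Flink internal RowData.
public class GenericRecordToRowData implements MapFunction<GenericRecord, RowData> {
    @Override
    public RowData map(GenericRecord genericRecord) {
        int arity = genericRecord.getSchema().getFields().size();
        GenericRowData row = new GenericRowData(arity);
        for (int i = 0; i < arity; i++) {
            Object value = genericRecord.get(i);
            if (value instanceof CharSequence) {
                // Avro strings usually arrive as Utf8; string columns in RowData expect StringData.
                row.setField(i, StringData.fromString(value.toString()));
            } else {
                // Assumes the remaining fields are primitives that RowData stores as-is.
                row.setField(i, value);
            }
        }
        return row;
    }
}

Such a mapper could then be plugged into FlinkSink.builderFor(stream, mapper, outputType), with the TypeInformation<RowData> derived from the table’s RowType (version-dependent, e.g. InternalTypeInfo.of(rowType) on recent Flink releases).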

PS: We only use Avro to manage schemas, but we store the data in ORC, and we are trying to migrate the storage format to Iceberg. If Iceberg provided schema management tools like Avro’s, maybe we could also manage our schemas with Iceberg.

Read more comments on GitHub

Top Results From Across the Web

Enabling Iceberg in Flink
Iceberg provides API to rewrite small files into large files by submitting flink batch job. The behavior of this flink action is the...

Flink + Iceberg: How to Construct a Whole-scenario Real-time ...
Flink real-time tasks often run in clusters on a long-term basis. Usually, the Iceberg commit is set to perform a commit operation every...

Flink via Iceberg - Project Nessie
Detailed steps on how to set up Pyspark + Iceberg + Flink + Nessie with Python is available on Binder. In order to...

how to consume historical iceberg data with flink? · Issue #3905
The incremental consumption of Flink can meet your requirements. The consumption of overwrite snapshot is not supported now. Detailed guidance ...

Real-Time Data Lake Based on Apache Flink ... - Alibaba Cloud
The following briefly explains the design principle of the Flink Iceberg Sink. Iceberg uses the optimistic lock method to commit a transaction.
