Ingestion of Parquet with `int96` and `decimal` data types fails in 0.8.0-SNAPSHOT
According to #6525, which is part of release-0.7.1 based on this comparison: https://github.com/apache/incubator-pinot/compare/release-0.6.0...release-0.7.1, ingestion of Parquet files that include `int96` and `decimal` data types should have been addressed in newer versions. Nevertheless, I am still getting an exception:
```
java.lang.IllegalArgumentException: INT96 not implemented and is deprecated
```
How I deployed
I used the Helm charts from a fresh checkout at hash 7233e2c66, around 2021-07-11. The chart pulls the latest Docker image, which should be `0.8.0-SNAPSHOT`. The `/version` endpoint does not return any info, but the controller's folder `/opt/pinot/lib/` contains `pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar`, so it must be `0.8.0-SNAPSHOT`.
Reproduced with test data
I found the following test resource, which I used to create a standalone batch ingestion job:

`pinot-plugins/pinot-input-format/pinot-parquet/src/test/resources/test-file-with-int96-and-decimal.snappy.parquet`

The schema of that test data is as follows:
```
message spark_schema {
  optional int96 ts;
  optional binary source (STRING);
  optional binary table_name (STRING);
  optional float coins;
  optional int32 seniority;
  optional int32 llrecency;
  optional int32 trstierid;
  optional int32 gamelevelid;
  optional int32 ltdrecency;
  optional fixed_len_byte_array(11) playerimportance (DECIMAL(24,4));
  optional binary platformtypename (STRING);
  optional fixed_len_byte_array(12) extreme_wager_14d (DECIMAL(28,2));
  optional int32 is_vip;
  optional int32 newdeposithabitgroupid;
  optional int32 deposithabitgroupid;
  optional int32 activedayssinceinstall;
  optional int32 is_depo;
}
```
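For reference, a schema dump like the one above can be produced with a few lines of parquet-mr. This is a minimal sketch, assuming the `parquet-hadoop` dependency is on the classpath; the class name `PrintParquetSchema` is made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintParquetSchema {
  public static void main(String[] args) throws Exception {
    // Only the file footer is read; it carries the full message schema.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      System.out.println(schema);
    }
  }
}
```

Run it with the Parquet file path as the only argument.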
The configuration of the ingestion is described by the following files:
- table schema,
- offline table configuration,
- ingestion job specification,
- ingestion job properties
`test_with_int96_and_decimal_schema.json`:
```json
{
  "schemaName": "test_with_int96_and_decimal",
  "metricFieldSpecs": [
    {"name": "coins", "dataType": "FLOAT"},
    {"name": "seniority", "dataType": "INT"},
    {"name": "llrecency", "dataType": "INT"},
    {"name": "ltdrecency", "dataType": "INT"},
    {"name": "playerimportance", "dataType": "DOUBLE"},
    {"name": "extreme_wager_14d", "dataType": "DOUBLE"},
    {"name": "activedayssinceinstall", "dataType": "INT"}
  ],
  "dimensionFieldSpecs": [
    {"name": "trstierid", "dataType": "INT"},
    {"name": "gamelevelid", "dataType": "INT"},
    {"name": "newdeposithabitgroupid", "dataType": "INT"},
    {"name": "deposithabitgroupid", "dataType": "INT"},
    {"name": "source", "dataType": "STRING"},
    {"name": "table_name", "dataType": "STRING"},
    {"name": "platformtypename", "dataType": "STRING"},
    {"name": "is_vip", "dataType": "INT"},
    {"name": "is_depo", "dataType": "INT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MINUTES:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss",
      "granularity": "1:MINUTES"
    }
  ]
}
```
`test_with_int96_and_decimal_offline_table_config.json`:
```json
{
  "tableName": "test_with_int96_and_decimal",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "test_with_int96_and_decimal",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "HEAP",
    "invertedIndexColumns": [
      "trstierid",
      "gamelevelid"
    ]
  },
  "metadata": {
    "customConfigs": {}
  }
}
```
`ingestionJobSpec.yaml`:

```yaml
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
  # name: execution framework name
  name: 'standalone'
  # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
# jobType: Pinot ingestion job type.
# Supported job types are:
#   'SegmentCreation'
#   'SegmentTarPush'
#   'SegmentUriPush'
#   'SegmentCreationAndTarPush'
#   'SegmentCreationAndUriPush'
jobType: SegmentCreationAndTarPush
# inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
inputDirURI: 'examples/batch/test_with_int96_and_decimal/rawdata'
# includeFileNamePattern: include file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will include all the avro files under inputDirURI recursively.
includeFileNamePattern: 'glob:**/*.parquet'
# excludeFileNamePattern: exclude file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will exclude all the avro files under inputDirURI recursively.
# _excludeFileNamePattern: ''
# outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
outputDirURI: 'examples/batch/test_with_int96_and_decimal/segments'
# overwriteOutput: Overwrite output segments if existed.
overwriteOutput: true
# pinotFSSpecs: defines all related Pinot file systems.
pinotFSSpecs:
  - # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: file
    # className: Class name used to create the PinotFS instance.
    # E.g.
    #   org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
    #   org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
    #   org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
# recordReaderSpec: defines all record reader
recordReaderSpec:
  # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
  dataFormat: 'parquet'
  # className: Corresponding RecordReader class name.
  # E.g.
  #   org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
  #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  #   org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
  #   org.apache.pinot.plugin.inputformat.json.JSONRecordReader
  #   org.apache.pinot.plugin.inputformat.orc.ORCRecordReader
  #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
  # configClassName: Corresponding RecordReaderConfig class name, it's mandatory for CSV and Thrift file format.
  # E.g.
  #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
  #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig
  # configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  # configs: Used to init RecordReaderConfig class name, this config is required for CSV and Thrift data format.
  configs:
# tableSpec: defines table name and where to fetch corresponding table config and table schema.
tableSpec:
  # tableName: Table name
  tableName: 'test_with_int96_and_decimal'
  # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
  # E.g.
  #   hdfs://path/to/table_schema.json
  #   http://localhost:9000/tables/myTable/schema
  schemaURI: 'http://localhost:9000/tables/test_with_int96_and_decimal/schema'
  # tableConfigURI: defines where to read the table config.
  # Supports using PinotFS or HTTP.
  # E.g.
  #   hdfs://path/to/table_config.json
  #   http://localhost:9000/tables/myTable
  # Note that the API to read Pinot table config directly from pinot controller contains a JSON wrapper.
  # The real table config is the object under the field 'OFFLINE'.
  tableConfigURI: 'http://localhost:9000/tables/test_with_int96_and_decimal'
# pinotClusterSpecs: defines the Pinot Cluster Access Point.
pinotClusterSpecs:
  - # controllerURI: used to fetch table/schema information and data push.
    # E.g. http://localhost:9000
    controllerURI: 'http://localhost:9000'
# pushJobSpec: defines segment push job related configuration.
pushJobSpec:
  # pushAttempts: number of attempts for push job, default is 1, which means no retry.
  pushAttempts: 2
  # pushRetryIntervalMillis: retry wait Ms, default to 1 second.
  pushRetryIntervalMillis: 1000
```
`ingestionJob.properties`:

```properties
job-spec-format=yaml
```
I created the table with:
```shell
/opt/pinot/bin/pinot-admin.sh \
  AddTable \
  -schemaFile /opt/pinot/examples/batch/test_with_int96_and_decimal/test_with_int96_and_decimal_schema.json \
  -tableConfigFile /opt/pinot/examples/batch/test_with_int96_and_decimal/test_with_int96_and_decimal_offline_table_config.json \
  -controllerHost pinot-controller \
  -controllerPort 9000 \
  -exec
```
and then started the ingestion with:
```shell
/opt/pinot/bin/pinot-admin.sh \
  LaunchDataIngestionJob \
  -jobSpecFile /opt/pinot/examples/batch/test_with_int96_and_decimal/ingestionJobSpec.yaml \
  -propertyFile /opt/pinot/examples/batch/test_with_int96_and_decimal/ingestionJob.properties
```
Am I doing something wrong, or is batch ingestion of Parquet with `int96` and `decimal` data types not working?
Top GitHub Comments
Updated the docs here: https://docs.pinot.apache.org/basics/data-import/pinot-input-formats#parquet

Since Parquet uses `int96` for nanosecond-precision timestamps, Pinot converts it to `int64` milliseconds. Please refer to: https://github.com/apache/incubator-pinot/blob/master/pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java#L169
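For context, an `int96` Parquet timestamp is 12 bytes: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number. A minimal sketch of the millisecond conversion, following the common parquet-mr/Spark convention (the class name `Int96ToMillis` is hypothetical; the exact Pinot implementation is in the `ParquetNativeRecordExtractor` linked above):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96ToMillis {
  // Julian day number of the Unix epoch, 1970-01-01.
  private static final long JULIAN_DAY_OF_EPOCH = 2440588L;
  private static final long MILLIS_PER_DAY = 86_400_000L;
  private static final long NANOS_PER_MILLI = 1_000_000L;

  /** Converts a 12-byte Parquet int96 timestamp to epoch milliseconds. */
  public static long toEpochMillis(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong(); // first 8 bytes: nanoseconds within the day
    long julianDay = buf.getInt();   // last 4 bytes: Julian day number
    return (julianDay - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY + nanosOfDay / NANOS_PER_MILLI;
  }
}
```

Given that behavior, the `ts` column lands in Pinot as epoch milliseconds, so a `dateTimeFieldSpec` format along the lines of `1:MILLISECONDS:EPOCH` would presumably be needed instead of the `SIMPLE_DATE_FORMAT` spec used in the schema above.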