Migrating parquet table to hudi issue [SUPPORT]
Describe the problem you faced
I have questions regarding the initial loading of Hudi tables (migrating from parquet to a Hudi table via bulk-insert), because we have encountered significantly high loading times. First, let me add the details for both tables we were trying to load: Spark conf, Hudi conf, and further modifications.
Sample of attempts:
Table 1: 6.7GB parquet, 180M records, 16 columns, key is a composite of 2 columns. Spark conf: 1 executor, 12 cores, 16GB, 32 shuffle partitions, 32 bulk-insert parallelism. Loading time: 25 min.
Table 2: 21GB parquet, 600M records, 16 columns, key is a composite of 2 columns. Spark conf: 4 executors, 8 cores, 32GB, 128 shuffle partitions, 128 bulk-insert parallelism. Loading time: 47 min.
Both tables are read from and written to the local file system.
To Reproduce
Code sample used:
import cluster.SparkConf
import common.DataConfig._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SparkSession

object HudiFilewriter {

  val COW = "COW"
  val MOR = "MOR"

  def main(args: Array[String]): Unit = {
    // Positional args: table name, Hudi base path, table type (COW/MOR), source parquet path, partition column
    val tableName = args(0)
    val basePath = args(1)
    val tableType = if (COW.equalsIgnoreCase(args(2))) COW_TABLE_TYPE_OPT_VAL else MOR_TABLE_TYPE_OPT_VAL
    val rawTablePath = args(3)
    val partitionCol = args(4)

    val spark = SparkSession.builder().getOrCreate()
    val logLevel = spark.sparkContext.getConf.get(SparkConf.LOG_LEVEL)
    spark.sparkContext.setLogLevel(logLevel)
    val shuffle = spark.sparkContext.getConf.get(SparkConf.SHUFFLE_PARTITIONS)

    val hudiOptions = Map[String, String](
      // HoodieWriteConfig
      TABLE_NAME -> tableName,
      "hoodie.bulkinsert.shuffle.parallelism" -> shuffle,
      // DataSourceWriteOptions
      TABLE_TYPE_OPT_KEY -> tableType,
      PRECOMBINE_FIELD_OPT_KEY -> UPDATE_COL,
      KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.ComplexKeyGenerator",
      RECORDKEY_FIELD_OPT_KEY -> KEY_COLS.mkString(","),
      PARTITIONPATH_FIELD_OPT_KEY -> partitionCol,
      OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL
    )

    // Time the read + bulk_insert write end to end
    spark.time {
      val df = spark.read.parquet(rawTablePath)
      df.write.format("org.apache.hudi")
        .options(hudiOptions)
        .mode(Overwrite)
        .save(basePath)
    }
  }
}
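For reference, a sketch of how this job might be invoked; the table name, paths, and partition column below are placeholders, and the Spark conf keys read through cluster.SparkConf (log level, shuffle partitions) are project-specific, so they would have to be supplied via --conf under whatever names that class defines.

// Illustrative invocation of the job above (argument order: table name, Hudi base path,
// table type, source parquet path, partition column). All values are placeholders.
HudiFilewriter.main(Array(
  "lineitem",                // tableName
  "/data/hudi/lineitem",     // basePath: where the Hudi table is written
  "COW",                     // tableType: COW or MOR
  "/data/parquet/lineitem",  // rawTablePath: source parquet table
  "l_shipmode"               // partitionCol
))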
Expected behavior
Similar performance to vanilla parquet writing with additional sort overhead.
Environment Description
- Hudi version : 0.5.2
- Spark version : 2.4.5
- Hive version : NA
- Hadoop version : NA
- Storage (HDFS/S3/GCS…) : Local file system
- Running on Docker? (yes/no) : no
Additional context
Attempts:
- We tried multiple different Spark configurations: increasing the shuffle and bulk-insert parallelism, increasing the number of executors while maintaining the base resources, and increasing the memory threshold of the driver/executors.
- Hudi table types (MOR partitioned and non-partitioned, COW partitioned and non-partitioned); for partitioned tables we provided a partitioned version of the base table along with the partition column(s).
- Hudi and Spark versions: "hudi-spark-bundle" % "0.5.1-incubating", Spark 2.4.3, "spark-avro" % "2.4.3".
- Upgraded Hudi and Spark versions: "hudi-spark-bundle" % "0.5.2-incubating", Spark 2.4.5, "spark-avro" % "2.4.5".
- Base data preparation: sorted by keys or partitioned.
- Loading the data partition by partition: filter the base table on the partition column and bulk-insert each resulting dataframe, so each partition individually uses the whole application's resources during the write, using a new application for each partition (see the sketch after this list).
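To make that last attempt concrete, here is a minimal sketch of the partition-by-partition bulk insert, reusing spark, rawTablePath, partitionCol, hudiOptions, and basePath from the code sample above; for brevity it loops inside a single application (the actual attempt launched a new application per partition) and assumes the partition column is a string.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Bulk-insert one partition value at a time: the first slice overwrites (creates the table),
// subsequent slices append to it.
val partitionValues = spark.read.parquet(rawTablePath)
  .select(partitionCol).distinct().collect().map(_.getString(0))

partitionValues.zipWithIndex.foreach { case (value, i) =>
  val slice = spark.read.parquet(rawTablePath).filter(col(partitionCol) === value)
  slice.write.format("org.apache.hudi")
    .options(hudiOptions)
    .mode(if (i == 0) SaveMode.Overwrite else SaveMode.Append)
    .save(basePath)
}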
All the above attempts didn’t improve the loading time much, or made it worse. So I would like to know:
- Is this the normal time for the initial loading of Hudi tables, or are we doing something wrong?
- Do we need a better cluster/more resources to be able to load the data for the first time? It is mentioned on the Hudi confluence page that COW bulk-insert should match vanilla parquet writing plus the sort (see the baseline sketch after these questions).
- Does partitioning improve the upsert and/or compaction time for Hudi tables, or does it only improve analytical queries (partition pruning)?
- We have noticed that most of the time is spent in the data indexing (the bulk-insert logic itself) and not in the sorting stages before it, so how can we improve that? Should we provide our own indexing logic?
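For the second question, one way to quantify the "vanilla parquet writing + sort" baseline is to time a plain sorted parquet write of the same dataframe and compare it with the bulk_insert timing; a minimal sketch, reusing spark, rawTablePath, and KEY_COLS from the code sample above (assuming KEY_COLS is a sequence of column names, as its use with mkString(",") suggests) and an illustrative output path:

import org.apache.spark.sql.functions.col

// Baseline: plain parquet write of the same data, globally sorted by the record key columns.
val df = spark.read.parquet(rawTablePath)
spark.time {
  df.sort(KEY_COLS.map(col): _*)
    .write
    .mode("overwrite")
    .parquet("/tmp/parquet_sort_baseline")
}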
@vinothchandar I apologize for the delayed response, and thanks again for your help and detailed answers.
Sure, this is the public repo for generating the data: https://github.com/gregrahn/tpch-kit. It provides the information you need for data generation, sizes, etc.
You can use this command to generate lineitem at scale 10GB:
DSS_PATH=/output/path ./dbgen -T L 10
Adding more details and updating the schema screenshot mentioned in the previous comment:
RECORDKEY_FIELD_OPT_KEY: composite (l_linenumber, l_orderkey)
PARTITIONPATH_FIELD_OPT_KEY: optional, default (non-partitioned), or l_shipmode
PRECOMBINE_FIELD_OPT_KEY: l_commitdate, or a newly generated timestamp column last_updated
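To make these choices concrete, a sketch of the corresponding write options for lineitem, spelling out the string values behind the DataSourceWriteOptions constants used in the code sample above (worth double-checking against the 0.5.2 docs):

// Hudi write options for the TPC-H lineitem case described above.
val lineitemHudiOptions = Map(
  "hoodie.table.name"                           -> "lineitem",
  "hoodie.datasource.write.operation"           -> "bulk_insert",
  "hoodie.datasource.write.table.type"          -> "COPY_ON_WRITE",        // or MERGE_ON_READ
  "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.recordkey.field"     -> "l_linenumber,l_orderkey",
  "hoodie.datasource.write.partitionpath.field" -> "l_shipmode",           // drop for the non-partitioned variant
  "hoodie.datasource.write.precombine.field"    -> "l_commitdate",
  "hoodie.bulkinsert.shuffle.parallelism"       -> "128"
)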
This is the official documentation for the dataset definitions, schema, queries, and the business logic behind the queries: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf
Are we talking about the proposal mentioned at https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi?
We need more clarification regarding this approach.
One of the attempts mentioned in the first comment might be similar; I will explain it in detail so we can check with you whether it would work for now and whether it produces a valid Hudi table:
Consider a 1TB parquet table as input, either partitioned or non-partitioned, with Spark resources of 256GB RAM and 32 cores:
Case non-partitioned
Case partitioned: same as above, with faster filter operations.
Pros:
Cons:
Questions:
We are excited as well to have a call together; please let me know how we can proceed with setting up this meeting.
A few clarifications:
Is it possible to share the data generation tool with us or point us to reproducing this ourselves locally? We can go much faster if we are able to repro this ourselves…
What’s your record key and partition key? (was unable to spot this from the snippet above)… If you have a monotonically increasing key (say a timestamp prefix) and already know the partition to which an incremental record (IIUC this means incoming writes) belongs, then upsert performance will be optimal.
If you have a workload that will touch every file, then you could use #1402, which is being built. Bloom filter checking will lead to opening up all files in that scenario anyway…
By default, upsert will also de-dupe the increments once… So if this is the norm, you can set
hoodie.combine.before.upsert=false
to avoid an extra shuffle.
so this is without overriding the user-defined partitioner? By the way, the two jobs you see are how Spark sort works: the first job does reservoir sampling to get the ranges, and the second one actually sorts…
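A minimal sketch of where that hoodie.combine.before.upsert flag would go among the datasource write options, reusing hudiOptions, df, and basePath from the code sample in the issue body:

import org.apache.spark.sql.SaveMode

// Upsert without the pre-combine/de-dupe shuffle, for incoming batches that are already
// unique on the record key.
val upsertOptions = hudiOptions ++ Map(
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.combine.before.upsert"      -> "false"
)
df.write.format("org.apache.hudi")
  .options(upsertOptions)
  .mode(SaveMode.Append)
  .save(basePath)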
bulk_insert was designed to do an initial sort and write data without incurring the large memory overheads associated with caching… The Spark cache is an LRU, so it will thrash a fair bit if you start spilling due to lack of memory. I would not recommend trying this…
yes… you are right… sorting gives you a dataset that is initially sorted/ordered by keys, and if you have ordered keys, hudi will preserve this and extract upsert performance by filtering out files not in range during indexing… At Uber, when we moved all the tables to hudi, we found this one-time sort well worth the initial cost… It repaid itself many times over the course of a quarter.
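To make the range-pruning mechanism explicit, a short sketch of the index-side settings it relies on; both values shown are the defaults in Hudi (listed only for clarity, and worth verifying against the 0.5.2 docs):

// Bloom index with range pruning: files whose min/max record-key range cannot contain an
// incoming key are skipped during index lookup, which is where key-ordered data pays off.
val indexOptions = Map(
  "hoodie.index.type"                  -> "BLOOM",  // default index type
  "hoodie.bloom.index.prune.by.ranges" -> "true"    // default; uses per-file key ranges
)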
It’s the action that triggers the actual parquet writing, so the 30-odd seconds you see is the actual cost of writing the data…
No… bulk_insert + sorting is what I recommend (with good key design) for a large-scale deployment like the one you are talking about… If you don’t want to convert all the data, @bvaradar & @umehrot2 will have the ability to seamlessly bootstrap the data into hudi without rewriting in the next release… (you’ll also have the ability to do a one-time bulk_insert for the last N partitions to get the upsert performance benefits, as we discussed above)…
I would be happy to jump on a call with you folks and get this moving along… I am also very excited to work with a user like yourself and move the perf aspects of the project along more…