Migrating parquet table to hudi issue [SUPPORT]


Describe the problem you faced

I have questions regarding the initial loading of a Hudi table (migrating from a parquet table to a Hudi table via bulk-insert), because we have encountered significantly high loading times. First, let me add the details for both tables we were trying to load, the Spark conf, the Hudi conf, and further modifications.

Sample of attempts:

  • Table 1: 6.7 GB parquet, 180M records, 16 columns, key is a composite of 2 columns. Spark conf: 1 executor, 12 cores, 16 GB, 32 shuffle partitions, 32 bulk-insert parallelism. Loading time: 25 min.
  • Table 2: 21 GB parquet, 600M records, 16 columns, key is a composite of 2 columns. Spark conf: 4 executors, 8 cores, 32 GB, 128 shuffle partitions, 128 bulk-insert parallelism. Loading time: 47 min.

Both tables are read from and written to the local file system.

To Reproduce

Code sample used:

import cluster.SparkConf
import common.DataConfig._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SparkSession

object HudiFilewriter {

  val COW = "COW"
  val MOR = "MOR"

  def main(args: Array[String]): Unit = {
    val tableName = args(0)
    val basePath = args(1)
    val tableType = if (COW.equalsIgnoreCase(args(2))) COW_TABLE_TYPE_OPT_VAL else MOR_TABLE_TYPE_OPT_VAL
    val rawTablePath = args(3)
    val partitionCol = args(4)

    val spark = SparkSession.builder()
      .getOrCreate()

    val logLevel = spark.sparkContext.getConf.get(SparkConf.LOG_LEVEL)
    spark.sparkContext.setLogLevel(logLevel)

    val shuffle = spark.sparkContext.getConf.get(SparkConf.SHUFFLE_PARTITIONS)

    val hudiOptions = Map[String, String](
      // HoodieWriteConfig
      TABLE_NAME -> tableName,
      "hoodie.bulkinsert.shuffle.parallelism" -> shuffle,

      // DataSourceWriteOptions
      TABLE_TYPE_OPT_KEY -> tableType,
      PRECOMBINE_FIELD_OPT_KEY -> UPDATE_COL,
      KEYGENERATOR_CLASS_OPT_KEY -> "org.apache.hudi.keygen.ComplexKeyGenerator",
      RECORDKEY_FIELD_OPT_KEY -> KEY_COLS.mkString(","),
      PARTITIONPATH_FIELD_OPT_KEY -> partitionCol,
      OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL
    )

    // Time the read of the raw parquet table and the Hudi bulk-insert write.
    spark.time {
      val df = spark.read.parquet(rawTablePath)

      df.write.format("org.apache.hudi").
        options(hudiOptions).
        mode(Overwrite).
        save(basePath)
    }
  }
}
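
For reference, a spark-submit invocation matching the Table 1 configuration above might look roughly like the sketch below. The jar name, the class's package prefix, the paths, and the spark.custom.* keys (standing in for whatever key names cluster.SparkConf.LOG_LEVEL and SHUFFLE_PARTITIONS hold) are placeholders, and the master/deploy-mode flags are omitted since they depend on the cluster manager:

# Placeholder jar, conf keys, and paths; resource flags mirror the Table 1 attempt.
spark-submit \
  --class HudiFilewriter \
  --num-executors 1 \
  --executor-cores 12 \
  --executor-memory 16g \
  --conf spark.sql.shuffle.partitions=32 \
  --conf spark.custom.logLevel=WARN \
  --conf spark.custom.shufflePartitions=32 \
  hudi-filewriter.jar \
  table1 /data/hudi/table1 COW /data/raw/table1 partition_col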

Expected behavior

Similar performance to vanilla parquet writing with additional sort overhead.

Environment Description

  • Hudi version : 0.5.2

  • Spark version : 2.4.5

  • Hive version : NA

  • Hadoop version : NA

  • Storage (HDFS/S3/GCS…) : Local file System

  • Running on Docker? (yes/no) : no

Additional context

Attempts:

  • We tried multiple Spark configurations: increasing the shuffle and bulk-insert parallelism, increasing the number of executors while keeping the base resources, and increasing the memory of the driver/executors.

  • Hudi table types: MOR partitioned and non-partitioned, COW partitioned and non-partitioned. For the partitioned tables we provided a partitioned version of the base table along with the partition column(s).

  • Hudi and Spark versions: “hudi-spark-bundle” % “0.5.1-incubating”, Spark 2.4.3, “spark-avro” % “2.4.3”.

  • Upgraded Hudi and Spark versions: “hudi-spark-bundle” % “0.5.2-incubating”, Spark 2.4.5, “spark-avro” % “2.4.5”.

  • Base data preparation: sorted by keys or pre-partitioned.

  • Loading the data partition by partition: filter the base table on the partition column and bulk-insert each resulting dataframe, so that each partition individually uses the whole application's resources during the write, with a new application for each partition.

None of the above attempts improved the loading time by much, and some made it worse. So I would like to know:

  • Is this the normal time for the initial loading of Hudi tables, or are we doing something wrong?
  • Do we need a better cluster/more resources to be able to load the data for the first time? The Hudi confluence page mentions that a COW bulk insert should match vanilla parquet writing plus a sort.
  • Does partitioning improve the upsert and/or compaction time for Hudi tables, or does it only improve analytical queries (partition pruning)?
  • We have noticed that most of the time is spent in the data indexing (the bulk-insert logic itself) and not in the sorting stages before it, so how can we improve that? Should we provide our own indexing logic? (See the sketch below.)
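
For context on the last point, the index implementation is selectable via configuration rather than requiring custom bulk-insert logic; a minimal sketch, assuming Hudi 0.5.x option keys (the custom class name is hypothetical):

// Index options consulted when tagging records during upsert/delete
// (bulk_insert itself skips the index lookup).
val indexOptions = Map(
  "hoodie.index.type" -> "BLOOM"  // e.g. BLOOM, GLOBAL_BLOOM, HBASE, INMEMORY
  // Alternatively, plug in a custom org.apache.hudi.index.HoodieIndex implementation:
  // , "hoodie.index.class" -> "com.example.MyCustomIndex"  // hypothetical class name
)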

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

2 reactions
ahmed-elfar commented, Apr 15, 2020

@vinothchandar I apologize for the delayed response, and thanks again for your help and detailed answers.

Is it possible to share the data generation tool with us or point us to reproducing this ourselves locally? We can go much faster if we are able to repro this ourselves…

Sure, this is the public repo for generating the data: https://github.com/gregrahn/tpch-kit. It provides the information you need for data generation, sizes, etc.

You can use this command to generate lineitem with scale 10 GB:
DSS_PATH=/output/path ./dbgen -T L 10
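
For anyone reproducing this, a minimal sketch of converting the generated file into the parquet input table we load; the paths are placeholders and the column names follow the standard TPC-H lineitem schema:

import org.apache.spark.sql.SparkSession

// dbgen writes pipe-delimited .tbl files with no header.
val spark = SparkSession.builder().getOrCreate()

val lineitemCols = Seq(
  "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity",
  "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus",
  "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct",
  "l_shipmode", "l_comment")

spark.read
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv("/output/path/lineitem.tbl")
  .drop("_c16")                      // dbgen lines end with '|', which may yield a trailing empty column
  .toDF(lineitemCols: _*)
  .write.parquet("/data/tpch/lineitem")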


Schema for lineitem

Adding more details and updating the schema screenshot mentioned in the previous comment:

RECORDKEY_FIELD_OPT_KEY: composite of (l_linenumber, l_orderkey)
PARTITIONPATH_FIELD_OPT_KEY: optional, default none (non-partitioned), or l_shipmode
PRECOMBINE_FIELD_OPT_KEY: l_commitdate, or a newly generated timestamp column last_updated
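
In option form, that configuration looks roughly like the sketch below; the string keys correspond to the DataSourceWriteOptions/HoodieWriteConfig constants used in the code sample above, and the table name is arbitrary:

val lineitemHudiOptions = Map(
  "hoodie.table.name" -> "lineitem",
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.recordkey.field" -> "l_linenumber,l_orderkey",
  "hoodie.datasource.write.partitionpath.field" -> "l_shipmode",  // omit for the non-partitioned variant
  "hoodie.datasource.write.precombine.field" -> "l_commitdate"    // or a generated last_updated column
)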

[Screenshot: lineitem schema, from 2020-04-14]

This is the official documentation for the dataset definitions, schema, queries, and the business logic behind the queries: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf


@bvaradar & @umehrot2 will have the ability to seamlessly bootstrap the data into hudi without rewriting in the next release.

Are we talking about the proposal mentioned at https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi ? We need more clarification regarding this approach.


you’ll also have the ability to do a one-time bulk_insert for last N partitions to get the upsert performance benefits as we discussed above

One of the attempts mentioned in the first comment might be similar; I will explain it in detail so you can check whether it should work for now, and whether it produces a valid Hudi table:

Consider an input table of 1 TB in parquet format, either partitioned or non-partitioned, and Spark resources of 256 GB RAM and 32 cores:

Case non-partitioned

  • We use the suggested/recommended partition column(s) (we pick the first column in the partition path), then project this partition column and apply distinct, which provides the filter values to pass to the next stage of the pipeline.
  • The next step submits sequential Spark applications, each filtering the input data on the passed filter value, resulting in a dataframe for a single partition.
  • Write (bulk-insert) the filtered dataframe to the Hudi table with the provided partition column(s) using save mode Append (see the sketch after this list).
  • The Hudi table is written partition by partition.
  • We query the Hudi table to check whether it is valid, and it looks valid.
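
A minimal sketch of that loop; in our runs each partition value was handled by its own Spark application rather than a single loop, the paths and the partition column name are placeholders, and hudiOptions is the map from the code sample in the issue description:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Load the Hudi table one partition value at a time, appending each slice.
val base = spark.read.parquet("/data/raw/big_table")

val partitionValues = base.select("partition_col").distinct()
  .collect().map(_.getString(0))

partitionValues.foreach { value =>
  base.filter(col("partition_col") === value)
    .write.format("org.apache.hudi")
    .options(hudiOptions)              // includes PARTITIONPATH_FIELD_OPT_KEY -> "partition_col"
    .mode(SaveMode.Append)
    .save("/data/hudi/big_table")
}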

Case partitioned: same as above, with faster filter operations.

Pros:

  • Avoids a lot of disk spilling and GC hits.
  • Uses fewer resources for the initial load.

Cons:

  • No time improvement if you have enough resources to load the table at once.
  • We end up with a partitioned table, which might not be needed in some of our use cases.

Questions:

  • Is this approach valid, or will it impact the upsert operations in the future?

I would be happy to jump on a call with you folks and get this moving along… I am also very excited to work with a user like yourself and move the perf aspects of the project along more…

We are excited as well to have a call together; please let me know how we can proceed with scheduling this meeting.

1 reaction
vinothchandar commented, Apr 13, 2020

A few clarifications:

For initial bench-marking we generate standard tpch data

Is it possible to share the data generation tool with us or point us to reproducing this ourselves locally? We can go much faster if we are able to repro this ourselves…

Schema for lineitem

What’s your record key and partition key? (I was unable to spot this from the snippet above)… If you have a monotonically increasing key (say a timestamp prefix) and already know the partition to which an incremental record (IIUC this means incoming writes) belongs, then upsert performance will be optimal.

We have 2 versions of generated updates, one which touches only the last quarter of the year and another one generated randomly to touch most of the parquet parts during updates.

If you have a workload that will touch every file, then you could use #1402, which is being built. Bloom filter checking will lead to opening up all files in that scenario anyway…

Currently we generate no duplicates for the base table and increments.

By default, upsert will also de-dupe the increments once… So if this is the norm, you can set hoodie.combine.before.upsert=false to avoid an extra shuffle.
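
For instance, a minimal sketch that just adds the flag to the options map from the original snippet:

// Skip the pre-upsert de-duplication shuffle when the increments contain no duplicates.
val upsertOptions = hudiOptions +
  ("hoodie.datasource.write.operation" -> "upsert") +
  ("hoodie.combine.before.upsert" -> "false")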

As you can see the data has been shuffled to disk twice, applying the sorting twice as well.

So this is without overriding the user-defined partitioner? BTW, the two jobs you see are how Spark sort works: the first job does reservoir sampling to get ranges, and the second one actually sorts…

Eagerly persist the input RDD before the bulk-insert, which uses the same sorting provided before the bulk-insert.

bulk_insert was designed to do an initial sort and write data without incurring the large memory overheads associated with caching… The Spark cache is an LRU… so it will thrash a fair bit if you start spilling due to lack of memory. I would not recommend trying this…

Note: this approach impact the Upsert time significantly specially if you didn’t apply any sorting to the data, it might be because the upsert operation touched most of the parquet parts.

Yes… you are right… sorting gives you a dataset which is initially sorted/ordered by keys, and if you have ordered keys, Hudi will preserve this and extract upsert performance by filtering out files not in range during indexing… At Uber, when we moved all the tables to Hudi, we found this one-time sort well worth the initial cost… It repaid itself many times over the course of a quarter.
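
The range pruning mentioned here is governed by a bloom index option; a minimal sketch, assuming the Hudi 0.5.x key name (it is already enabled by default, shown only for illustration):

// Prune files during index lookup using each file's min/max record key range;
// effective when the record keys are ordered, enabled by default.
val rangePruning = "hoodie.bloom.index.prune.by.ranges" -> "true"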

it is actually a narrow transformation after the sorting operation.

It’s the action that triggers the actual parquet writing. So the 30-odd seconds you see is the actual cost of writing the data…

If I get your suggestion right, would you suggest to initially load the table using upsert or insert operation for the whole table instead of bulk-insert?

No… bulk_insert + sorting is what I recommend (with good key design) for a large-scale deployment like you are talking about… if you don’t want to convert all data, @bvaradar & @umehrot2 will have the ability to seamlessly bootstrap the data into hudi without rewriting in the next release… (you’ll also have the ability to do a one-time bulk_insert for last N partitions to get the upsert performance benefits as we discussed above)…

I would be happy to jump on a call with you folks and get this moving along… I am also very excited to work with a user like yourself and move the perf aspects of the project along more…
