[Spark load] Doris support Spark load

Many users who load data into Doris for the first time have a large amount of data, about 10 GB or more. It is hard to load that much data into Doris in one go using Broker load or Stream load. To resolve this problem, we propose a new solution that loads data using a Spark cluster.

In Spark load, the Spark cluster preprocesses the data (building the global dictionary for bitmap columns, partitioning, sorting, and aggregation). This improves Doris load performance for large data volumes and saves Doris computing resources.
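To make the four preprocessing steps concrete, here is a minimal sketch in plain Python. It is not the Spark ETL code from this proposal. The row shape `(key, user, amount)`, the modulo partitioning rule, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_global_dict(values):
    """Map each distinct value to a dense integer id (hypothetical dict layout)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def preprocess(rows, num_partitions):
    """Partition, aggregate, and sort rows of (int key, user, amount).

    Returns {partition: [(key, sorted user ids, summed amount), ...]},
    with the rows inside each partition sorted by key.
    """
    rows = list(rows)
    # 1. Global dict build: encode users as integers so they fit a bitmap.
    gdict = build_global_dict(user for _, user, _ in rows)

    partitions = defaultdict(dict)
    for key, user, amount in rows:
        # 2. Partition: assumed rule, key modulo the partition count.
        part = partitions[key % num_partitions]
        # 3. Aggregate: union the user ids and sum the amounts per key.
        ids, total = part.get(key, (set(), 0))
        ids.add(gdict[user])
        part[key] = (ids, total + amount)

    # 4. Sort: order rows by key inside each partition.
    return {
        p: [(k, sorted(ids), total) for k, (ids, total) in sorted(agg.items())]
        for p, agg in sorted(partitions.items())
    }


rows = [(2, "bob", 5), (1, "alice", 10), (1, "bob", 1), (3, "alice", 2)]
result = preprocess(rows, num_partitions=2)
```

In this sketch the output is already partitioned, sorted, and aggregated before any load step sees it, which is the kind of work the proposal moves from Doris to the Spark cluster.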

Spark load is mainly used for the initial migration from other systems or loading large amounts of data into Doris.

                 +
                 | 0. User create spark load job
            +----v----+
            |   FE    |---------------------------------+
            +----+----+                                 |
                 | 3. FE send push tasks                |
                 | 5. FE publish version                |
    +------------+------------+                         |
    |            |            |                         |
+---v---+    +---v---+    +---v---+                     |
|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
+---^---+    +---^---+    +---^---+                     |
    |4. BE push with broker   |                         |
+---+---+    +---+---+    +---+---+                     |
|Broker |    |Broker |    |Broker |                     |
+---^---+    +---^---+    +---^---+                     |
    |            |            |                         |
+---+------------+------------+---+ 2.ETL +-------------v---------------+
|               HDFS              +------->       Spark cluster         |
|                                 <-------+                             |
+---------------------------------+       +-----------------------------+
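The numbered steps in the diagram can be read as an ordered sequence. The sketch below models only that ordering. `SparkLoadJob` and `SPARK_LOAD_STEPS` are hypothetical names, not Doris classes, and the wording of steps 2 and 4 is one reading of the arrows in the diagram.

```python
# Steps as numbered in the diagram: (number, actor, action).
SPARK_LOAD_STEPS = [
    (0, "User", "create Spark load job"),
    (1, "FE", "submit Spark ETL job to the Spark cluster"),
    (2, "Spark cluster", "run ETL against HDFS"),
    (3, "FE", "send push tasks to the BEs"),
    (4, "BE", "push data with a Broker"),
    (5, "FE", "publish version"),
]


class SparkLoadJob:
    """Hypothetical tracker that only allows the steps in diagram order."""

    def __init__(self):
        self.completed = []

    def advance(self, number):
        expected = len(self.completed)
        if number != expected:
            raise ValueError(f"expected step {expected}, got step {number}")
        self.completed.append(number)
        return SPARK_LOAD_STEPS[number]

    @property
    def finished(self):
        return len(self.completed) == len(SPARK_LOAD_STEPS)


job = SparkLoadJob()
for number, actor, action in SPARK_LOAD_STEPS:
    job.advance(number)
    print(f"{number}. {actor}: {action}")
```

The point of the ordering check is that the BEs receive push tasks only after the Spark ETL job has finished, and the new version is published last.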