[Spark load] Doris support Spark load

Many users who load data into Doris for the first time have a large amount of data, about 10 GB or more. It is hard to load that much data into Doris in one pass using Broker load or Stream load. To resolve this problem, we propose a new solution that loads data by using a Spark cluster.

In Spark load, a Spark cluster preprocesses the data (building the global dictionary for bitmap columns, partitioning, sorting, and aggregation). This improves Doris load performance for large data volumes and saves the computing resources of Doris.
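
The preprocessing stages named above can be illustrated with a minimal, plain-Python sketch. It is not Doris or Spark code: the function names `build_global_dict` and `preprocess`, the integer keys, and the choice of SUM as the aggregation are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_global_dict(values):
    """Map each distinct value to a dense integer id (hypothetical helper).

    Assumption: a global dictionary lets bitmap columns store integer ids
    instead of the original values.
    """
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def preprocess(rows, num_partitions):
    """Partition, aggregate, and sort (key, value) rows (hypothetical helper).

    rows: iterable of (int key, int value) pairs.
    Returns one sorted list of (key, summed value) pairs per partition.
    """
    partitions = [defaultdict(int) for _ in range(num_partitions)]
    for key, value in rows:
        # Partition: route the row to a bucket chosen by its key.
        bucket = key % num_partitions
        # Aggregate: rows with the same key are merged (SUM is assumed here).
        partitions[bucket][key] += value
    # Sort: each partition is emitted in key order.
    return [sorted(p.items()) for p in partitions]


if __name__ == "__main__":
    print(build_global_dict(["cn", "us", "cn", "de"]))
    print(preprocess([(3, 10), (1, 5), (3, 2), (4, 1), (2, 7), (1, 1)], 2))
```

In the proposal this work runs on the Spark cluster rather than on Doris, which is where the saving in Doris computing resources comes from.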

Spark load is mainly used for the initial migration from other systems or loading large amounts of data into Doris.

                 +
                 | 0. User create spark load job
            +----v----+
            |   FE    |---------------------------------+
            +----+----+                                 |
                 | 3. FE send push tasks                |
                 | 5. FE publish version                |
    +------------+------------+                         |
    |            |            |                         |
+---v---+    +---v---+    +---v---+                     |
|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
+---^---+    +---^---+    +---^---+                     |
    |4. BE push with broker   |                         |
+---+---+    +---+---+    +---+---+                     |
|Broker |    |Broker |    |Broker |                     |
+---^---+    +---^---+    +---^---+                     |
    |            |            |                         |
+---+------------+------------+---+ 2.ETL +-------------v---------------+
|               HDFS              +------->       Spark cluster         |
|                                 <-------+                             |
+---------------------------------+       +-----------------------------+
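
The numbered steps in the diagram are scattered across the drawing, so the sketch below lists them in order. It only transcribes the diagram labels; the names `SPARK_LOAD_STEPS` and `describe` are hypothetical, and the wording of step 2 is an interpretation of the two arrows between HDFS and the Spark cluster.

```python
# Steps transcribed from the diagram: (step number, actor, action).
SPARK_LOAD_STEPS = [
    (3, "FE", "send push tasks"),
    (0, "User", "create Spark load job"),
    (5, "FE", "publish version"),
    (1, "FE", "submit Spark ETL job"),
    # Assumption: the ETL job reads its input from HDFS and writes results back.
    (2, "Spark cluster", "run ETL against HDFS"),
    (4, "BE", "push with broker"),
]


def describe(steps):
    """Return the steps as readable lines, ordered by step number."""
    return [f"{number}. {actor}: {action}" for number, actor, action in sorted(steps)]


if __name__ == "__main__":
    for line in describe(SPARK_LOAD_STEPS):
        print(line)
```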

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

imay commented, Apr 29, 2020 (2 reactions)
  1. You can reference this issue in related PRs and issues.
  2. I will create a project “Spark Load” to track this feature.
  3. You can create an issue for each part of this project.

xy720 commented, Jul 1, 2020 (0 reactions)

BE handles push task: #3742 [Spark load][Be 1/1] Be handle push task

Other: #3878 [Spark load][broker load] Optimize reading parquet format file
