[Spark load] Doris support Spark load
Many users who load data into Doris for the first time have a large amount of data (roughly 10 GB or more), and it is hard to load that much data into Doris in one go with Broker load or Stream load. To solve this problem, we propose a new solution that loads data using a Spark cluster.
In Spark load, the Spark cluster preprocesses the data (building the global dictionary for bitmap columns, partitioning, sorting, and aggregation), which improves Doris load performance for large data volumes and saves Doris's computing resources.
Spark load is mainly used for the initial migration from other systems or loading large amounts of data into Doris.
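To make the preprocessing steps more concrete, here is a minimal sketch in plain Python of what "global dict build, partition, sort, aggregation" could look like on a toy dataset. It is not the Doris or Spark implementation: the function names, the row layout `(key, user, amount)`, the modulo partitioning, and the use of a Python set as a stand-in for a bitmap are all assumptions made for illustration.

```python
from collections import defaultdict


def build_global_dict(values):
    """Map each distinct value to a dense integer id.

    Hypothetical helper: ids start at 1 and follow sorted order,
    which is an assumption made for this sketch only.
    """
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def preprocess(rows, num_partitions):
    """Preprocess rows of (key, user, amount) before loading.

    Steps, mirroring the list in the text:
      1. global dict build: encode `user` strings as integers
      2. aggregation: per key, union the encoded users and sum `amount`
      3. sort: visit keys in ascending order
      4. partition: place each key in bucket `key % num_partitions`
    """
    rows = list(rows)
    user_ids = build_global_dict(r[1] for r in rows)

    # key -> [set of encoded users (stand-in for a bitmap), summed amount]
    agg = defaultdict(lambda: [set(), 0])
    for key, user, amount in rows:
        slot = agg[key]
        slot[0].add(user_ids[user])
        slot[1] += amount

    # Keys are visited in sorted order, so each partition ends up sorted too.
    partitions = [[] for _ in range(num_partitions)]
    for key in sorted(agg):
        ids, total = agg[key]
        partitions[key % num_partitions].append((key, sorted(ids), total))
    return user_ids, partitions
```

Calling `preprocess(rows, 2)` on a handful of rows returns the dictionary and two sorted, pre-aggregated partitions. The point of doing this work outside Doris is that the BE nodes only have to ingest data that is already encoded, ordered, and reduced.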
                 +
                 | 0. User create spark load job
            +----v----+
            |   FE    |---------------------------------+
            +----+----+                                 |
                 | 3. FE send push tasks                |
                 | 5. FE publish version                |
    +------------+------------+                         |
    |            |            |                         |
+---v---+    +---v---+    +---v---+                     |
|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
+---^---+    +---^---+    +---^---+                     |
    |4. BE push with broker   |                         |
+---+---+    +---+---+    +---+---+                     |
|Broker |    |Broker |    |Broker |                     |
+---^---+    +---^---+    +---^---+                     |
    |            |            |                         |
+---+------------+------------+---+ 2.ETL +-------------v---------------+
|               HDFS              +------->       Spark cluster         |
|                                 <-------+                             |
+---------------------------------+       +-----------------------------+
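The numbered steps in the diagram can be read as a fixed sequence that a load job walks through. The sketch below models that sequence in Python. It is only an illustration: the class `SparkLoadJob`, its fields, and the wording of the ETL step (reading from and writing back to HDFS, inferred from the two arrows) are assumptions, not names or behavior taken from the Doris code base.

```python
from dataclasses import dataclass, field

# Steps 0-5 as labelled in the diagram: (actor, action).
STEPS = [
    ("User", "create spark load job"),
    ("FE", "submit Spark ETL job"),
    ("Spark cluster", "ETL between HDFS and Spark"),  # direction details assumed
    ("FE", "send push tasks"),
    ("BE", "push with broker"),
    ("FE", "publish version"),
]


@dataclass
class SparkLoadJob:
    """Hypothetical model of one load job moving through the steps."""

    label: str
    next_step: int = 0
    log: list = field(default_factory=list)

    @property
    def done(self):
        return self.next_step >= len(STEPS)

    def advance(self):
        """Perform the next step in order; steps cannot be skipped."""
        if self.done:
            raise RuntimeError(f"job {self.label!r} already finished")
        actor, action = STEPS[self.next_step]
        self.log.append(f"{self.next_step}. {actor}: {action}")
        self.next_step += 1

    def run(self):
        """Advance until the final step (publish version) has been logged."""
        while not self.done:
            self.advance()
        return self.log
```

Running `SparkLoadJob("demo").run()` yields the six steps in diagram order. Modelling the flow as a strict sequence reflects the ordering shown in the diagram, where the push tasks (step 3) come after the ETL output is produced (step 2).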
Be handle push task #3742 [Spark load][Be 1/1] Be handle push task
Other #3878 [Spark load][broker load]Optimize reading parquet format file