OlapTableSink::send is inefficient?
Problem

When we use broker load, OlapTableSink::send() takes the longest time, almost all of the plan fragment's active time. Here is an example from one BE:
```
Fragment f59d832368a84c94-be109f903cf4698d:(Active: 3h36m, % non-child: 0.00%)
   - AverageThreadTokens: 1.00
   - PeakReservation: 0
   - PeakUsedReservation: 0
   - RowsProduced: 168.61M
   - SizeProduced: 30.25 GB
  BlockMgr:
     - BlockWritesOutstanding: 0
     - BlocksCreated: 0
     - BlocksRecycled: 0
     - BufferedPins: 0
     - BytesWritten: 0
     - MaxBlockSize: 8.00 MB
     - MemoryLimit: 2.00 GB
     - TotalBufferWaitTime: 0.000ns
     - TotalEncryptionTime: 0.000ns
     - TotalIntegrityCheckTime: 0.000ns
     - TotalReadBlockTime: 0.000ns
  OlapTableSink:(Active: 3h35m, % non-child: 0.00%)
     - CloseTime: 102.932ms
     - ConvertBatchTime: 0.000ns
     - OpenTime: 247.194ms
     - RowsFiltered: 0
     - RowsRead: 168.61M
     - RowsReturned: 168.61M
     - SendDataTime: 3h34m
     - SerializeBatchTime: 8m26s
     - ValidateDataTime: 19s554ms
     - WaitInFlightPacketTime: 3h23m
  BROKER_SCAN_NODE (id=0):(Active: 1m8s, % non-child: 0.00%)
     - BytesRead: 0
     - MemoryUsed: 0
     - NumThread: 0
     - PerReadThreadRawHdfsThroughput: 0.00 /sec
     - RowsRead: 168.61M
     - RowsReturned: 168.61M
     - RowsReturnedRate: 2.48 M/sec
     - ScanRangesComplete: 0
     - ScannerThreadsInvoluntaryContextSwitches: 0
     - ScannerThreadsTotalWallClockTime: 0.000ns
        - MaterializeTupleTime(*): 5m37s
        - ScannerThreadsSysTime: 0.000ns
        - ScannerThreadsUserTime: 0.000ns
     - ScannerThreadsVoluntaryContextSwitches: 0
     - TotalRawReadTime(*): 38m58s
     - TotalReadThroughput: 0.00 /sec
     - WaitScannerTime: 1m7s
```
As can be seen above, WaitInFlightPacketTime is the most time-consuming portion.
Analysis

I describe the whole process here.

PlanFragmentExecutor pseudocode:
```cpp
// A single executor thread drives the sink, batch by batch.
while (true) {
    batch = get_one_batch();
    OlapTableSink::send(batch);  // returns only after every row is handled
}
```
Then, OlapTableSink::send() pseudocode:
```cpp
for (row in batch) {
    channel = get_corresponding_channel(row);
    // channel::add_row() expands to:
    ok = channel->add_row_in_cur_batch(row);
    if (!ok) {  // this channel's current batch is full
        if (channel->has_in_flight_packet()) {
            channel->wait_in_flight_packet();  // (*) blocks the whole fragment
        }
        channel->send_add_batch_req();
        channel->add_row_in_cur_batch(row);
    }
    // channel::add_row() end
}
```
So whenever channel::wait_in_flight_packet() is triggered, it blocks the whole process, yet there is no need to block the other channels' add_row(). For example, while channel0 is waiting for its in-flight packet, we could still add rows to the other channels.
Better solutions (preliminary thoughts)
- Make channel::add_row() non-blocking. This might be a massive change.
- Make channel::add_row() less blocking, e.g. avoid adding rows to channel0 immediately after channel0 sends an add_batch request (see the sketch after this list).
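For the second option, here is a minimal sketch of what a "less blocking" add_row() could look like. The overflow buffer (_pending_rows, kMaxPending) and the helper names are hypothetical illustrations, not the actual Doris implementation:

```cpp
#include <cstddef>
#include <vector>

struct Row;                              // opaque row, stand-in for the real type
struct RowBatch { bool add_row(Row*); }; // returns false when the batch is full

// Hypothetical "less blocking" NodeChannel: instead of joining the RPC
// as soon as the current batch fills up, stash rows in a small overflow
// buffer and only block when that buffer is also full.
class NodeChannel {
public:
    bool add_row(Row* row) {
        if (_cur_batch->add_row(row)) return true;  // fast path: batch has room
        if (has_in_flight_packet()) {
            if (_pending_rows.size() < kMaxPending) {
                _pending_rows.push_back(row);       // defer instead of blocking
                return true;
            }
            wait_in_flight_packet();  // last resort: overflow buffer is full too
        }
        send_add_batch_req();   // _cur_batch becomes the new in-flight packet
        drain_pending_rows();   // move deferred rows into the fresh batch
                                // (sketch: assumes the fresh batch has room)
        return _cur_batch->add_row(row);
    }

private:
    bool has_in_flight_packet();
    void wait_in_flight_packet();
    void send_add_batch_req();
    void drain_pending_rows();

    static constexpr std::size_t kMaxPending = 1024;
    RowBatch* _cur_batch;
    std::vector<Row*> _pending_rows;
};
```

The trade-off is a bounded amount of extra memory per channel in exchange for hitting the forced wait far less often.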
https://github.com/apache/incubator-doris/pull/2956#issuecomment-596889947 As mentioned, here's the new design of OlapTableSink: add one sender thread to do non-blocking sending. Let me explain what non-blocking means here.
The original version of OlapTableSink can be abstracted as one queue (containing all batches across all node channels). One thread consumes the queue's items one by one. When it wants to send a batch of a NodeChannel which still has an in-flight packet (an RPC that hasn't returned its response yet), it must wait (join on the RPC). For example, a batch for index id=0 && node id=2 is denoted "B1(0-2)". The abstract queue looks like: B0(0-1), B1(0-2), B2(0-1), B3(1-4), ...
When we want to send B2(0-1), we must wait for B0's response, because B0 was sent on the same channel. But if we set B2(0-1) aside and send the next item, B3(1-4), it won't be blocked.
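To make the single-queue model concrete, here is a sketch of the blocking consumer described above; Batch, NodeChannel, and channel_of() are simplified stand-ins for the real types, not Doris code:

```cpp
#include <queue>

// Illustrative stand-ins for the real types (assumptions).
struct Batch { int seq; int index_id; int node_id; /* rows ... */ };
struct NodeChannel {
    bool has_in_flight_packet();
    void wait_in_flight_packet();       // joins the outstanding RPC
    void send_async(const Batch& b);    // becomes the new in-flight packet
};
NodeChannel* channel_of(int index_id, int node_id);

// The original model: one consumer thread drains one shared queue and
// blocks whenever the target channel still has an in-flight RPC.
void consume(std::queue<Batch>& q) {
    while (!q.empty()) {
        Batch b = q.front();
        q.pop();
        NodeChannel* ch = channel_of(b.index_id, b.node_id);
        if (ch->has_in_flight_packet()) {
            ch->wait_in_flight_packet();  // e.g. B2(0-1) must join on B0(0-1)
        }
        ch->send_async(b);
    }
}
```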
So I initially split the one queue into multiple queues (since abandoned); for details, see https://github.com/apache/incubator-doris/issues/2780#issuecomment-588156273. In that design the batches are partitioned into queue0, queue1, queue2, and so on (the per-queue diagrams are not reproduced here). Each queue needs its own thread to consume items, so the blocking time is amortized across multiple queues, but it is still a blocking approach.
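For illustration, the abandoned multi-queue variant could hash each batch to a fixed queue and give every queue its own consumer thread. This sketch reuses Batch and consume() from above; kNumQueues and queue_of() are assumptions:

```cpp
#include <queue>
#include <thread>
#include <vector>

constexpr int kNumQueues = 3;

// Pin each (index, node) pair to one queue so a blocked channel only
// stalls the batches that land in its own queue.
int queue_of(const Batch& b) {
    return (b.index_id * 31 + b.node_id) % kNumQueues;
}

void start_consumers(std::vector<std::queue<Batch>>& queues) {
    std::vector<std::thread> workers;
    for (auto& q : queues) {
        workers.emplace_back([&q] { consume(q); });  // consume() as above
    }
    for (auto& w : workers) w.join();
}
```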
The new design is non-blocking.
We can save batches in the NodeChannels (as pending batches) and try to send one pending batch from each channel in turn. If the current channel has an in-flight packet, we just skip it in this round (see the sketch below).
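A minimal sketch of that idea, reusing the Batch struct from the earlier sketch: each channel owns a deque of pending batches, and one sender thread polls all channels. PendingChannel and its member names are illustrative, not the PR's actual code:

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <vector>

// Illustrative channel for the non-blocking design: batches pile up in
// _pending and are drained opportunistically by the sender thread.
class PendingChannel {
public:
    void add_pending(Batch b) {
        std::lock_guard<std::mutex> l(_lock);
        _pending.push_back(std::move(b));
    }

    // Try to send one batch; return false (skip) instead of blocking.
    bool try_send_one() {
        std::lock_guard<std::mutex> l(_lock);
        if (_in_flight || _pending.empty()) return false;
        _in_flight = true;  // cleared by the RPC completion callback
        Batch b = std::move(_pending.front());
        _pending.pop_front();
        send_async(b);
        return true;
    }

private:
    void send_async(const Batch& b);  // issues the add_batch RPC

    std::mutex _lock;
    std::deque<Batch> _pending;
    bool _in_flight = false;
};

// One sender thread polls every channel; a channel with an in-flight
// packet is skipped this round instead of blocking the others.
// (A real implementation would back off when all channels are idle.)
void sender_loop(std::vector<PendingChannel*>& channels, std::atomic<bool>& done) {
    while (!done.load()) {
        for (auto* ch : channels) {
            ch->try_send_one();
        }
    }
}
```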
The implementation is coming soon.
There's no need to limit memory in the sink node. When we create a new RowBatch in a NodeChannel, we use the sink node's mem_tracker. As we know, the sink node and the scan node share the same ancestor mem_tracker (the query mem tracker, 2GB by default), so the memory limit is effectively a matter for the scan node.
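To illustrate the tracker hierarchy this paragraph relies on, here is a simplified, self-contained sketch; this MemTracker is a stand-in for Doris's actual class:

```cpp
#include <cstdint>
#include <string>

// Simplified mem tracker: consumption propagates to the parent, so all
// children effectively share the parent's limit.
class MemTracker {
public:
    MemTracker(std::int64_t limit, std::string label, MemTracker* parent = nullptr)
        : _limit(limit), _label(std::move(label)), _parent(parent) {}

    void consume(std::int64_t bytes) {
        _consumed += bytes;
        if (_parent) _parent->consume(bytes);  // charge the ancestor too
    }

    bool limit_exceeded() const { return _limit >= 0 && _consumed > _limit; }

private:
    std::int64_t _limit;        // -1 means no local limit
    std::int64_t _consumed = 0;
    std::string _label;
    MemTracker* _parent;
};

int main() {
    // The query tracker (2GB by default) is the common ancestor.
    MemTracker query(2LL * 1024 * 1024 * 1024, "query");
    MemTracker scan(-1, "scan_node", &query);  // no local limit
    MemTracker sink(-1, "sink_node", &query);  // no local limit

    // RowBatches created in NodeChannels charge the sink tracker, which
    // rolls up into the query tracker shared with the scan node.
    sink.consume(512LL * 1024 * 1024);
    scan.consume(1600LL * 1024 * 1024);
    return query.limit_exceeded() ? 1 : 0;  // exceeded: the scan must throttle
}
```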