OlapTableSink::send is inefficient?
Problem

When we use broker load, OlapTableSink::send() takes the longest time, almost all of the plan fragment's active time. Here is an example from one BE:
```
Fragment f59d832368a84c94-be109f903cf4698d:(Active: 3h36m, % non-child: 0.00%)
   - AverageThreadTokens: 1.00
   - PeakReservation: 0
   - PeakUsedReservation: 0
   - RowsProduced: 168.61M
   - SizeProduced: 30.25 GB
  BlockMgr:
     - BlockWritesOutstanding: 0
     - BlocksCreated: 0
     - BlocksRecycled: 0
     - BufferedPins: 0
     - BytesWritten: 0
     - MaxBlockSize: 8.00 MB
     - MemoryLimit: 2.00 GB
     - TotalBufferWaitTime: 0.000ns
     - TotalEncryptionTime: 0.000ns
     - TotalIntegrityCheckTime: 0.000ns
     - TotalReadBlockTime: 0.000ns
  OlapTableSink:(Active: 3h35m, % non-child: 0.00%)
     - CloseTime: 102.932ms
     - ConvertBatchTime: 0.000ns
     - OpenTime: 247.194ms
     - RowsFiltered: 0
     - RowsRead: 168.61M
     - RowsReturned: 168.61M
     - SendDataTime: 3h34m
     - SerializeBatchTime: 8m26s
     - ValidateDataTime: 19s554ms
     - WaitInFlightPacketTime: 3h23m
  BROKER_SCAN_NODE (id=0):(Active: 1m8s, % non-child: 0.00%)
     - BytesRead: 0
     - MemoryUsed: 0
     - NumThread: 0
     - PerReadThreadRawHdfsThroughput: 0.00 /sec
     - RowsRead: 168.61M
     - RowsReturned: 168.61M
     - RowsReturnedRate: 2.48 M/sec
     - ScanRangesComplete: 0
     - ScannerThreadsInvoluntaryContextSwitches: 0
     - ScannerThreadsTotalWallClockTime: 0.000ns
        - MaterializeTupleTime(*): 5m37s
        - ScannerThreadsSysTime: 0.000ns
        - ScannerThreadsUserTime: 0.000ns
     - ScannerThreadsVoluntaryContextSwitches: 0
     - TotalRawReadTime(*): 38m58s
     - TotalReadThroughput: 0.00 /sec
     - WaitScannerTime: 1m7s
```
As can be seen above, WaitInFlightPacketTime is the most time-consuming portion.
Analysis

I describe the whole process here.

PlanFragmentExecutor pseudocode:
```cpp
// A single executor thread drives the sink, batch by batch.
while (true) {
    batch = get_one_batch();
    OlapTableSink::send(batch);  // returns only after every row is handled
}
```
Then, OlapTableSink::send() pseudocode:
```cpp
for (row in batch) {
    channel = get_corresponding_channel(row);
    // channel::add_row() expands to:
    ok = channel->add_row_in_cur_batch(row);
    if (!ok) {  // this channel's current batch is full
        if (channel->has_in_flight_packet()) {
            channel->wait_in_flight_packet();  // (*) blocks the whole fragment
        }
        channel->send_add_batch_req();
        channel->add_row_in_cur_batch(row);
    }
    // channel::add_row() end
}
```
So whenever channel::wait_in_flight_packet() is triggered, it blocks the whole process, yet there is no need to block the other channels' add_row(). For example, while channel0 is waiting for its in-flight packet, we could still add rows to the other channels.
Better solutions (preliminary thoughts)
- Make channel::add_row() non-blocking. This might be a massive change.
- Make channel::add_row() less blocking, e.g. avoid adding rows to channel0 immediately after channel0 sends an add_batch request (see the sketch after this list).
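For the second option, here is a minimal sketch of what a "less blocking" add_row() could look like. The overflow buffer (_pending_rows, kMaxPending) and the helper names are hypothetical illustrations, not the actual Doris implementation:

```cpp
#include <cstddef>
#include <vector>

struct Row;                              // opaque row, stand-in for the real type
struct RowBatch { bool add_row(Row*); }; // returns false when the batch is full

// Hypothetical "less blocking" NodeChannel: instead of joining the RPC
// as soon as the current batch fills up, stash rows in a small overflow
// buffer and only block when that buffer is also full.
class NodeChannel {
public:
    bool add_row(Row* row) {
        if (_cur_batch->add_row(row)) return true;  // fast path: batch has room
        if (has_in_flight_packet()) {
            if (_pending_rows.size() < kMaxPending) {
                _pending_rows.push_back(row);       // defer instead of blocking
                return true;
            }
            wait_in_flight_packet();  // last resort: overflow buffer is full too
        }
        send_add_batch_req();   // _cur_batch becomes the new in-flight packet
        drain_pending_rows();   // move deferred rows into the fresh batch
                                // (sketch: assumes the fresh batch has room)
        return _cur_batch->add_row(row);
    }

private:
    bool has_in_flight_packet();
    void wait_in_flight_packet();
    void send_add_batch_req();
    void drain_pending_rows();

    static constexpr std::size_t kMaxPending = 1024;
    RowBatch* _cur_batch;
    std::vector<Row*> _pending_rows;
};
```

The trade-off is a bounded amount of extra memory per channel in exchange for hitting the forced wait far less often.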
https://github.com/apache/incubator-doris/pull/2956#issuecomment-596889947 As mentioned, here's the new design of OlapTableSink: add one sender thread to do non-blocking sending. Let me explain what non-blocking means here.
The original version of OlapTableSink can be abstracted as one queue (containing all batches across all node channels). One thread consumes the queue's items one by one. When it wants to send a batch of a NodeChannel which still has an in-flight packet (an RPC that hasn't returned its response yet), it must wait (join on the RPC). For example, a batch for index id=0 && node id=2 is denoted "B1(0-2)". The abstract queue looks like: B0(0-1), B1(0-2), B2(0-1), B3(1-4), ...
When we want to send B2(0-1), we must wait for B0's response, because B0 was sent on the same channel. But if we set B2(0-1) aside and send the next item, B3(1-4), it won't be blocked.
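To make the single-queue model concrete, here is a sketch of the blocking consumer described above; Batch, NodeChannel, and channel_of() are simplified stand-ins for the real types, not Doris code:

```cpp
#include <queue>

// Illustrative stand-ins for the real types (assumptions).
struct Batch { int seq; int index_id; int node_id; /* rows ... */ };
struct NodeChannel {
    bool has_in_flight_packet();
    void wait_in_flight_packet();       // joins the outstanding RPC
    void send_async(const Batch& b);    // becomes the new in-flight packet
};
NodeChannel* channel_of(int index_id, int node_id);

// The original model: one consumer thread drains one shared queue and
// blocks whenever the target channel still has an in-flight RPC.
void consume(std::queue<Batch>& q) {
    while (!q.empty()) {
        Batch b = q.front();
        q.pop();
        NodeChannel* ch = channel_of(b.index_id, b.node_id);
        if (ch->has_in_flight_packet()) {
            ch->wait_in_flight_packet();  // e.g. B2(0-1) must join on B0(0-1)
        }
        ch->send_async(b);
    }
}
```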
So I initially split the one queue into multiple queues (since abandoned); for details, see https://github.com/apache/incubator-doris/issues/2780#issuecomment-588156273. In that design the batches are partitioned into queue0, queue1, queue2, and so on (the per-queue diagrams are not reproduced here). Each queue needs its own thread to consume items, so the blocking time is amortized across multiple queues, but it is still a blocking approach.
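For illustration, the abandoned multi-queue variant could hash each batch to a fixed queue and give every queue its own consumer thread. This sketch reuses Batch and consume() from above; kNumQueues and queue_of() are assumptions:

```cpp
#include <queue>
#include <thread>
#include <vector>

constexpr int kNumQueues = 3;

// Pin each (index, node) pair to one queue so a blocked channel only
// stalls the batches that land in its own queue.
int queue_of(const Batch& b) {
    return (b.index_id * 31 + b.node_id) % kNumQueues;
}

void start_consumers(std::vector<std::queue<Batch>>& queues) {
    std::vector<std::thread> workers;
    for (auto& q : queues) {
        workers.emplace_back([&q] { consume(q); });  // consume() as above
    }
    for (auto& w : workers) w.join();
}
```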
The new design is non-blocking.
We can save batches in the NodeChannels (as pending batches) and try to send one pending batch from each channel in turn. If the current channel has an in-flight packet, we just skip it in this round (see the sketch below).
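A minimal sketch of that idea, reusing the Batch struct from the earlier sketch: each channel owns a deque of pending batches, and one sender thread polls all channels. PendingChannel and its member names are illustrative, not the PR's actual code:

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <vector>

// Illustrative channel for the non-blocking design: batches pile up in
// _pending and are drained opportunistically by the sender thread.
class PendingChannel {
public:
    void add_pending(Batch b) {
        std::lock_guard<std::mutex> l(_lock);
        _pending.push_back(std::move(b));
    }

    // Try to send one batch; return false (skip) instead of blocking.
    bool try_send_one() {
        std::lock_guard<std::mutex> l(_lock);
        if (_in_flight || _pending.empty()) return false;
        _in_flight = true;  // cleared by the RPC completion callback
        Batch b = std::move(_pending.front());
        _pending.pop_front();
        send_async(b);
        return true;
    }

private:
    void send_async(const Batch& b);  // issues the add_batch RPC

    std::mutex _lock;
    std::deque<Batch> _pending;
    bool _in_flight = false;
};

// One sender thread polls every channel; a channel with an in-flight
// packet is skipped this round instead of blocking the others.
// (A real implementation would back off when all channels are idle.)
void sender_loop(std::vector<PendingChannel*>& channels, std::atomic<bool>& done) {
    while (!done.load()) {
        for (auto* ch : channels) {
            ch->try_send_one();
        }
    }
}
```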
The implementation is coming soon.
There's no need to limit memory in the sink node. When we create a new RowBatch in a NodeChannel, we use the sink node's mem_tracker. As we know, the sink node and the scan node share the same ancestor mem_tracker (the query mem tracker, 2GB by default), so the memory limit is effectively a matter for the scan node.
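To illustrate the tracker hierarchy this paragraph relies on, here is a simplified, self-contained sketch; this MemTracker is a stand-in for Doris's actual class:

```cpp
#include <cstdint>
#include <string>

// Simplified mem tracker: consumption propagates to the parent, so all
// children effectively share the parent's limit.
class MemTracker {
public:
    MemTracker(std::int64_t limit, std::string label, MemTracker* parent = nullptr)
        : _limit(limit), _label(std::move(label)), _parent(parent) {}

    void consume(std::int64_t bytes) {
        _consumed += bytes;
        if (_parent) _parent->consume(bytes);  // charge the ancestor too
    }

    bool limit_exceeded() const { return _limit >= 0 && _consumed > _limit; }

private:
    std::int64_t _limit;        // -1 means no local limit
    std::int64_t _consumed = 0;
    std::string _label;
    MemTracker* _parent;
};

int main() {
    // The query tracker (2GB by default) is the common ancestor.
    MemTracker query(2LL * 1024 * 1024 * 1024, "query");
    MemTracker scan(-1, "scan_node", &query);  // no local limit
    MemTracker sink(-1, "sink_node", &query);  // no local limit

    // RowBatches created in NodeChannels charge the sink tracker, which
    // rolls up into the query tracker shared with the scan node.
    sink.consume(512LL * 1024 * 1024);
    scan.consume(1600LL * 1024 * 1024);
    return query.limit_exceeded() ? 1 : 0;  // exceeded: the scan must throttle
}
```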