question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Producer wait for broker response when send a message often shows RemotingTimeoutException

See original GitHub issue

As the title,when producer send messages to broker,shows the flowing WARN: RemotingTimeoutException: wait response on the channel timeout 3000ms.

We excluded network problems by network packet capture.

When producer send A message ,the broker need 4000ms to respond, or even more.This has exceeded the maximum timeout.

I’m sorry I can’t go to the email community to discuss it because I’m a Chinese developer, And the community is for me is 404ERROR.

NEED some help,PLEASE. Thank you Handsome guys.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:1
  • Comments:9 (4 by maintainers)

github_iconTop GitHub Comments

2reactions
RongtongJincommented, Nov 20, 2020

这几天在写CURD,这个问题还是存在,

store.log中[PAGECACHERT]的统计存在明显的刷盘超时:

2020-11-18 14:22:05 INFO StoreStatsService - [PAGECACHERT] TotalPut 50, PutMessageDistributeTime [<=0ms]:10 [0~10ms]:35 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:1 [1~2s]:1 [2~3s]:1 [3~4s]:0 [4~5s]:2 [5~10s]:0 [10s~]:0

在每次发生这个现象的时候都会出现这些异常:

我们的broker配置如下:

brokerClusterName = DefaultCluster
brokerName = broker-a
brokerId = 0
deleteWhen = 04
fileReservedTime = 48
brokerRole = ASYNC_MASTER
flushDiskType = ASYNC_FLUSH

看PAGECACHERT中有 2–3s:1、4–5s:2 这种明显的超时

2020-11-18 14:21:05 INFO StoreStatsService - [PAGECACHERT] TotalPut 52, PutMessageDistributeTime [<=0ms]:9 [0~10ms]:43 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:0 [1~2s]:0 [2~3s]:0 [3~4s]:0 [4~5s]:0 [5~10s]:0 [10s~]:0 
2020-11-18 14:21:14 WARN ReadSocketService - ha housekeeping, found this connection[/slaveIP:47894] expired, 20019
2020-11-18 14:21:14 INFO ReadSocketService - ReadSocketService service end
2020-11-18 14:21:14 INFO AcceptSocketService - HAService receive new connection, /slaveIP:33400
2020-11-18 14:21:14 INFO ReadSocketService - ReadSocketService service started
2020-11-18 14:21:14 INFO WriteSocketService - WriteSocketService service started
2020-11-18 14:21:14 INFO WriteSocketService - WriteSocketService service end
2020-11-18 14:21:19 INFO ReadSocketService - slave[/slaveIP:33400] request offset 2235247317
2020-11-18 14:21:19 INFO WriteSocketService - master transfer data from 2235247317 to slave[/slaveIP:33400], and slave request 2235247317
2020-11-18 14:21:19 WARN SendMessageThread_40 - putMessage not in lock eclipse time(ms)=4370, bodyLength=492
2020-11-18 14:21:19 WARN SendMessageThread_37 - putMessage not in lock eclipse time(ms)=4056, bodyLength=466
2020-11-18 14:21:19 WARN SendMessageThread_34 - putMessage not in lock eclipse time(ms)=2280, bodyLength=459
2020-11-18 14:21:19 WARN SendMessageThread_10 - putMessage not in lock eclipse time(ms)=1081, bodyLength=523
2020-11-18 14:21:19 WARN SendMessageThread_28 - putMessage not in lock eclipse time(ms)=719, bodyLength=444
2020-11-18 14:22:05 INFO StoreStatsService - [STORETPS] put_tps 0.8332777814812345 get_found_tps 0.8332777814812345 get_miss_tps 4.466368908739417 get_transfered_tps 0.8332777814812345
2020-11-18 14:22:05 INFO StoreStatsService - [PAGECACHERT] TotalPut 50, PutMessageDistributeTime [<=0ms]:10 [0~10ms]:35 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:1 [1~2s]:1 [2~3s]:1 [3~4s]:0 [4~5s]:2 [5~10s]:0 [10s~]:0 

GC的情况:

查看了GC的日志,发现GC是没有什么问题的,GC时间都很短而且在RemotingTimeoutException异常发生时没有出现GC

认真研读了Commitlog#putmessage(msg) 的逻辑,同时观察日志没有出现明显的异常,现在似乎是卡在这里了,是不是我们的内存配置,或者是主从交换的配置有什么问题,看到有一个issue:issues428的情况跟我一样,是有关于心跳的问题吗?

@RongtongJin 期待您的回答,万分感谢您!

那就看看是不是磁盘的问题,应该刷盘慢的时间段await, io util也很高,但看起来业务量不大。配置看起来设置比较简单,MQ这边可以调优一下线程数sendMessageThreadPoolNums,开启transientStorePoolEnable减少刷盘毛刺。可以执行下distribution/bin/os.sh脚本优化一下操作系统参数,特别是一些脏页writeback参数、IO请求调度算法等,另外尽量不要使用swap分区,需要通过工具来复现找出具体原因。

和issue428应该没有什么关系,刷盘慢是原因,timeout是结果。

1reaction
RongtongJincommented, Nov 14, 2020

putMessage not in lock eclipse

2020-11-10 18:47:22 WARN SendMessageThread_5 - putMessage not in lock eclipse time(ms)=4401, bodyLength=520 2020-11-10 18:47:22 WARN SendMessageThread_44 - putMessage not in lock eclipse time(ms)=4344, bodyLength=497 2020-11-10 18:47:22 WARN SendMessageThread_35 - putMessage not in lock eclipse time(ms)=2453, bodyLength=539 2020-11-10 18:47:22 WARN SendMessageThread_43 - putMessage not in lock eclipse time(ms)=3086, bodyLength=552

这些日志显示某些请求的putMessage时间超过了4s,但从当前日志无法判定是刷盘慢了,还是同步数据数据慢了,又或者是发生了full gc, 可以排查一下broker的gc日志,或者看看store.log中[PAGECACHERT]的统计是否有明显刷盘超时等

Read more comments on GitHub >

github_iconTop Results From Across the Web

[GitHub] [rocketmq] RongtongJin commented on ... - The Mail Archive
[GitHub] [rocketmq] RongtongJin commented on issue #2427: Producer wait for broker response when send a message often shows RemotingTimeoutException.
Read more >
Guidelines to handle Timeout exception for Kafka Producer?
To handle the timeout exceptions, the general practice is: Rule out broker side issues. make sure that the topic partitions are fully replicated ......
Read more >
How should I implement request response with JMS - ActiveMQ
The best way to implement request-response over JMS is to create a temporary queue and consumer per client on startup, set JMSReplyTo property...
Read more >
Frequently Asked Questions: Real Estate Brokers ... - Mass.gov
I processed my renewal application online with the payment but I have not received my wallet license in the mail. What is the...
Read more >
11 Best Real Estate Text Message Scripts for Agents Who Don ...
My name is [your first name] and I'm a Realtor with [your brokerage]. Is your home at [home's address] still for sale? [Wait...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found