* Pipe "deadlock" in Hammer, 0.94.5
@ 2017-01-10 11:47 许雪寒
  2017-01-10 13:11 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: 许雪寒 @ 2017-01-10 11:47 UTC (permalink / raw)
  To: ceph-devel

Hi, everyone.

Recently, we ran some experiments to test the stability of our Ceph cluster. We used the Hammer release, which is the version most widely deployed in our online clusters. One of the scenarios we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packets with a certain probability. Under these conditions, we sometimes see OSDs hit their suicide timeout.

We used gdb to examine the core dumped by Linux, and found that the thread that hit the suicide timeout is a peering thread that is trying to send a pg_notify message. The relevant ceph-osd log and gdb output are as follows:

Log file:
    -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
    -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after 150
    -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 -- 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ==== 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
     0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

GDB OUTPUT:
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0  0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598, no_lockdep=false) at common/Mutex.cc:96
#4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0, m=...) at common/Mutex.h:115
#5  0x0000000001c46446 in PipeConnection::try_get_pipe (this=0x7fd426453580, p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
#6  0x0000000001c05809 in SimpleMessenger::submit_message (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, dest_addr=..., dest_type=4, already_locked=false) at msg/simple/SimpleMessenger.cc:443
#7  0x0000000001c033fa in SimpleMessenger::_send_message (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.cc:136
#8  0x0000000001c467c7 in SimpleMessenger::send_message (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.h:139
#9  0x0000000001c466a1 in PipeConnection::send_message (this=0x7fd426453580, m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
#10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
#11 0x00000000013b45e7 in OSDService::send_incremental_map (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
#12 0x00000000013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
#13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000, notify_list=std::map with 7 elements = {...}, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:7246
#14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
#15 0x000000000140043e in OSD::process_peering_events (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at osd/OSD.cc:8539
#16 0x000000000141e094 in OSD::PeeringWQ::_process (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at osd/OSD.h:1601
#17 0x00000000014b94bf in ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070, p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
#18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0, wt=0x7fd4761db430) at common/WorkQueue.cc:128
#19 0x0000000001b313f7 in ThreadPool::WorkThread::entry (this=0x7fd4761db430) at common/WorkQueue.h:318
#20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) at common/Thread.cc:61
#21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at common/Thread.cc:45
#22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
#23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6

And the thread that is holding the mutex this thread is trying to acquire is a Pipe reader thread, which is trying to read a full message sent from another OSD:

(gdb) frame 2
#2  0x0000003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) info reg
rax            0xfffffffffffffe00            -512
rbx            0x7fd425538d00             140549136026880
rcx            0xffffffffffffffff               -1
rdx            0x10    16
rsi            0x80      128
rdi            0x7fd4264535a8              140549151864232
rbp            0x7fd440beb680            0x7fd440beb680
rsp            0x7fd440beb648            0x7fd440beb648
r8             0x7fd4264535a8               140549151864232
r9             0x3bc6 15302
r10            0x1       1
r11            0x246  582
r12            0x4       4
r13            0x7fd440bed9c0             140549596043712
r14            0x0       0
r15            0x3       3
rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55>
eflags         0x246               [ PF ZF IF ]
cs             0x33      51
ss             0x2b     43
ds             0x0       0
es             0x0       0
fs             0x0        0
gs             0x0        0
(gdb) p *(pthread_mutex_t*)0x7fd4264535a8
$4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 1, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 2}
(gdb) thread 50
[Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
#1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000, buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at msg/simple/Pipe.cc:2428
#2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000, buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
#3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at msg/simple/Pipe.cc:2492
#4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000, pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at msg/simple/Pipe.cc:2032
#5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at msg/simple/Pipe.cc:1581
#6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) at msg/simple/Pipe.h:50
#7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018) at common/Thread.cc:61
#8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at common/Thread.cc:45
#9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6

Because we are dropping a large fraction of IP packets, we think the recv() call is blocked for a long time, which leads to the suicide timeout.
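As far as we understand it, the suicide timeout itself works roughly as in the following simplified sketch (our own simplification for illustration only; the real logic lives in common/HeartbeatMap.cc): the worker thread is expected to keep pushing its deadlines forward, and once the hard deadline has passed, the heartbeat checker asserts.

    #include <cassert>
    #include <ctime>

    // Simplified stand-in for ceph::heartbeat_handle_d: both fields are absolute
    // deadlines that the worker thread is expected to keep resetting.
    struct heartbeat_handle {
      time_t timeout;          // soft deadline ("had timed out after 15")
      time_t suicide_timeout;  // hard deadline ("had suicide timed out after 150")
    };

    // Simplified stand-in for the _check() that appears in the log above.
    void check(const heartbeat_handle &h, time_t now) {
      if (h.suicide_timeout && now > h.suicide_timeout)
        assert(0 == "hit suicide timeout");   // the FAILED assert we see
    }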

We read the source code and found that the lock that directly caused this thread to hit the suicide timeout is Connection::lock. Why is this lock taken in both submit_message and read_message? Is it supposed to work this way?
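To make the question concrete, the contention we believe we are seeing boils down to the following self-contained sketch (the names are ours, purely for illustration; the real code paths are PipeConnection::try_get_pipe and Pipe::read_message in 0.94.5):

    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <thread>

    struct FakeConnection {
      std::mutex lock;   // plays the role of Connection::lock
    };

    // Stands in for Pipe::reader() -> read_message() -> tcp_read_nonblocking():
    // the lock is held while data is read off the socket.
    void reader(FakeConnection &con) {
      std::lock_guard<std::mutex> l(con.lock);
      std::this_thread::sleep_for(std::chrono::seconds(10));  // pretend recv() stalls
    }

    // Stands in for the peering thread: submit_message() -> try_get_pipe()
    // needs the very same lock just to look up the Pipe.
    void peering(FakeConnection &con) {
      std::lock_guard<std::mutex> l(con.lock);  // blocks until the reader lets go
      std::cout << "pg_notify submitted" << std::endl;
    }

    int main() {
      FakeConnection con;
      std::thread r(reader, std::ref(con));
      std::thread p(peering, std::ref(con));
      r.join();
      p.join();
      return 0;
    }

If the stall in the reader lasts longer than the suicide grace, the peering thread never gets a chance to reset its heartbeat, which matches the two backtraces above.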

Please help me, thank you☺




* Re: Pipe "deadlock" in Hammer, 0.94.5
  2017-01-10 11:47 Pipe "deadlock" in Hammer, 0.94.5 许雪寒
@ 2017-01-10 13:11 ` Sage Weil
  2017-01-11  7:13   ` Re: " 许雪寒
  2017-01-11 10:29   ` 许雪寒
  0 siblings, 2 replies; 6+ messages in thread
From: Sage Weil @ 2017-01-10 13:11 UTC (permalink / raw)
  To: 许雪寒; +Cc: ceph-devel


On Tue, 10 Jan 2017, 许雪寒 wrote:
> Hi, everyone.
> 
> Recently, we did some experiment to test the stability of the ceph cluster. We used Hammer version which is the mostly used version of online cluster. One of the scenarios that we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packet under some probability. And sometimes, we can see that some OSD suicide themselves. 
> 
> We used gdb to debug the core dumped by linux. We found that the thread that hit the suicide time threshold is a peering thread who is trying to send a pg_notify message, the ceph-osd log file and gdb output is as follows:
> 
> Log file:
>     -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
>     -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after 150
>     -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 -- 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ==== 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>      0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> 
> GDB OUTPUT:
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0  0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598, no_lockdep=false) at common/Mutex.cc:96
> #4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0, m=...) at common/Mutex.h:115
> #5  0x0000000001c46446 in PipeConnection::try_get_pipe (this=0x7fd426453580, p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
> #6  0x0000000001c05809 in SimpleMessenger::submit_message (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, dest_addr=..., dest_type=4, already_locked=false) at msg/simple/SimpleMessenger.cc:443
> #7  0x0000000001c033fa in SimpleMessenger::_send_message (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.cc:136
> #8  0x0000000001c467c7 in SimpleMessenger::send_message (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.h:139
> #9  0x0000000001c466a1 in PipeConnection::send_message (this=0x7fd426453580, m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
> #10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
> #11 0x00000000013b45e7 in OSDService::send_incremental_map (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
> #12 0x00000000013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
> #13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000, notify_list=std::map with 7 elements = {...}, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:7246
> #14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
> #15 0x000000000140043e in OSD::process_peering_events (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at osd/OSD.cc:8539
> #16 0x000000000141e094 in OSD::PeeringWQ::_process (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at osd/OSD.h:1601
> #17 0x00000000014b94bf in ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070, p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
> #18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0, wt=0x7fd4761db430) at common/WorkQueue.cc:128
> #19 0x0000000001b313f7 in ThreadPool::WorkThread::entry (this=0x7fd4761db430) at common/WorkQueue.h:318
> #20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) at common/Thread.cc:61
> #21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at common/Thread.cc:45
> #22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> And the thread that is holding the mutex lock which this thread is trying to get is a pipe reader_thread who is trying to read a full message that sent from another OSD:
> 
> (gdb) frame 2
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> (gdb) info reg
> rax            0xfffffffffffffe00            -512
> rbx            0x7fd425538d00             140549136026880
> rcx            0xffffffffffffffff               -1
> rdx            0x10    16
> rsi            0x80      128
> rdi            0x7fd4264535a8              140549151864232
> rbp            0x7fd440beb680            0x7fd440beb680
> rsp            0x7fd440beb648            0x7fd440beb648
> r8             0x7fd4264535a8               140549151864232
> r9             0x3bc6 15302
> r10            0x1       1
> r11            0x246  582
> r12            0x4       4
> r13            0x7fd440bed9c0             140549596043712
> r14            0x0       0
> r15            0x3       3
> rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55>
> eflags         0x246               [ PF ZF IF ]
> cs             0x33      51
> ss             0x2b     43
> ds             0x0       0
> es             0x0       0
> fs             0x0        0
> gs             0x0        0
> (gdb) p *(pthread_mutex_t*)0x7fd4264535a8
> $4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 1, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 2}
> (gdb) thread 50
> [Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> #1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000, buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at msg/simple/Pipe.cc:2428
> #2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000, buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
> #3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at msg/simple/Pipe.cc:2492

It's supposed to be a non-blocking read, so in principle this shouldn't 
block and hold the lock.
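Roughly, the read on that path is expected to behave like the sketch below (just an illustration of the intended behaviour, not the actual Pipe.cc code):

    #include <cerrno>
    #include <sys/socket.h>
    #include <sys/types.h>

    // fd is assumed to be a connected TCP socket.
    void try_read_once(int fd) {
      char buf[4096];
      ssize_t got = ::recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
      if (got < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
          // nothing available right now: go back to poll() and retry,
          // instead of sleeping inside recv()
        } else {
          // a real socket error
        }
      }
      // got == 0 means the peer closed; got > 0 is the number of bytes read
    }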

I suspect what you're seeing is a variation of #14120, fixed by 
63e44e32974c9bae17bb1bfd4261dcb024ad845c.

How reproducible is this?  We can push a build that includes the fix, but 
it would be nice to have some confirmation that it is the right one since 
so far this has been a very rare case.

Thanks!
sage


> #4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000, pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at msg/simple/Pipe.cc:2032
> #5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at msg/simple/Pipe.cc:1581
> #6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) at msg/simple/Pipe.h:50
> #7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018) at common/Thread.cc:61
> #8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at common/Thread.cc:45
> #9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> Because we are dropping a large amount of ip packets, so we think the “recv” call is blocked for a long time, which led to the suicide timeout.
> 
> We read the source code, and found that the lock that directly made the thread hit suicide timeout is Connection::lock. Why is this lock used in both the submit_message and read_message?  Is it supposed to be this way?
> 
> Please help me, thank you☺
> 
> 


* Re: Pipe "deadlock" in Hammer, 0.94.5
  2017-01-10 13:11 ` Sage Weil
@ 2017-01-11  7:13   ` 许雪寒
  2017-01-11 10:29   ` 许雪寒
  1 sibling, 0 replies; 6+ messages in thread
From: 许雪寒 @ 2017-01-11  7:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Thanks for your reply, sir:-)

Actually, this case is not very rare in our tests. When iptables drops IP packets with a probability of around 98%~99%, it occurs roughly once in every three test runs. 
I checked #14120 as you recommended; however, it doesn't seem to be our problem, because, as http://tracker.ceph.com/issues/18184 says, #14120 is caused by a commit that changed Pipe::tcp_read_wait() to return -errno instead of "-1", and that commit is not present in the hammer version we tested (0.94.5).
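Just to make that distinction concrete, the kind of mismatch such a return-convention change introduces looks roughly like the following made-up example (our own names, not the real code):

    #include <cerrno>

    int read_wait_old() { return -1; }       // old convention: plain -1 on failure
    int read_wait_new() { return -EAGAIN; }  // new convention: -errno (e.g. -11)

    void caller_written_for_the_old_convention() {
      int r = read_wait_new();
      if (r == -1) {
        // old-style error handling: once -errno values are returned, this branch
        // is never taken, so the caller behaves as if the wait had succeeded
      }
    }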
And I agree that the recv() invocation passes the MSG_DONTWAIT flag, so it shouldn't block; however, every time we hit this problem, it is always the Pipe reader thread, in the middle of a recv() call, that holds the lock when the suicide happens, which is really confusing.

Please help us, thank you:-)



-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: 2017-01-10 21:12
To: 许雪寒
Cc: ceph-devel@vger.kernel.org
Subject: Re: Pipe "deadlock" in Hammer, 0.94.5

On Tue, 10 Jan 2017, 许雪寒 wrote:
> Hi, everyone.
> 
> Recently, we did some experiment to test the stability of the ceph cluster. We used Hammer version which is the mostly used version of online cluster. One of the scenarios that we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packet under some probability. And sometimes, we can see that some OSD suicide themselves. 
> 
> We used gdb to debug the core dumped by linux. We found that the thread that hit the suicide time threshold is a peering thread who is trying to send a pg_notify message, the ceph-osd log file and gdb output is as follows:
> 
> Log file:
>     -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map 
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
>     -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map 
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out 
> after 150
>     -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 -- 
> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 
> ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ==== 
> 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>      0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg 
> common/HeartbeatMap.cc: In function 'bool 
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, 
> time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> 
> GDB OUTPUT:
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0  
> 0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80e334 in __lll_lock_wait () from 
> /lib64/libpthread.so.0
> #1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from 
> /lib64/libpthread.so.0
> #3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598, 
> no_lockdep=false) at common/Mutex.cc:96
> #4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0, 
> m=...) at common/Mutex.h:115
> #5  0x0000000001c46446 in PipeConnection::try_get_pipe 
> (this=0x7fd426453580, p=0x7fd440beb908) at 
> msg/simple/PipeConnection.cc:38
> #6  0x0000000001c05809 in SimpleMessenger::submit_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, 
> dest_addr=..., dest_type=4, already_locked=false) at 
> msg/simple/SimpleMessenger.cc:443
> #7  0x0000000001c033fa in SimpleMessenger::_send_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at 
> msg/simple/SimpleMessenger.cc:136
> #8  0x0000000001c467c7 in SimpleMessenger::send_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at 
> msg/simple/SimpleMessenger.h:139
> #9  0x0000000001c466a1 in PipeConnection::send_message 
> (this=0x7fd426453580, m=0x7fd425538d00) at 
> msg/simple/PipeConnection.cc:78
> #10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
> #11 0x00000000013b45e7 in OSDService::send_incremental_map 
> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at 
> osd/OSD.cc:1087
> #12 0x00000000013b215f in OSDService::share_map_peer 
> (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580, 
> map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
> #13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000, 
> notify_list=std::map with 7 elements = {...}, 
> curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at 
> osd/OSD.cc:7246
> #14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, 
> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 
> 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
> #15 0x000000000140043e in OSD::process_peering_events 
> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at 
> osd/OSD.cc:8539
> #16 0x000000000141e094 in OSD::PeeringWQ::_process 
> (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at 
> osd/OSD.h:1601
> #17 0x00000000014b94bf in 
> ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070, 
> p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
> #18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0, 
> wt=0x7fd4761db430) at common/WorkQueue.cc:128
> #19 0x0000000001b313f7 in ThreadPool::WorkThread::entry 
> (this=0x7fd4761db430) at common/WorkQueue.h:318
> #20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) 
> at common/Thread.cc:61
> #21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at 
> common/Thread.cc:45
> #22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> And the thread that is holding the mutex lock which this thread is trying to get is a pipe reader_thread who is trying to read a full message that sent from another OSD:
> 
> (gdb) frame 2
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from 
> /lib64/libpthread.so.0
> (gdb) info reg
> rax            0xfffffffffffffe00            -512 rbx            
> 0x7fd425538d00             140549136026880 rcx            
> 0xffffffffffffffff               -1 rdx            0x10    16 rsi            
> 0x80      128 rdi            0x7fd4264535a8              
> 140549151864232 rbp            0x7fd440beb680            
> 0x7fd440beb680 rsp            0x7fd440beb648            0x7fd440beb648
> r8             0x7fd4264535a8               140549151864232
> r9             0x3bc6 15302
> r10            0x1       1
> r11            0x246  582
> r12            0x4       4
> r13            0x7fd440bed9c0             140549596043712
> r14            0x0       0
> r15            0x3       3
> rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55> 
> eflags         0x246               [ PF ZF IF ] cs             0x33      
> 51 ss             0x2b     43 ds             0x0       0 es             
> 0x0       0 fs             0x0        0 gs             0x0        0
> (gdb) p *(pthread_mutex_t*)0x7fd4264535a8
> $4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 1, 
> __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
> __size = 
> "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002"
> , '\000' <repeats 22 times>, __align = 2}
> (gdb) thread 50
> [Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0  
> 0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> #1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000, 
> buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at 
> msg/simple/Pipe.cc:2428
> #2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000, 
> buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
> #3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking 
> (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at 
> msg/simple/Pipe.cc:2492

It's supposed to be a non-blocking read, so in principle this shouldn't block and hold the lock.

I suspect what you're seeing is a variation of #14120, fixed by 63e44e32974c9bae17bb1bfd4261dcb024ad845c.

How reproducible is this?  We can push a build that includes the fix, but it would be nice to have some confirmation that it is the right one since so far this has been a very rare case.

Thanks!
sage


> #4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000, 
> pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at 
> msg/simple/Pipe.cc:2032
> #5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at 
> msg/simple/Pipe.cc:1581
> #6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) at 
> msg/simple/Pipe.h:50
> #7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018) 
> at common/Thread.cc:61
> #8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at 
> common/Thread.cc:45
> #9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> Because we are dropping a large amount of ip packets, so we think the “recv” call is blocked for a long time, which led to the suicide timeout.
> 
> We read the source code, and found that the lock that directly made the thread hit suicide timeout is Connection::lock. Why is this lock used in both the submit_message and read_message?  Is it supposed to be this way?
> 
> Please help me, thank you☺
> 
> 


* Re: Pipe "deadlock" in Hammer, 0.94.5
  2017-01-10 13:11 ` Sage Weil
  2017-01-11  7:13   ` Re: " 许雪寒
@ 2017-01-11 10:29   ` 许雪寒
  2017-01-18 23:17     ` Padmanabh Ratnakar
  1 sibling, 1 reply; 6+ messages in thread
From: 许雪寒 @ 2017-01-11 10:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

In our tests, when one machine suffers heavy packet loss, OSDs on other machines can be brought down, and sometimes more than one OSD goes down, because they may be trying to read a message coming from that machine. So we are really concerned; please help us. Thanks


-----Original Message-----
From: 许雪寒
Sent: 2017-01-11 15:13
To: 'Sage Weil'
Cc: ceph-devel@vger.kernel.org
Subject: Re: Pipe "deadlock" in Hammer, 0.94.5

Thanks for your reply, sir:-)

Actually, this case is not very rare in our tests. When iptables drops IP packets with a probability of around 98%~99%, it occurs roughly once in every three test runs. 
I checked #14120 as you recommended; however, it doesn't seem to be our problem, because, as http://tracker.ceph.com/issues/18184 says, #14120 is caused by a commit that changed Pipe::tcp_read_wait() to return -errno instead of "-1", and that commit is not present in the hammer version we tested (0.94.5).
And I agree that the recv() invocation passes the MSG_DONTWAIT flag, so it shouldn't block; however, every time we hit this problem, it is always the Pipe reader thread, in the middle of a recv() call, that holds the lock when the suicide happens, which is really confusing.

Please help us, thank you:-)



-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: 2017-01-10 21:12
To: 许雪寒
Cc: ceph-devel@vger.kernel.org
Subject: Re: Pipe "deadlock" in Hammer, 0.94.5

On Tue, 10 Jan 2017, 许雪寒 wrote:
> Hi, everyone.
> 
> Recently, we did some experiment to test the stability of the ceph cluster. We used Hammer version which is the mostly used version of online cluster. One of the scenarios that we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packet under some probability. And sometimes, we can see that some OSD suicide themselves. 
> 
> We used gdb to debug the core dumped by linux. We found that the thread that hit the suicide time threshold is a peering thread who is trying to send a pg_notify message, the ceph-osd log file and gdb output is as follows:
> 
> Log file:
>     -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map 
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
>     -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map 
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out 
> after 150
>     -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 --
> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 
> ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ====
> 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>      0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg
> common/HeartbeatMap.cc: In function 'bool 
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, 
> time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> 
> GDB OUTPUT:
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0
> 0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80e334 in __lll_lock_wait () from
> /lib64/libpthread.so.0
> #1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
> /lib64/libpthread.so.0
> #3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598,
> no_lockdep=false) at common/Mutex.cc:96
> #4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0,
> m=...) at common/Mutex.h:115
> #5  0x0000000001c46446 in PipeConnection::try_get_pipe 
> (this=0x7fd426453580, p=0x7fd440beb908) at
> msg/simple/PipeConnection.cc:38
> #6  0x0000000001c05809 in SimpleMessenger::submit_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, 
> dest_addr=..., dest_type=4, already_locked=false) at
> msg/simple/SimpleMessenger.cc:443
> #7  0x0000000001c033fa in SimpleMessenger::_send_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.cc:136
> #8  0x0000000001c467c7 in SimpleMessenger::send_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.h:139
> #9  0x0000000001c466a1 in PipeConnection::send_message 
> (this=0x7fd426453580, m=0x7fd425538d00) at
> msg/simple/PipeConnection.cc:78
> #10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
> #11 0x00000000013b45e7 in OSDService::send_incremental_map 
> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
> osd/OSD.cc:1087
> #12 0x00000000013b215f in OSDService::share_map_peer 
> (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580, 
> map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
> #13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000, 
> notify_list=std::map with 7 elements = {...}, 
> curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
> osd/OSD.cc:7246
> #14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, 
> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 
> 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
> #15 0x000000000140043e in OSD::process_peering_events 
> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at
> osd/OSD.cc:8539
> #16 0x000000000141e094 in OSD::PeeringWQ::_process 
> (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at
> osd/OSD.h:1601
> #17 0x00000000014b94bf in
> ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070, 
> p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
> #18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0,
> wt=0x7fd4761db430) at common/WorkQueue.cc:128
> #19 0x0000000001b313f7 in ThreadPool::WorkThread::entry
> (this=0x7fd4761db430) at common/WorkQueue.h:318
> #20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) 
> at common/Thread.cc:61
> #21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at
> common/Thread.cc:45
> #22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> And the thread that is holding the mutex lock which this thread is trying to get is a pipe reader_thread who is trying to read a full message that sent from another OSD:
> 
> (gdb) frame 2
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
> /lib64/libpthread.so.0
> (gdb) info reg
> rax            0xfffffffffffffe00            -512 rbx
> 0x7fd425538d00             140549136026880 rcx 0xffffffffffffffff               
> -1 rdx            0x10    16 rsi
> 0x80      128 rdi            0x7fd4264535a8
> 140549151864232 rbp            0x7fd440beb680
> 0x7fd440beb680 rsp            0x7fd440beb648            0x7fd440beb648
> r8             0x7fd4264535a8               140549151864232
> r9             0x3bc6 15302
> r10            0x1       1
> r11            0x246  582
> r12            0x4       4
> r13            0x7fd440bed9c0             140549596043712
> r14            0x0       0
> r15            0x3       3
> rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55> 
> eflags         0x246               [ PF ZF IF ] cs             0x33
> 51 ss             0x2b     43 ds             0x0       0 es
> 0x0       0 fs             0x0        0 gs             0x0        0
> (gdb) p *(pthread_mutex_t*)0x7fd4264535a8
> $4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 1, 
> __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
> __size = 
> "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002"
> , '\000' <repeats 22 times>, __align = 2}
> (gdb) thread 50
> [Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0 
> 0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> #1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000,
> buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at
> msg/simple/Pipe.cc:2428
> #2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000,
> buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
> #3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking 
> (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at
> msg/simple/Pipe.cc:2492

It's supposed to be a non-blocking read, so in principle this shouldn't block and hold the lock.

I suspect what you're seeing is a variation of #14120, fixed by 63e44e32974c9bae17bb1bfd4261dcb024ad845c.

How reproducible is this?  We can push a build that includes the fix, but it would be nice to have some confirmation that it is the right one since so far this has been a very rare case.

Thanks!
sage


> #4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000, 
> pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at
> msg/simple/Pipe.cc:2032
> #5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at
> msg/simple/Pipe.cc:1581
> #6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) at
> msg/simple/Pipe.h:50
> #7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018) 
> at common/Thread.cc:61
> #8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at
> common/Thread.cc:45
> #9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> Because we are dropping a large amount of ip packets, so we think the “recv” call is blocked for a long time, which led to the suicide timeout.
> 
> We read the source code, and found that the lock that directly made the thread hit suicide timeout is Connection::lock. Why is this lock used in both the submit_message and read_message?  Is it supposed to be this way?
> 
> Please help me, thank you☺
> 
> 


* Re: Re: Pipe "deadlock" in Hammer, 0.94.5
  2017-01-11 10:29   ` 许雪寒
@ 2017-01-18 23:17     ` Padmanabh Ratnakar
  2017-03-15  9:39       ` Re: " 许雪寒
  0 siblings, 1 reply; 6+ messages in thread
From: Padmanabh Ratnakar @ 2017-01-18 23:17 UTC (permalink / raw)
  To: 许雪寒; +Cc: Sage Weil, ceph-devel

Hi Sage,
       Fix 63e44e32974c9bae17bb1bfd4261dcb024ad845c has not been pulled into v10.2.5.
Is there any reason for this? Is it recommended to pull this fix into our deployments?

Thanks,
Padmanabh

On Wed, Jan 11, 2017 at 3:59 PM, 许雪寒 <xuxuehan@360.cn> wrote:
> In our test, when one machine is under heavy packet loss, OSDs on other machines can be brought down and sometimes more than one OSD go down, because they may try to read a message that comes from that machine. So we are really concerned, please help us. Thanks
>
>
> -----Original Message-----
> From: 许雪寒
> Sent: 2017-01-11 15:13
> To: 'Sage Weil'
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Pipe "deadlock" in Hammer, 0.94.5
>
> Thanks for your reply, sir:-)
>
> Actually, this case is not very rare in our test. When iptables drop ip packets with the probability around 98%~99%, this case occurs about every three times our test runs.
> I checked #14120 as you recommended, however, it doesn't seem to be our problem. Because, as http://tracker.ceph.com/issues/18184 says, #14120 is caused by a commit that changed Pipe::tcp_read_wait() to return -errno instead of "-1" and this commit is not applied in our tested hammer verion(0.94.5).
> And I truly agree with you in that the invocation of "recv" function passes MSG_DONTWAIT flag, so it shouldn't block, however, every time we encounter this problem it's always the Pipe::reader_thread who's calling recv that hold the lock when suicide happens, this is really confusing.
>
> Please help us, thank you:-)
>
>
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: 2017-01-10 21:12
> To: 许雪寒
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Pipe "deadlock" in Hammer, 0.94.5
>
> On Tue, 10 Jan 2017, 许雪寒 wrote:
>> Hi, everyone.
>>
>> Recently, we did some experiment to test the stability of the ceph cluster. We used Hammer version which is the mostly used version of online cluster. One of the scenarios that we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packet under some probability. And sometimes, we can see that some OSD suicide themselves.
>>
>> We used gdb to debug the core dumped by linux. We found that the thread that hit the suicide time threshold is a peering thread who is trying to send a pg_notify message, the ceph-osd log file and gdb output is as follows:
>>
>> Log file:
>>     -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map
>> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
>>     -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map
>> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out
>> after 150
>>     -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 --
>> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163
>> ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ====
>> 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>>      0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg
>> common/HeartbeatMap.cc: In function 'bool
>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>> time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
>> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>
>> GDB OUTPUT:
>> (gdb) thread 8
>> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0
>> 0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x0000003c5d80e334 in __lll_lock_wait () from
>> /lib64/libpthread.so.0
>> #1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
>> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> #3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598,
>> no_lockdep=false) at common/Mutex.cc:96
>> #4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0,
>> m=...) at common/Mutex.h:115
>> #5  0x0000000001c46446 in PipeConnection::try_get_pipe
>> (this=0x7fd426453580, p=0x7fd440beb908) at
>> msg/simple/PipeConnection.cc:38
>> #6  0x0000000001c05809 in SimpleMessenger::submit_message
>> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580,
>> dest_addr=..., dest_type=4, already_locked=false) at
>> msg/simple/SimpleMessenger.cc:443
>> #7  0x0000000001c033fa in SimpleMessenger::_send_message
>> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
>> msg/simple/SimpleMessenger.cc:136
>> #8  0x0000000001c467c7 in SimpleMessenger::send_message
>> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
>> msg/simple/SimpleMessenger.h:139
>> #9  0x0000000001c466a1 in PipeConnection::send_message
>> (this=0x7fd426453580, m=0x7fd425538d00) at
>> msg/simple/PipeConnection.cc:78
>> #10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8,
>> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
>> #11 0x00000000013b45e7 in OSDService::send_incremental_map
>> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580,
>> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
>> osd/OSD.cc:1087
>> #12 0x00000000013b215f in OSDService::share_map_peer
>> (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580,
>> map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
>> #13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000,
>> notify_list=std::map with 7 elements = {...},
>> curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
>> osd/OSD.cc:7246
>> #14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000,
>> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49)
>> 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
>> #15 0x000000000140043e in OSD::process_peering_events
>> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at
>> osd/OSD.cc:8539
>> #16 0x000000000141e094 in OSD::PeeringWQ::_process
>> (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at
>> osd/OSD.h:1601
>> #17 0x00000000014b94bf in
>> ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070,
>> p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
>> #18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0,
>> wt=0x7fd4761db430) at common/WorkQueue.cc:128
>> #19 0x0000000001b313f7 in ThreadPool::WorkThread::entry
>> (this=0x7fd4761db430) at common/WorkQueue.h:318
>> #20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430)
>> at common/Thread.cc:61
>> #21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at
>> common/Thread.cc:45
>> #22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
>> #23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
>>
>> And the thread that is holding the mutex lock which this thread is trying to get is a pipe reader_thread who is trying to read a full message that sent from another OSD:
>>
>> (gdb) frame 2
>> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> (gdb) info reg
>> rax            0xfffffffffffffe00            -512 rbx
>> 0x7fd425538d00             140549136026880 rcx 0xffffffffffffffff
>> -1 rdx            0x10    16 rsi
>> 0x80      128 rdi            0x7fd4264535a8
>> 140549151864232 rbp            0x7fd440beb680
>> 0x7fd440beb680 rsp            0x7fd440beb648            0x7fd440beb648
>> r8             0x7fd4264535a8               140549151864232
>> r9             0x3bc6 15302
>> r10            0x1       1
>> r11            0x246  582
>> r12            0x4       4
>> r13            0x7fd440bed9c0             140549596043712
>> r14            0x0       0
>> r15            0x3       3
>> rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55>
>> eflags         0x246               [ PF ZF IF ] cs             0x33
>> 51 ss             0x2b     43 ds             0x0       0 es
>> 0x0       0 fs             0x0        0 gs             0x0        0
>> (gdb) p *(pthread_mutex_t*)0x7fd4264535a8
>> $4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 1,
>> __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
>> __size =
>> "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002"
>> , '\000' <repeats 22 times>, __align = 2}
>> (gdb) thread 50
>> [Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0
>> 0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
>> #1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000,
>> buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at
>> msg/simple/Pipe.cc:2428
>> #2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000,
>> buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
>> #3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking
>> (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at
>> msg/simple/Pipe.cc:2492
>
> It's supposed to be a non-blocking read, so in principle this shouldn't block and hold the lock.
>
> I suspect what you're seeing is a variation of #14120, fixed by 63e44e32974c9bae17bb1bfd4261dcb024ad845c.
>
> How reproducible is this?  We can push a build that includes the fix, but it would be nice to have some confirmation that it is the right one since so far this has been a very rare case.
>
> Thanks!
> sage
>
>
>> #4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000,
>> pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at
>> msg/simple/Pipe.cc:2032
>> #5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at
>> msg/simple/Pipe.cc:1581
>> #6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) at
>> msg/simple/Pipe.h:50
>> #7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018)
>> at common/Thread.cc:61
>> #8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at
>> common/Thread.cc:45
>> #9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
>> #10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
>>
>> Because we are dropping a large amount of ip packets, so we think the “recv” call is blocked for a long time, which led to the suicide timeout.
>>
>> We read the source code, and found that the lock that directly made the thread hit suicide timeout is Connection::lock. Why is this lock used in both the submit_message and read_message?  Is it supposed to be this way?
>>
>> Please help me, thank you☺
>>
>>


* Re: Re: Pipe "deadlock" in Hammer, 0.94.5
  2017-01-18 23:17     ` Padmanabh Ratnakar
@ 2017-03-15  9:39       ` 许雪寒
  0 siblings, 0 replies; 6+ messages in thread
From: 许雪寒 @ 2017-03-15  9:39 UTC (permalink / raw)
  To: Padmanabh Ratnakar; +Cc: Sage Weil, ceph-devel

Hi, thanks for your help.

I checked this fix, and it should be the one we need. However, I saw that it has only been backported as far back as v11.0.0, and we are using 0.94.5. Is it safe for us to simply cherry-pick this commit into our code?

Thank you:-)

-----Original Message-----
From: Padmanabh Ratnakar [mailto:padmanabh.ratnakar@snapdeal.com]
Sent: 2017-01-19 7:18
To: 许雪寒
Cc: Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: Re: Pipe "deadlock" in Hammer, 0.94.5

Hi Sage,
       Fix 63e44e32974c9bae17bb1bfd4261dcb024ad845c is not pulled in v10.2.5.
Any reasons for this. Is it recommended to pull this fix in our deployments.

Thanks,
Padmanabh

On Wed, Jan 11, 2017 at 3:59 PM, 许雪寒 <xuxuehan@360.cn> wrote:
> In our test, when one machine is under heavy packet loss, OSDs on 
> other machines can be brought down and sometimes more than one OSD go 
> down, because they may try to read a message that comes from that 
> machine. So we are really concerned, please help us. Thanks
>
>
> -----Original Message-----
> From: 许雪寒
> Sent: 2017-01-11 15:13
> To: 'Sage Weil'
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Pipe "deadlock" in Hammer, 0.94.5
>
> Thanks for your reply, sir:-)
>
> Actually, this case is not very rare in our test. When iptables drop ip packets with the probability around 98%~99%, this case occurs about every three times our test runs.
> I checked #14120 as you recommended, however, it doesn't seem to be our problem. Because, as http://tracker.ceph.com/issues/18184 says, #14120 is caused by a commit that changed Pipe::tcp_read_wait() to return -errno instead of "-1" and this commit is not applied in our tested hammer verion(0.94.5).
> And I truly agree with you in that the invocation of "recv" function passes MSG_DONTWAIT flag, so it shouldn't block, however, every time we encounter this problem it's always the Pipe::reader_thread who's calling recv that hold the lock when suicide happens, this is really confusing.
>
> Please help us, thank you:-)
>
>
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: 2017-01-10 21:12
> To: 许雪寒
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Pipe "deadlock" in Hammer, 0.94.5
>
> On Tue, 10 Jan 2017, 许雪寒 wrote:
>> Hi, everyone.
>>
>> Recently, we did some experiment to test the stability of the ceph cluster. We used Hammer version which is the mostly used version of online cluster. One of the scenarios that we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packet under some probability. And sometimes, we can see that some OSD suicide themselves.
>>
>> We used gdb to debug the core dumped by linux. We found that the thread that hit the suicide time threshold is a peering thread who is trying to send a pg_notify message, the ceph-osd log file and gdb output is as follows:
>>
>> Log file:
>>     -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map 
>> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
>>     -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map 
>> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out 
>> after 150
>>     -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 --
>> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 
>> ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ====
>> 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>>      0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg
>> common/HeartbeatMap.cc: In function 'bool 
>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, 
>> time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
>> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>
>> GDB OUTPUT:
>> (gdb) thread 8
>> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0
>> 0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x0000003c5d80e334 in __lll_lock_wait () from
>> /lib64/libpthread.so.0
>> #1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
>> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> #3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598,
>> no_lockdep=false) at common/Mutex.cc:96
>> #4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0,
>> m=...) at common/Mutex.h:115
>> #5  0x0000000001c46446 in PipeConnection::try_get_pipe 
>> (this=0x7fd426453580, p=0x7fd440beb908) at
>> msg/simple/PipeConnection.cc:38
>> #6  0x0000000001c05809 in SimpleMessenger::submit_message 
>> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, 
>> dest_addr=..., dest_type=4, already_locked=false) at
>> msg/simple/SimpleMessenger.cc:443
>> #7  0x0000000001c033fa in SimpleMessenger::_send_message 
>> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
>> msg/simple/SimpleMessenger.cc:136
>> #8  0x0000000001c467c7 in SimpleMessenger::send_message 
>> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
>> msg/simple/SimpleMessenger.h:139
>> #9  0x0000000001c466a1 in PipeConnection::send_message 
>> (this=0x7fd426453580, m=0x7fd425538d00) at
>> msg/simple/PipeConnection.cc:78
>> #10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
>> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
>> #11 0x00000000013b45e7 in OSDService::send_incremental_map 
>> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
>> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
>> osd/OSD.cc:1087
>> #12 0x00000000013b215f in OSDService::share_map_peer 
>> (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580, 
>> map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
>> #13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000, 
>> notify_list=std::map with 7 elements = {...}, 
>> curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
>> osd/OSD.cc:7246
>> #14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, 
>> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 
>> 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
>> #15 0x000000000140043e in OSD::process_peering_events 
>> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at
>> osd/OSD.cc:8539
>> #16 0x000000000141e094 in OSD::PeeringWQ::_process 
>> (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at
>> osd/OSD.h:1601
>> #17 0x00000000014b94bf in
>> ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070, 
>> p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
>> #18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0,
>> wt=0x7fd4761db430) at common/WorkQueue.cc:128
>> #19 0x0000000001b313f7 in ThreadPool::WorkThread::entry
>> (this=0x7fd4761db430) at common/WorkQueue.h:318
>> #20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) 
>> at common/Thread.cc:61
>> #21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at
>> common/Thread.cc:45
>> #22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
>> #23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
>>
>> And the thread that is holding the mutex lock which this thread is trying to get is a pipe reader_thread who is trying to read a full message that sent from another OSD:
>>
>> (gdb) frame 2
>> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> (gdb) info reg
>> rax            0xfffffffffffffe00            -512 rbx
>> 0x7fd425538d00             140549136026880 rcx 0xffffffffffffffff
>> -1 rdx            0x10    16 rsi
>> 0x80      128 rdi            0x7fd4264535a8
>> 140549151864232 rbp            0x7fd440beb680
>> 0x7fd440beb680 rsp            0x7fd440beb648            0x7fd440beb648
>> r8             0x7fd4264535a8               140549151864232
>> r9             0x3bc6 15302
>> r10            0x1       1
>> r11            0x246  582
>> r12            0x4       4
>> r13            0x7fd440bed9c0             140549596043712
>> r14            0x0       0
>> r15            0x3       3
>> rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55>
>> eflags         0x246               [ PF ZF IF ] cs             0x33
>> 51 ss             0x2b     43 ds             0x0       0 es
>> 0x0       0 fs             0x0        0 gs             0x0        0
>> (gdb) p *(pthread_mutex_t*)0x7fd4264535a8
>> $4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 
>> 1, __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
>> __size = 
>> "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002"
>> , '\000' <repeats 22 times>, __align = 2}
>> (gdb) thread 50
>> [Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0 
>> 0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
>> #1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000,
>> buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at
>> msg/simple/Pipe.cc:2428
>> #2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000,
>> buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
>> #3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking 
>> (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at
>> msg/simple/Pipe.cc:2492
>
> It's supposed to be a non-blocking read, so in principle this shouldn't block and hold the lock.
>
> I suspect what you're seeing is a variation of #14120, fixed by 63e44e32974c9bae17bb1bfd4261dcb024ad845c.
>
> How reproducible is this?  We can push a build that includes the fix, but it would be nice to have some confirmation that it is the right one since so far this has been a very rare case.
>
> Thanks!
> sage
>
>
>> #4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000, 
>> pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at
>> msg/simple/Pipe.cc:2032
>> #5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at
>> msg/simple/Pipe.cc:1581
>> #6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) 
>> at
>> msg/simple/Pipe.h:50
>> #7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018) 
>> at common/Thread.cc:61
>> #8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at
>> common/Thread.cc:45
>> #9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
>> #10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
>>
>> Because we are dropping a large amount of ip packets, so we think the “recv” call is blocked for a long time, which led to the suicide timeout.
>>
>> We read the source code, and found that the lock that directly made the thread hit suicide timeout is Connection::lock. Why is this lock used in both the submit_message and read_message?  Is it supposed to be this way?
>>
>> Please help me, thank you☺
>>
>>

