From: Wido den Hollander <wido@42on.com>
To: huang jun <hjwsm1989@gmail.com>
Cc: Dan van der Ster <dan@vanderster.com>,
	Xiaoxi Chen <superdebuger@gmail.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Failing OSDs (suicide timeout) due to flaky clients
Date: Tue, 5 Jul 2016 11:10:35 +0200 (CEST)
Message-ID: <2029231343.493.1467709835165@ox.pcextreme.nl>
In-Reply-To: <CABAwU-bBofDt4yaWqgVADZP1Vx=CQKqFrbJY7DSPYCcq8AM9Qw@mail.gmail.com>


> On 5 July 2016 at 10:56, huang jun <hjwsm1989@gmail.com> wrote:
> 
> 
> I have seen OSDs time out like this many times.
> In SimpleMessenger mode, when sending a message, the PipeConnection
> holds a lock which may already be held by another thread.
> It was reported before: http://tracker.ceph.com/issues/9921
> 

Thank you! That certainly looks like the same symptoms we are seeing in this cluster.

The bug has been marked as resolved, but are you sure it is?
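
To make sure we are talking about the same scenario: the pattern I take away
from that ticket is roughly the toy program below. The names are made up and
this is not the actual SimpleMessenger/Pipe code, just an illustration. The
op thread calls send_message(), which needs a per-connection lock, while
another thread holding that lock is stuck talking to a lossy or dead peer, so
the op thread never resets its heartbeat and eventually hits the suicide
timeout.

  // Toy illustration only; made-up names, not the real Pipe code.
  #include <chrono>
  #include <iostream>
  #include <mutex>
  #include <thread>

  struct FakeConnection {
    std::mutex lock;  // stands in for the per-connection lock

    // E.g. a writer thread blocked on a socket to a lossy/dead client.
    void writer_stuck_on_peer() {
      std::lock_guard<std::mutex> g(lock);
      std::this_thread::sleep_for(std::chrono::seconds(5));
    }

    // Called from the OSD op thread; blocks until the writer lets go.
    void send_message() {
      std::lock_guard<std::mutex> g(lock);
      // queue the message ...
    }
  };

  int main() {
    FakeConnection c;
    std::thread writer([&] { c.writer_stuck_on_peer(); });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    auto t0 = std::chrono::steady_clock::now();
    c.send_message();  // the "op thread" stalls here; its heartbeat is not reset
    auto secs = std::chrono::duration_cast<std::chrono::seconds>(
                    std::chrono::steady_clock::now() - t0).count();
    std::cout << "op thread was blocked for " << secs << "s\n";
    writer.join();
    return 0;
  }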

Wido

> 2016-07-05 15:26 GMT+08:00 Wido den Hollander <wido@42on.com>:
> >
> >> On 5 July 2016 at 3:38, Xiaoxi Chen <superdebuger@gmail.com> wrote:
> >>
> >>
> >> This is not a heartbeat timeout between OSDs, it's an internal thread
> >> timeout; in this case the OSD::osd_op_tp thread 7f8cb45a3700 hit the
> >> timeout after 150s. The timeout is tracked by an alarm which is reset
> >> every time the thread picks up and is about to start processing a
> >> request. So if something prevents the thread from finishing the
> >> request, it can time out like this.
> >>
> >
> > Yes, that's the conclusion I came to as well.
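> >
> > As I understand it, the mechanism is roughly the sketch below. These are
> > made-up names, not the real ceph::HeartbeatMap code: the worker resets a
> > timestamp right before it starts on a request, and a watchdog thread warns
> > once that timestamp is older than the grace period and asserts once it
> > passes the suicide grace, which matches the "had timed out after 15" and
> > "had suicide timed out after 150" lines in the log.
> >
> >   // Sketch of the heartbeat idea only; not the real ceph::HeartbeatMap code.
> >   #include <atomic>
> >   #include <cassert>
> >   #include <chrono>
> >   #include <cstdio>
> >
> >   using Clock = std::chrono::steady_clock;
> >
> >   struct Handle {
> >     std::atomic<Clock::rep> last_reset{Clock::now().time_since_epoch().count()};
> >     int grace = 15;          // "had timed out after 15"
> >     int suicide_grace = 150; // "had suicide timed out after 150"
> >   };
> >
> >   // Worker: reset the alarm right before processing each request.
> >   void start_request(Handle &h) {
> >     h.last_reset = Clock::now().time_since_epoch().count();
> >     // ... process the op; if this blocks, last_reset never moves forward ...
> >   }
> >
> >   // Watchdog: runs periodically in another thread.
> >   void is_healthy(const Handle &h) {
> >     Clock::time_point last{Clock::duration{h.last_reset.load()}};
> >     auto age = std::chrono::duration_cast<std::chrono::seconds>(
> >                    Clock::now() - last).count();
> >     if (age > h.grace)
> >       std::printf("thread had timed out after %d\n", h.grace);
> >     assert(age <= h.suicide_grace && "hit suicide timeout");
> >   }
> >
> >   int main() {
> >     Handle h;
> >     start_request(h);
> >     is_healthy(h);  // healthy here; a stalled worker would trip the assert
> >     return 0;
> >   }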
> >
> >> But I don't understand: send_message in Ceph is async -- it just
> >> puts the message into a buffer rather than sending it -- so
> >> theoretically it should not block the op thread?
> >>
> >
> > Indeed, in theory it shouldn't. But I only see this happening when network connectivity to one or more clients becomes bad: high packet loss and such. Then OSDs start to fall over and mark each other down.
> >
> > The network between the OSDs is just fine.
> >
> > Could it be that "a buffer" becomes full, which causes the async send to block?
> >
> > Just thinking out loud, as this is a hard problem to diagnose and trigger.
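> >
> > Purely hypothetical, but something like the toy queue below is what I have
> > in mind. I have not verified that the messenger really uses a bounded
> > queue like this; it only illustrates how an "async" send could still block
> > the op thread once the writer for a flaky connection stops draining it.
> >
> >   // Hypothetical bounded send queue; NOT how the SimpleMessenger is
> >   // actually implemented, just an illustration of the question above.
> >   #include <condition_variable>
> >   #include <cstddef>
> >   #include <deque>
> >   #include <mutex>
> >   #include <string>
> >
> >   class BoundedSendQueue {
> >     std::mutex m;
> >     std::condition_variable not_full;
> >     std::deque<std::string> q;
> >     const std::size_t capacity = 1024;
> >
> >   public:
> >     // "Async" send: normally just enqueues and returns immediately...
> >     void send(std::string msg) {
> >       std::unique_lock<std::mutex> l(m);
> >       // ...but if the peer stops draining the queue, the producer
> >       // (the op thread) blocks right here until there is room again.
> >       not_full.wait(l, [&] { return q.size() < capacity; });
> >       q.push_back(std::move(msg));
> >     }
> >
> >     // Consumer side: the writer thread for this connection.
> >     bool pop(std::string &out) {
> >       std::unique_lock<std::mutex> l(m);
> >       if (q.empty()) return false;
> >       out = std::move(q.front());
> >       q.pop_front();
> >       not_full.notify_one();
> >       return true;
> >     }
> >   };
> >
> >   int main() {
> >     BoundedSendQueue sq;
> >     sq.send("ping");        // returns immediately while the queue has room
> >     std::string msg;
> >     return sq.pop(msg) ? 0 : 1;
> >   }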
> >
> > Wido
> >
> >> 2016-07-05 0:48 GMT+08:00 Wido den Hollander <wido@42on.com>:
> >> >
> >> >> On 4 July 2016 at 17:54, Dan van der Ster <dan@vanderster.com> wrote:
> >> >>
> >> >>
> >> >> Hi Wido,
> >> >>
> >> >> On Mon, Jul 4, 2016 at 4:41 PM, Wido den Hollander <wido@42on.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On a Ceph Hammer (0.94.5) cluster I've seen OSDs fail multiple times when one or more clients respond flakily (due to a full netfilter conntrack table).
> >> >> >
> >> >> > The cluster is a 250 OSD cluster with roughly 30 clients all running KVM/Qemu with librbd.
> >> >> >
> >> >> > What happens is that the cluster is doing roughly 10k IOPS and then a client starts to experience packet loss. Due to this, responses sent back to the client by the OSD might be lost, and TCP on the OSD side will retransmit them.
> >> >> >
> >> >> > The clients in this case run into a full conntrack table on the hypervisor, which also causes Ceph traffic to be dropped.
> >> >> >
> >> >> > After a few minutes OSDs will start to commit suicide throughout the cluster:
> >> >> >
> >> >> >     -4> 2016-07-04 15:38:37.794230 7f8c9b63c700 10 monclient: renew_subs
> >> >> >     -3> 2016-07-04 15:38:37.794235 7f8c9b63c700 10 monclient: _send_mon_message to mon.charlie at [2a00:f10:121:200::6789:3]:6789/0
> >> >> >     -2> 2016-07-04 15:38:39.944116 7f8cb45a3700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8c95630700' had timed out after 15
> >> >> >     -1> 2016-07-04 15:38:39.944146 7f8cb45a3700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8c95630700' had suicide timed out after 150
> >> >> >      0> 2016-07-04 15:38:39.952054 7f8cb45a3700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8cb45a3700 time 2016-07-04 15:38:39.944180
> >> >> > common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> >> >> >  ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> >> >> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc60eb]
> >> >> >  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xb02089]
> >> >> >  3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xb02916]
> >> >> >  4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xb02ff7]
> >> >> >  5: (CephContextServiceThread::entry()+0x154) [0xbd61d4]
> >> >> >  6: (()+0x8182) [0x7f8cb7a0d182]
> >> >> >  7: (clone()+0x6d) [0x7f8cb5f5747d]
> >> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >> >>
> >> >> I don't understand why a client's full conntrack table would affect
> >> >> heartbeats between OSDs.
> >> >
> >> > Me neither, but it is happening. My theory for now is that somewhere in the network stack something becomes saturated because the OSDs can't send their replies back properly.
> >> >
> >> > As soon as the client's network is working fine again, the Ceph cluster becomes healthy and OSDs stop crashing.
> >> >
> >> >> Can you work backwards in the logs to see what 7f8c95630700 was doing?
> >> >
> >> >  -1196> 2016-07-04 15:36:27.792105 7f8c9b63c700 10 log_client  will send 2016-07-04 15:36:24.187675 osd.9 [XXXXX:2]:6820/3330 23763 : cluster [WRN] slow request 2.996474 seconds old, received at 2016-07-04 15:36:21.191081: osd_op(client.18130582.0:130421015 rbd_data.b7b662ae8944a.000000000000d339 [set-alloc-hint object_size 4194304 write_size 4194304,write 1339392~8192] 20.6b89ee5c snapc 37e=[] ack+ondisk+write e382057) currently no flag points reached
> >> >  -1195> 2016-07-04 15:36:27.792121 7f8c9b63c700 10 log_client  will send 2016-07-04 15:36:24.187680 osd.9 [XXXXX:2]:6820/3330 23764 : cluster [WRN] slow request 2.879790 seconds old, received at 2016-07-04 15:36:21.307765: osd_repop(client.12376090.0:17851029 26.1f2 4a7c1f2/rbd_data.270ca14cf1dfd2.0000000000000a3b/head//26 v 382057'4620873) currently no flag points reached
> >> >
> >> > That's what it was doing before it crashed. The 100 lines before that are almost identical.
> >> >
> >> >>
> >> >> >
> >> >> > On other OSDs you might see messages like these come by:
> >> >> >
> >> >> >    -96> 2016-07-04 15:33:23.645713 7fa9b7243700  0 -- [XX:YY:AA:BB::30]:6803/7451 >> [XX:YY:AA:BB::29]:6800/1019342 pipe(0x2c4e4000 sd=22 :56587 s=1 pgs=835 cs=1 l=0 c=0x318f1de0).connect got RESETSESSION
> >> >> >    -95> 2016-07-04 15:33:23.645836 7fa9db92d700  0 -- [XX:YY:AA:BB::30]:6803/7451 >> [XX:YY:AA:BB::39]:6822/2687 pipe(0x4914000 sd=20 :38712 s=1 pgs=20 cs=1 l=0 c=0x3e4e1080).connect got RESETSESSION
> >> >> >    -94> 2016-07-04 15:33:24.563595 7faa09f29700 10 monclient: renew_subs
> >> >> >    -93> 2016-07-04 15:33:24.563621 7faa09f29700 10 monclient: _send_mon_message to mon.beta at [2a00:f10:121:200::6789:2]:6789/0
> >> >> >    -92> 2016-07-04 15:33:24.564221 7faa09f29700 10 monclient: renew_subs
> >> >> >    -91> 2016-07-04 15:33:24.564226 7faa09f29700 10 monclient: _send_mon_message to mon.beta at [2a00:f10:121:200::6789:2]:6789/0
> >> >> >    -90> 2016-07-04 15:33:24.567635 7faa09f29700 10 monclient: renew_subs
> >> >> >    -89> 2016-07-04 15:33:24.567642 7faa09f29700 10 monclient: _send_mon_message to mon.beta at [2a00:f10:121:200::6789:2]:6789/0
> >> >> >    -88> 2016-07-04 15:33:24.567667 7faa0cf2f700  0 log_channel(cluster) log [WRN] : map e382054 wrongly marked me down
> >> >> >    -87> 2016-07-04 15:33:25.231536 7faa270cd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7faa0271a700' had timed out after 15
> >> >> >    -86> 2016-07-04 15:33:28.800283 7fa9e65d9700  0 -- [XX:YY:AA:BB::30]:6813/1007451 >> [XX:YY:AA:BB::20]:6821/2641 pipe(0x3706c000 sd=25 :48471 s=2 pgs=25097 cs=1 l=0 c=0x120c5fa0).fault, initiating reconnect
> >> >> >    -85> 2016-07-04 15:33:28.800795 7fa9ec437700  0 -- [XX:YY:AA:BB::30]:6813/1007451 >> [XX:YY:AA:BB::39]:6813/4497 pipe(0x341bc000 sd=26 :53469 s=2 pgs=24916 cs=1 l=0 c=0xdadf8c0).fault, initiating reconnect
> >> >> >    -84> 2016-07-04 15:33:28.801374 7fa9a4f81700  0 -- [XX:YY:AA:BB::30]:6813/1007451 >> [XX:YY:AA:BB::20]:6821/2641 pipe(0x3706c000 sd=25 :48473 s=1 pgs=25097 cs=2 l=0 c=0x120c5fa0).connect got RESETSESSION
> >> >> >    -83> 2016-07-04 15:33:28.801832 7fa9aced3700  0 -- [XX:YY:AA:BB::30]:6813/1007451 >> [XX:YY:AA:BB::39]:6813/4497 pipe(0x341bc000 sd=26 :53471 s=1 pgs=24916 cs=2 l=0 c=0xdadf8c0).connect got RESETSESSION
> >> >> >    -82> 2016-07-04 15:33:30.231626 7faa270cd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7faa0271a700' had timed out after 15
> >> >> >    -81> 2016-07-04 15:33:32.366023 7faa04f1f700 10 monclient: tick
> >> >>
> >> >>
> >> >> Again, what did thread 7faa0271a700 last do before timing out?
> >> >>
> >> >> (BTW, in the past we used
> >> >>
> >> >>   ms tcp read timeout = 60
> >> >>
> >> >> to work around strange el6 kernel networking bugs -- maybe it would help here.)
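> >> >>
> >> >> For reference, that is just a plain config option; e.g. in ceph.conf,
> >> >> something along these lines (section and value only as an example,
> >> >> adjust to your environment):
> >> >>
> >> >>   [global]
> >> >>       ms tcp read timeout = 60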
> >> >>
> >> >> -- Dan
> >> >>
> >> >> >
> >> >> >
> >> >> > Due to the flaky responses of the clients, roughly 10 to 20 OSDs die or start to respond very, very slowly, causing an outage on RBD.
> >> >> >
> >> >> > As soon as the client is fixed, it takes just a few minutes for the cluster to become healthy and proceed.
> >> >> >
> >> >> > To me it seems like there are some buffers inside the OSD which fill up and cause the OSD to respond slowly to other network traffic.
> >> >> >
> >> >> > Does that make any sense?
> >> >> >
> >> >> > Wido
> 
> 
> 
> -- 
> Thank you!
> HuangJun

Thread overview: 13+ messages
2016-07-04 14:41 Failing OSDs (suicide timeout) due to flaky clients Wido den Hollander
2016-07-04 15:54 ` Dan van der Ster
2016-07-04 16:48   ` Wido den Hollander
2016-07-05  1:38     ` Xiaoxi Chen
2016-07-05  7:26       ` Wido den Hollander
2016-07-05  8:56         ` huang jun
2016-07-05  9:10           ` Wido den Hollander [this message]
2016-07-05 17:27             ` Gregory Farnum
2016-07-05 17:45               ` Wido den Hollander
2016-07-05 17:48                 ` Gregory Farnum
2016-07-05 18:32                   ` Wido den Hollander
2016-07-05 18:35                     ` Gregory Farnum
2016-07-05 18:59                       ` Wido den Hollander
