* Handling is_readable=0 periods in mon
@ 2017-05-23 17:29 John Spray
  2017-05-24  5:12 ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: John Spray @ 2017-05-23 17:29 UTC (permalink / raw)
  To: Ceph Development, Kefu Chai, Joao Eduardo Luis

Hi all,

I could use some help from people who understand the mon better than I
do with this ticket: http://tracker.ceph.com/issues/19706

The issue is that MDSMonitor is incorrectly killing MDSs because it
hasn't seen beacon messages, but the beacon messages are actually just
held up because is_readable = 0, like this:
2017-05-23 13:34:20.054785 7f772f1c2700 10
mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
172.21.15.77:6809/2700711429 con 0x7f77428d8f00
2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
lease_expire=0.000000 has v0 lc 293
2017-05-23 13:34:20.054791 7f772f1c2700 10
mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
readable (v9)

This appears to happen when one or more mons are a bit laggy, but it
occurs before any election has been triggered.

We have code for handling slow elections by checking how long it has
been since the last tick, and resetting our timeout information for
MDS beacons if it has been too long
(https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)

However, in this case the tick() function is getting called
throughout, we're just not seeing the beacons because they're held up
waiting for readable.

I could hack around this by only doing timeouts if *any* daemon has
successfully got a beacon through in the last (mds_beacon_grace*2) or
something like that, but I wonder if there's a Right way to handle
this for PaxosService subclasses?
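[Editor's note: for concreteness, the hack described above could be sketched roughly as below. This is a standalone illustration; `BeaconTimeoutGate`, `note_beacon`, and `grace` are made-up names, not actual MDSMonitor members.]

```cpp
#include <cassert>
#include <chrono>

using mono_clock = std::chrono::steady_clock;

// Sketch of the hack: only run the per-MDS beacon timeout pass if *any*
// daemon has managed to get a beacon through within the last 2 * grace.
// If nothing at all got through recently, assume the mon (not the MDSs)
// is the problem and skip evictions this tick.
struct BeaconTimeoutGate {
  mono_clock::time_point last_beacon_any{};  // updated on every accepted beacon
  std::chrono::seconds grace{15};            // stands in for mds_beacon_grace

  void note_beacon(mono_clock::time_point now) { last_beacon_any = now; }

  bool may_evict(mono_clock::time_point now) const {
    return (now - last_beacon_any) <= 2 * grace;
  }
};
```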

Thanks,
John

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Handling is_readable=0 periods in mon
  2017-05-23 17:29 Handling is_readable=0 periods in mon John Spray
@ 2017-05-24  5:12 ` Gregory Farnum
  2017-05-24 10:19   ` John Spray
  2017-05-25 16:24   ` kefu chai
  0 siblings, 2 replies; 8+ messages in thread
From: Gregory Farnum @ 2017-05-24  5:12 UTC (permalink / raw)
  To: John Spray, Ceph Development, Kefu Chai, Joao Eduardo Luis

On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
>
> Hi all,
>
> I could use some help from people who understand the mon better than I
> do with this ticket: http://tracker.ceph.com/issues/19706
>
> The issue is that MDSMonitor is incorrectly killing MDSs because it
> hasn't seen beacon messages, but the beacon messages are actually just
> held up because is_readable = 0, like this:
> 2017-05-23 13:34:20.054785 7f772f1c2700 10
> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
> lease_expire=0.000000 has v0 lc 293
> 2017-05-23 13:34:20.054791 7f772f1c2700 10
> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
> readable (v9)
>
> This appears to be happening when one or more mons are a bit laggy,
> but it is happening before an election has happened.
>
> We have code for handling slow elections by checking how long it has
> been since the last tick, and resetting our timeout information for
> MDS beacons if it has been too long
> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
>
>
> However, in this case the tick() function is getting called
> throughout, we're just not seeing the beacons because they're held up
> waiting for readable.


This story doesn't seem quite right/complete. This code is only invoked
if the monitor is the leader, which certainly doesn't happen when the
monitor is out of quorum. Is the beacon maybe going to a peon which
isn't forwarding quickly enough?

>
>
> I could hack around this by only doing timeouts if *any* daemon has
> successfully got a beacon through in the last (mds_beacon_grace*2) or
> something like that, but I wonder if there's a Right way to handle
> this for PaxosService subclasses?


The MDS is, I think, the only thing doing this, so if its patterns
aren't working we probably don't have a good one. (The manager just
started doing it, so it may have a similar problem?)
I suppose you might take into account the paxos quorum timeouts and
which monitor the MDS was connected to, so that it only marks an MDS
down if you have positive belief the monitor it was pinging is alive?
-Greg


* Re: Handling is_readable=0 periods in mon
  2017-05-24  5:12 ` Gregory Farnum
@ 2017-05-24 10:19   ` John Spray
  2017-05-24 20:45     ` Gregory Farnum
  2017-05-25 16:24   ` kefu chai
  1 sibling, 1 reply; 8+ messages in thread
From: John Spray @ 2017-05-24 10:19 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development, Kefu Chai, Joao Eduardo Luis

On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
>>
>> Hi all,
>>
>> I could use some help from people who understand the mon better than I
>> do with this ticket: http://tracker.ceph.com/issues/19706
>>
>> The issue is that MDSMonitor is incorrectly killing MDSs because it
>> hasn't seen beacon messages, but the beacon messages are actually just
>> held up because is_readable = 0, like this:
>> 2017-05-23 13:34:20.054785 7f772f1c2700 10
>> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
>> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
>> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
>> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
>> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
>> lease_expire=0.000000 has v0 lc 293
>> 2017-05-23 13:34:20.054791 7f772f1c2700 10
>> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
>> readable (v9)
>>
>> This appears to be happening when one or more mons are a bit laggy,
>> but it is happening before an election has happened.
>>
>> We have code for handling slow elections by checking how long it has
>> been since the last tick, and resetting our timeout information for
>> MDS beacons if it has been too long
>> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
>>
>>
>> However, in this case the tick() function is getting called
>> throughout, we're just not seeing the beacons because they're held up
>> waiting for readable.
>
>
> This story doesn't seem quite right/complete This code is only invoked
> if the monitor is the leader, which certainly doesn't happen when the
> monitor is out of quorum. Is the beacon maybe going to a peon which
> isn't forwarding quickly enough?

That was what I thought (the messages are indeed being forwarded and
getting held up a bit there), but then I looked at the leader's log
and they were getting held up there too.

I was thinking/hoping that there would be an existing mechanism to
deal with laggy mons not forwarding messages, but perhaps not if they
only look laggy to the other mons and not to the client :-/

>
>>
>>
>> I could hack around this by only doing timeouts if *any* daemon has
>> successfully got a beacon through in the last (mds_beacon_grace*2) or
>> something like that, but I wonder if there's a Right way to handle
>> this for PaxosService subclasses?
>
>
> The MDS is I think the only thing doing this, so if its patterns
> aren't working we probably don't. (The manager just started doing it,
> so may have a similar problem?)
> I suppose you might take into account the paxos quorum timeouts and
> which monitor the MDS was connected to, so that it only marks an MDS
> down if you have positive belief the monitor it was pinging is alive?

Yeah, you're probably right that if there's nothing else thinking about
the forwarding, then MDSMonitor does need to track who forwarded each
beacon and consider whether that forwarder is still healthy.

John


* Re: Handling is_readable=0 periods in mon
  2017-05-24 10:19   ` John Spray
@ 2017-05-24 20:45     ` Gregory Farnum
  2017-05-24 21:14       ` John Spray
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2017-05-24 20:45 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development, Kefu Chai, Joao Eduardo Luis

On Wed, May 24, 2017 at 3:20 AM John Spray <jspray@redhat.com> wrote:
>
> On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> > On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
> >>
> >> Hi all,
> >>
> >> I could use some help from people who understand the mon better than I
> >> do with this ticket: http://tracker.ceph.com/issues/19706
> >>
> >> The issue is that MDSMonitor is incorrectly killing MDSs because it
> >> hasn't seen beacon messages, but the beacon messages are actually just
> >> held up because is_readable = 0, like this:
> >> 2017-05-23 13:34:20.054785 7f772f1c2700 10
> >> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
> >> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
> >> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
> >> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
> >> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
> >> lease_expire=0.000000 has v0 lc 293
> >> 2017-05-23 13:34:20.054791 7f772f1c2700 10
> >> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
> >> readable (v9)
> >>
> >> This appears to be happening when one or more mons are a bit laggy,
> >> but it is happening before an election has happened.
> >>
> >> We have code for handling slow elections by checking how long it has
> >> been since the last tick, and resetting our timeout information for
> >> MDS beacons if it has been too long
> >> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
> >>
> >>
> >> However, in this case the tick() function is getting called
> >> throughout, we're just not seeing the beacons because they're held up
> >> waiting for readable.
> >
> >
> > This story doesn't seem quite right/complete This code is only invoked
> > if the monitor is the leader, which certainly doesn't happen when the
> > monitor is out of quorum. Is the beacon maybe going to a peon which
> > isn't forwarding quickly enough?
>
> That was what I thought (the messages are indeed being forwarded and
> getting held up a bit there), but then I looked at the leader's log
> and they were getting held up there too.


Okay, so it's not that one peon is laggy but that the monitors are
knowingly not in a quorum, including the one who is leader on both
sides of the election (but not while things are laggy!). And then when
the election happens, the leader does a tick and notices it hasn't
gotten a beacon from any MDSes in the past timeout interval.

So that suggests you actually just need to take account of how long
it's been since the monitor went active, not that you need a major
rework of the tracking. Probably just modify it so the code won't mark
down an MDS until at least beacon_timeout has passed since the mon
went active? (I don't know off-hand how to query that state but I'm
sure it's available somewhere.)
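[Editor's note: a minimal sketch of this suggestion, with made-up names (`EvictionGate`, `on_active`) standing in for whatever hook the monitor actually provides for "went active":]

```cpp
#include <cassert>
#include <chrono>

using mono_clock = std::chrono::steady_clock;

// Sketch: remember when the monitor last went active, and refuse to
// mark any MDS down until at least one full beacon grace period has
// elapsed since then. Not the actual Ceph API.
struct EvictionGate {
  mono_clock::time_point active_since{};
  std::chrono::seconds beacon_grace{15};

  void on_active(mono_clock::time_point now) { active_since = now; }

  bool may_mark_down(mono_clock::time_point now) const {
    return (now - active_since) >= beacon_grace;
  }
};
```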
-Greg

>
>
> I was thinking/hoping that there would be an existing mechanism to
> deal with laggy mons not forwarding messages, but perhaps not if they
> only look laggy to the other mons and not to the client :-/
>
> >
> >>
> >>
> >> I could hack around this by only doing timeouts if *any* daemon has
> >> successfully got a beacon through in the last (mds_beacon_grace*2) or
> >> something like that, but I wonder if there's a Right way to handle
> >> this for PaxosService subclasses?
> >
> >
> > The MDS is I think the only thing doing this, so if its patterns
> > aren't working we probably don't. (The manager just started doing it,
> > so may have a similar problem?)
> > I suppose you might take into account the paxos quorum timeouts and
> > which monitor the MDS was connected to, so that it only marks an MDS
> > down if you have positive belief the monitor it was pinging is alive?
>
> Yeah, you're probably right that if there's not something else
> thinking about the forwarding, then MDSMonitor does need to track who
> forwarded and consider whether the forwarder is still considered
> healthy.
>
> John


* Re: Handling is_readable=0 periods in mon
  2017-05-24 20:45     ` Gregory Farnum
@ 2017-05-24 21:14       ` John Spray
  2017-05-24 21:31         ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: John Spray @ 2017-05-24 21:14 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development, Kefu Chai, Joao Eduardo Luis

[-- Attachment #1: Type: text/plain, Size: 4574 bytes --]

On Wed, May 24, 2017 at 9:45 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Wed, May 24, 2017 at 3:20 AM John Spray <jspray@redhat.com> wrote:
>>
>> On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
>> > On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I could use some help from people who understand the mon better than I
>> >> do with this ticket: http://tracker.ceph.com/issues/19706
>> >>
>> >> The issue is that MDSMonitor is incorrectly killing MDSs because it
>> >> hasn't seen beacon messages, but the beacon messages are actually just
>> >> held up because is_readable = 0, like this:
>> >> 2017-05-23 13:34:20.054785 7f772f1c2700 10
>> >> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
>> >> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
>> >> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
>> >> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
>> >> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
>> >> lease_expire=0.000000 has v0 lc 293
>> >> 2017-05-23 13:34:20.054791 7f772f1c2700 10
>> >> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
>> >> readable (v9)
>> >>
>> >> This appears to be happening when one or more mons are a bit laggy,
>> >> but it is happening before an election has happened.
>> >>
>> >> We have code for handling slow elections by checking how long it has
>> >> been since the last tick, and resetting our timeout information for
>> >> MDS beacons if it has been too long
>> >> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
>> >>
>> >>
>> >> However, in this case the tick() function is getting called
>> >> throughout, we're just not seeing the beacons because they're held up
>> >> waiting for readable.
>> >
>> >
>> > This story doesn't seem quite right/complete This code is only invoked
>> > if the monitor is the leader, which certainly doesn't happen when the
>> > monitor is out of quorum. Is the beacon maybe going to a peon which
>> > isn't forwarding quickly enough?
>>
>> That was what I thought (the messages are indeed being forwarded and
>> getting held up a bit there), but then I looked at the leader's log
>> and they were getting held up there too.
>
>
> Okay, so it's not that one peon is laggy but that the monitors are
> knowingly not in a quorum, including the one who is leader on both
> sides of the election (but not while things are laggy!). And then when
> the election happens, the leader does a tick and notices it hasn't
> gotten a beacon from any MDSes in the past timeout interval.

That's what I thought too, until I noticed that the elections were
only happening *after* we've had our tick() and mistakenly killed an
MDS.  There's this period where the beacons are getting ignored, but
our tick() is still getting called.

I've snipped out the timeline in an attachment to this mail (easier to
read in a nice widescreen text editor than most mail clients).

John

> So that suggests you actually just need to take account of how long
> it's been since the monitor went active, not that you need a major
> rework of the tracking. Probably just modify it so the code won't mark
> down an MDS until at least beacon_timeout has passed since the mon
> went active? (I don't know off-hand how to query that state but I'm
> sure it's available somewhere.)
> -Greg
>
>>
>>
>> I was thinking/hoping that there would be an existing mechanism to
>> deal with laggy mons not forwarding messages, but perhaps not if they
>> only look laggy to the other mons and not to the client :-/
>>
>> >
>> >>
>> >>
>> >> I could hack around this by only doing timeouts if *any* daemon has
>> >> successfully got a beacon through in the last (mds_beacon_grace*2) or
>> >> something like that, but I wonder if there's a Right way to handle
>> >> this for PaxosService subclasses?
>> >
>> >
>> > The MDS is I think the only thing doing this, so if its patterns
>> > aren't working we probably don't. (The manager just started doing it,
>> > so may have a similar problem?)
>> > I suppose you might take into account the paxos quorum timeouts and
>> > which monitor the MDS was connected to, so that it only marks an MDS
>> > down if you have positive belief the monitor it was pinging is alive?
>>
>> Yeah, you're probably right that if there's not something else
>> thinking about the forwarding, then MDSMonitor does need to track who
>> forwarded and consider whether the forwarder is still considered
>> healthy.
>>
>> John

[-- Attachment #2: 19706.txt --]
[-- Type: text/plain, Size: 2462 bytes --]



The last preprocess_beacon calls before we killed the active daemon:
2017-05-23 13:33:56.375442 7f772f1c2700 12 mon.b@0(leader).mds e11 preprocess_beacon mdsbeacon(4141/a up:active seq 94 v9)...
2017-05-23 13:33:57.586688 7f772f1c2700 12 mon.b@0(leader).mds e11 preprocess_beacon mdsbeacon(4131/a-s up:standby seq 94 v11)...
2017-05-23 13:34:01.586968 7f772f1c2700 12 mon.b@0(leader).mds e11 preprocess_beacon mdsbeacon(4131/a-s up:standby seq 95 v11)...

An intervening period in which beacon messages are received but ignored like this:
2017-05-23 13:34:09.627452 7f772f1c2700 10 mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742987a00 mdsbeacon(4131/a-s up:standby seq 96 v11) v7 from mds.? 172.21.15.74:6808/2057197859 con 0x7f774288b800
2017-05-23 13:34:09.627455 7f772f1c2700  5 mon.b@0(leader).paxos(paxos updating c 1..292) is_readable = 0 - now=2017-05-23 13:34:09.627456 lease_expire=2017-05-23 13:34:09.211333 has v0 lc 292
2017-05-23 13:34:09.627459 7f772f1c2700 10 mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos -> readable (v11)

The active daemon gets killed:
2017-05-23 13:34:14.627396 7f77319c7700 10 mon.b@0(leader).mds e11 e11: 1/1/1 up {0=a=up:active}, 1 up:standby
2017-05-23 13:34:14.627414 7f77319c7700 10 mon.b@0(leader).mds e11 no beacon from 4141 172.21.15.77:6809/2700711429 mds.0.7 up:active since 2017-05-23 13:33:56.375452
2017-05-23 13:34:14.627429 7f77319c7700 10 mon.b@0(leader).mds e11  replacing 4141 172.21.15.77:6809/2700711429 mds.0.7 up:active with 4131/a-s 172.21.15.74:6808/2057197859
2017-05-23 13:34:14.627436 7f77319c7700 10 mon.b@0(leader).mds e11 fail_mds_gid 4141 mds.a role 0

Eventually an election happens:
2017-05-23 13:34:18.452518 7f772f1c2700  1 -- 172.21.15.74:6789/0 _send_message--> mon.1 172.21.15.77:6789/0 -- election(0327ace0-be76-4567-8281-a76d1933366f propose 7) v6 -- ?+0 0x7f77429a1800
2017-05-23 13:34:18.452567 7f772f1c2700  1 -- 172.21.15.74:6789/0 _send_message--> mon.2 172.21.15.74:6790/0 -- election(0327ace0-be76-4567-8281-a76d1933366f propose 7) v6 -- ?+0 0x7f77429a2100
2017-05-23 13:34:18.494331 7f772f1c2700  1 -- 172.21.15.74:6789/0 _send_message--> mon.1 172.21.15.77:6789/0 -- election(0327ace0-be76-4567-8281-a76d1933366f victory 8) v6 -- ?+0 0x7f77429a2100
2017-05-23 13:34:18.494359 7f772f1c2700  1 -- 172.21.15.74:6789/0 _send_message--> mon.2 172.21.15.74:6790/0 -- election(0327ace0-be76-4567-8281-a76d1933366f victory 8) v6 -- ?+0 0x7f77429a0600



* Re: Handling is_readable=0 periods in mon
  2017-05-24 21:14       ` John Spray
@ 2017-05-24 21:31         ` Gregory Farnum
  2017-05-24 21:37           ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2017-05-24 21:31 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development, Kefu Chai, Joao Eduardo Luis

On Wed, May 24, 2017 at 2:15 PM John Spray <jspray@redhat.com> wrote:
>
> On Wed, May 24, 2017 at 9:45 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> > On Wed, May 24, 2017 at 3:20 AM John Spray <jspray@redhat.com> wrote:
> >>
> >> On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> >> > On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I could use some help from people who understand the mon better than I
> >> >> do with this ticket: http://tracker.ceph.com/issues/19706
> >> >>
> >> >> The issue is that MDSMonitor is incorrectly killing MDSs because it
> >> >> hasn't seen beacon messages, but the beacon messages are actually just
> >> >> held up because is_readable = 0, like this:
> >> >> 2017-05-23 13:34:20.054785 7f772f1c2700 10
> >> >> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
> >> >> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
> >> >> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
> >> >> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
> >> >> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
> >> >> lease_expire=0.000000 has v0 lc 293
> >> >> 2017-05-23 13:34:20.054791 7f772f1c2700 10
> >> >> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
> >> >> readable (v9)
> >> >>
> >> >> This appears to be happening when one or more mons are a bit laggy,
> >> >> but it is happening before an election has happened.
> >> >>
> >> >> We have code for handling slow elections by checking how long it has
> >> >> been since the last tick, and resetting our timeout information for
> >> >> MDS beacons if it has been too long
> >> >> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
> >> >>
> >> >>
> >> >> However, in this case the tick() function is getting called
> >> >> throughout, we're just not seeing the beacons because they're held up
> >> >> waiting for readable.
> >> >
> >> >
> >> > This story doesn't seem quite right/complete This code is only invoked
> >> > if the monitor is the leader, which certainly doesn't happen when the
> >> > monitor is out of quorum. Is the beacon maybe going to a peon which
> >> > isn't forwarding quickly enough?
> >>
> >> That was what I thought (the messages are indeed being forwarded and
> >> getting held up a bit there), but then I looked at the leader's log
> >> and they were getting held up there too.
> >
> >
> > Okay, so it's not that one peon is laggy but that the monitors are
> > knowingly not in a quorum, including the one who is leader on both
> > sides of the election (but not while things are laggy!). And then when
> > the election happens, the leader does a tick and notices it hasn't
> > gotten a beacon from any MDSes in the past timeout interval.
>
> That's what I thought too, until I noticed that the elections were
> only happening *after* we've had our tick() and mistakenly killed an
> MDS.  There's this period where the beacons are getting ignored, but
> our tick() is still getting called.
>
> I've snipped out the timeline in an attachment to this mail (easier to
> read in a nice widescreen text editor than most mail clients).


Okay, now I get it.

1) leader is not readable because there's a proposal pending
(presumably, waiting on some slow peon)
2) leader receives forwarded beacons, holds off until readable
3) leader ticks() and says it hasn't processed a beacon in an
acceptable amount of time so it kills MDS
4) peon finally times out so an election is called

So, yeah, I don't think we have any good patterns established there.
You could presumably narrow the race by not doing the eviction
processing while not readable, but I don't have any particular reason
to think that closes it. You could perhaps try and track unprocessed
beacons, and not evaluate it until you've seen them all?
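[Editor's note: the unprocessed-beacon idea might look roughly like this; `PendingBeaconTracker` and its methods are hypothetical, not existing Ceph code.]

```cpp
#include <cassert>
#include <cstdint>

// Sketch: count beacons that arrived but are parked waiting for
// readable, and have tick() skip the eviction pass while any are
// still pending.
struct PendingBeaconTracker {
  uint64_t received = 0;   // beacons dispatched to the service
  uint64_t processed = 0;  // beacons actually handled once readable

  void on_dispatch() { ++received; }
  void on_processed() { ++processed; }

  // tick() would consult this before enforcing beacon timeouts
  bool safe_to_evict() const { return processed == received; }
};
```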

(FYI: attachments won't go through vger at all — I imagine you got a
bounce back from it. Use pastebin or whatever.)
-Greg


* Re: Handling is_readable=0 periods in mon
  2017-05-24 21:31         ` Gregory Farnum
@ 2017-05-24 21:37           ` Sage Weil
  0 siblings, 0 replies; 8+ messages in thread
From: Sage Weil @ 2017-05-24 21:37 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: John Spray, Ceph Development, Kefu Chai, Joao Eduardo Luis

On Wed, 24 May 2017, Gregory Farnum wrote:
> On Wed, May 24, 2017 at 2:15 PM John Spray <jspray@redhat.com> wrote:
> >
> > On Wed, May 24, 2017 at 9:45 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> > > On Wed, May 24, 2017 at 3:20 AM John Spray <jspray@redhat.com> wrote:
> > >>
> > >> On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> > >> > On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
> > >> >>
> > >> >> Hi all,
> > >> >>
> > >> >> I could use some help from people who understand the mon better than I
> > >> >> do with this ticket: http://tracker.ceph.com/issues/19706
> > >> >>
> > >> >> The issue is that MDSMonitor is incorrectly killing MDSs because it
> > >> >> hasn't seen beacon messages, but the beacon messages are actually just
> > >> >> held up because is_readable = 0, like this:
> > >> >> 2017-05-23 13:34:20.054785 7f772f1c2700 10
> > >> >> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
> > >> >> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
> > >> >> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
> > >> >> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
> > >> >> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
> > >> >> lease_expire=0.000000 has v0 lc 293
> > >> >> 2017-05-23 13:34:20.054791 7f772f1c2700 10
> > >> >> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
> > >> >> readable (v9)
> > >> >>
> > >> >> This appears to be happening when one or more mons are a bit laggy,
> > >> >> but it is happening before an election has happened.
> > >> >>
> > >> >> We have code for handling slow elections by checking how long it has
> > >> >> been since the last tick, and resetting our timeout information for
> > >> >> MDS beacons if it has been too long
> > >> >> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
> > >> >>
> > >> >>
> > >> >> However, in this case the tick() function is getting called
> > >> >> throughout, we're just not seeing the beacons because they're held up
> > >> >> waiting for readable.
> > >> >
> > >> >
> > >> > This story doesn't seem quite right/complete This code is only invoked
> > >> > if the monitor is the leader, which certainly doesn't happen when the
> > >> > monitor is out of quorum. Is the beacon maybe going to a peon which
> > >> > isn't forwarding quickly enough?
> > >>
> > >> That was what I thought (the messages are indeed being forwarded and
> > >> getting held up a bit there), but then I looked at the leader's log
> > >> and they were getting held up there too.
> > >
> > >
> > > Okay, so it's not that one peon is laggy but that the monitors are
> > > knowingly not in a quorum, including the one who is leader on both
> > > sides of the election (but not while things are laggy!). And then when
> > > the election happens, the leader does a tick and notices it hasn't
> > > gotten a beacon from any MDSes in the past timeout interval.
> >
> > That's what I thought too, until I noticed that the elections were
> > only happening *after* we've had our tick() and mistakenly killed an
> > MDS.  There's this period where the beacons are getting ignored, but
> > our tick() is still getting called.
> >
> > I've snipped out the timeline in an attachment to this mail (easier to
> > read in a nice widescreen text editor than most mail clients).
> 
> 
> Okay, now I get it.
> 
> 1) leader is not readable because there's a proposal pending
> (presumably, waiting on some slow peon)
> 2) leader receives forwarded beacons, holds off until readable
> 3) leader ticks() and says it hasn't processed a beacon in an
> acceptable amount of time so it kills MDS
> 4) peon finally times out so an election is called
> 
> So, yeah, I don't think we have any good patterns established there.
> You could presumably narrow the race by not doing the eviction
> processing while not readable, but I don't have any particular reason
> to think that closes it. You could perhaps try and track unprocessed
> beacons, and not evaluate it until you've seen them all?

We could measure the time that a PaxosService is proposing (and not 
readable), and add that value to the threshold for killing an mds or mgr.  
That should work as long as we assume during any given readable 
interval we'll be able to process all pending beacons.  I suspect that is 
normally the case.
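[Editor's note: a rough sketch of this accounting, assuming hypothetical hooks (`on_propose_start`, `on_readable`) rather than the real PaxosService interface; a real version would also reset the accumulator once pending beacons are drained.]

```cpp
#include <cassert>
#include <chrono>

using mono_clock = std::chrono::steady_clock;
using std::chrono::milliseconds;

// Sketch: accumulate the time the service spends proposing (not
// readable), and widen the beacon-kill threshold by that amount.
struct GraceExtender {
  milliseconds not_readable_accum{0};
  mono_clock::time_point propose_start{};
  bool proposing = false;

  void on_propose_start(mono_clock::time_point now) {
    proposing = true;
    propose_start = now;
  }

  void on_readable(mono_clock::time_point now) {
    if (proposing) {
      not_readable_accum +=
          std::chrono::duration_cast<milliseconds>(now - propose_start);
      proposing = false;
    }
  }

  // effective threshold = configured grace + time we could not process beacons
  milliseconds effective_grace(milliseconds grace) const {
    return grace + not_readable_accum;
  }
};
```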

We could also make it so that propose_pending doesn't happen while we are 
processing backlogged/delayed ops.  There are a very small number of 
instances where we *force* an immediate proposal (and arguably those 
should be fixed to not require that since they do not scale).  (FWIW this 
might also address the potential for livelock on cross-service 
proposals.)

sage


* Re: Handling is_readable=0 periods in mon
  2017-05-24  5:12 ` Gregory Farnum
  2017-05-24 10:19   ` John Spray
@ 2017-05-25 16:24   ` kefu chai
  1 sibling, 0 replies; 8+ messages in thread
From: kefu chai @ 2017-05-25 16:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: John Spray, Ceph Development, Kefu Chai, Joao Eduardo Luis

John,

I am noting down my findings on http://tracker.ceph.com/issues/19706
while reading jspray-2017-05-23_11:58:06-fs-wip-jcsp-testing-20170523-distro-basic-smithi/1221142.

On Wed, May 24, 2017 at 1:12 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@redhat.com> wrote:
>>
>> Hi all,
>>
>> I could use some help from people who understand the mon better than I
>> do with this ticket: http://tracker.ceph.com/issues/19706
>>
>> The issue is that MDSMonitor is incorrectly killing MDSs because it
>> hasn't seen beacon messages, but the beacon messages are actually just
>> held up because is_readable = 0, like this:
>> 2017-05-23 13:34:20.054785 7f772f1c2700 10
>> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
>> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
>> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
>> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
>> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
>> lease_expire=0.000000 has v0 lc 293
>> 2017-05-23 13:34:20.054791 7f772f1c2700 10
>> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
>> readable (v9)
>>
>> This appears to be happening when one or more mons are a bit laggy,
>> but it is happening before an election has happened.
>>
>> We have code for handling slow elections by checking how long it has
>> been since the last tick, and resetting our timeout information for
>> MDS beacons if it has been too long
>> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
>>
>>
>> However, in this case the tick() function is getting called
>> throughout, we're just not seeing the beacons because they're held up
>> waiting for readable.
>
>
> This story doesn't seem quite right/complete This code is only invoked
> if the monitor is the leader, which certainly doesn't happen when the
> monitor is out of quorum. Is the beacon maybe going to a peon which
> isn't forwarding quickly enough?

No, this code is also invoked when the mon is a peon; see:


```cpp
bool PaxosService::dispatch(MonOpRequestRef op)
...
  // make sure our map is readable and up to date
  if (!is_readable(m->version)) {
    dout(10) << " waiting for paxos -> readable (v" << m->version << ")" << dendl;
    wait_for_readable(op, new C_RetryMessage(this, op), m->version);
    return true;
  }
```

Like John pointed out, the beacon is being held up by a peon which is
waiting for a readable store.

I just updated http://tracker.ceph.com/issues/19706#note-7 with my finding:

the peon took more than 5 seconds to apply the transaction.

This just does not make sense.

>
>>
>>
>> I could hack around this by only doing timeouts if *any* daemon has
>> successfully got a beacon through in the last (mds_beacon_grace*2) or
>> something like that, but I wonder if there's a Right way to handle
>> this for PaxosService subclasses?
>
>
> The MDS is I think the only thing doing this, so if its patterns
> aren't working we probably don't. (The manager just started doing it,
> so may have a similar problem?)
> I suppose you might take into account the paxos quorum timeouts and
> which monitor the MDS was connected to, so that it only marks an MDS
> down if you have positive belief the monitor it was pinging is alive?
> -Greg



-- 
Regards
Kefu Chai

