* Ceph watchdog-like thing to reduce IO block during process goes down by abort()
@ 2016-03-24  7:00 Igor.Podoski
       [not found] ` <CACJqLyax9Nntz09qJF2E_jjtoVa2moJxzW5T42bHVFGLop31Ug@mail.gmail.com>
  2016-03-24 19:53 ` Ilya Dryomov
  0 siblings, 2 replies; 18+ messages in thread
From: Igor.Podoski @ 2016-03-24  7:00 UTC (permalink / raw)
  To: ceph-devel

Hi Cephers!

Currently, when we have a disk failure, an assert() and then abort() is triggered and the process is killed (ABRT). Other OSDs will eventually mark the dead one as down, but that depends on the heartbeat settings and the monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). While an OSD is dead but not yet marked down, you can see blocked IO on both writes and reads.

Recently I made https://github.com/ceph/ceph/pull/7740, which sends a MarkMeDown message to the monitor just before the OSD goes bye-bye. It prevents blocked IO in the above case, and for any other assert that is not on the message-sending path, so I need the messenger/pipes/connections to still be working for this. I've run some tests and it looks good: when I pull a drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, whereas previously it was > 10 sec (with my cluster settings).

Sage pointed out to me that there was a similar PR some time ago, https://github.com/ceph/ceph/pull/6514, and a thought about a ceph-watchdog process that could monitor OSDs and send info directly to the monitor when they disappear. This would cover all assert() cases, and other ones like kill -9 or similar.

I have a few ideas for how such functionality could be implemented, so my question is: has any of you already started doing something similar?

Let's have a brainstorm about it!

Ideas for improving the 7740/6514 MarkMeDown internal mechanism:
- I think I could send a message with the MarkMeDown payload in a raw way, not through the Messenger path. This could be as good as it is bad in this case.
- I could poke an OSD neighbor via a signal, and the neighbor would send a Mark(SignalSender)Down message (this won't work if a whole HDD controller goes down and all its OSDs die in a narrow time window). So it's like an instant bad-health heartbeat message. It still depends on the Messenger send path of the OSD neighbor.

External ceph-watchdog:
Just like Sage wrote in https://github.com/ceph/ceph/pull/6514#issuecomment-159372845, or similar: each OSD, during startup, passes its own PID to the ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks whether the PID still exists by watching for changes to the /proc/PID or /proc/PID/cmd directory/file (maybe inotify could handle this). When the file or folder changes (goes missing), it sends MarkThisOsdDown to the monitor and that's all. Strictly speaking this wouldn't be a watchdog, but rather a process-down notifier.
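
To make it concrete, the watchdog loop could be as dumb as this (a sketch only:
mark_osd_down() is a stand-in for whatever would actually send MarkThisOsdDown
to the monitor, and it polls /proc because inotify doesn't cover procfs):

#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <iostream>
#include <map>
#include <string>

// Stand-in for the actual MarkThisOsdDown message to the monitor.
void mark_osd_down(const std::string &osd_id) {
  std::cout << "would send MarkThisOsdDown for " << osd_id << std::endl;
}

bool pid_alive(pid_t pid) {
  struct stat st;
  return stat(("/proc/" + std::to_string(pid)).c_str(), &st) == 0;
}

int main() {
  // osd id -> pid, as registered by each OSD on startup (registration not shown).
  std::map<std::string, pid_t> watched = { { "osd.0", 1234 }, { "osd.1", 1235 } };

  while (!watched.empty()) {
    for (auto it = watched.begin(); it != watched.end(); ) {
      if (!pid_alive(it->second)) {
        mark_osd_down(it->first);   // process-down notify, not a strict watchdog
        it = watched.erase(it);
      } else {
        ++it;
      }
    }
    sleep(1);
  }
  return 0;
}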

Or maybe do it both ways, PR7740 + external?

Regards,
Igor.


* RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
       [not found] ` <CACJqLyax9Nntz09qJF2E_jjtoVa2moJxzW5T42bHVFGLop31Ug@mail.gmail.com>
@ 2016-03-24 11:47   ` Igor.Podoski
  2016-03-24 20:55     ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Igor.Podoski @ 2016-03-24 11:47 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Thursday, March 24, 2016 10:48 AM
> To: Podoski, Igor
> Cc: ceph-devel
> Subject: Re: Ceph watchdog-like thing to reduce IO block during process goes
> down by abort()
> 
> 
> 
> On Thu, Mar 24, 2016 at 3:00 PM, Igor.Podoski@ts.fujitsu.com
> <Igor.Podoski@ts.fujitsu.com> wrote:
> > Hi Cephers!
> >
> > Currently when we had a disk failure, assert() and then abort() was
> triggered and process was killed (ABRT). Other osds will eventually mark
> dead one as down, but it depends of heartbeat settings and monitor settings
> (mon_osd_min_down_reporters/mon_osd_min_down_reports). During
> dead-not-marked-as-down osd you can see blocked IO during writes and
> reads.
> >
> > Recently I've made https://github.com/ceph/ceph/pull/7740 which is
> about sending MakrMeDown msg to monitor just before osd is going bye-
> bye. It prevents blocked IO in above case, and any other assert that is not on
> message sending path, so I need messenger/pipes/connections working for
> this. I've made some test and it looks good, when I pull out drive from my
> cluster during rados bench, IO blocks for less than 1 second or not at all,
> previously it was > 10 sec (on my cluster settings).
> >
> > Sage pointed me that some time ago was similar PR
> https://github.com/ceph/ceph/pull/6514 and there was a thought about
> ceph-watchdog process, that could monitor osd's and send info directly to
> monitor when they disappear. This would prevent all assert() cases, and
> other ones like kill -9 or similar.
> >
> > I have a few ideas how such functionality could be implemented, so my
> question is - does any of you started already doing something similar?
> >
> > Let's have a brain storm about it!
> >
> > Ideas about improving 7740/6514 MarkMeDown internal mechanism:
> > - I think, I could send message with MarkMeDown payload, but in a raw
> way, not through Messenger path. This could be as good as bad in this case.
> 
> I think we still go through the msgr path or a specified api? An urgent or
> out-of-band flag?

But we need the messenger instance to be in good shape for sending MarkMeDown; if we hit an assert in the core of the msgr, we will fail to send anything.

> 
> > - I could poke osd-neighbor through signal and neighbor will send
> Mark(SignalSender)Down message (this won't work If whole hdd controller
> will be down, all osd will be dead in narrow time window). So it's like instant
> bad-health heartbeat message. Still depends of Messenger send path of
> osd-neighbor.
> 
> >
> > External ceph-watchdog:
> > Just like Sage wrote
> https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or
> similar: each osd, during start passes its own PID to ceph-watchdog process
> through shared memory/socket/named pipe (whatever). Ceph-watchdog
> checks if current PID exists, by checking changes in /proc/PID or
> /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or
> folder is changed(missing) it sends MarkThisOsdDown to monitor and that's
> all. But this won't be watchdog strict, rather process down notify.
> 
> it looks a little complex and redundant; we need to manage the lifecycle of
> the watchdog itself, and it's much like systemd...

Ok, so back to a slightly modified version of Sage's idea:

Before abort(), the OSD could write its ID (from **argv) to the ceph-watchdog named pipe. The only hazard here is the case when all OSDs want to notify the watchdog at the same time. As I wrote before, it would not be a 'watchdog' process but a 'process down notify', so the question is: do we need a watchdog-like thing for some other stuff (in the future), or will a process-down notify be sufficient?
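
On the OSD side it could be as small as something like this (a sketch; the FIFO
path and message format are made up here, not an existing interface):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Best-effort notification just before dying; must not block the aborting OSD.
void notify_watchdog_and_abort(const char *osd_id) {
  // O_NONBLOCK so a missing or dead watchdog can't wedge us here.
  int fd = open("/var/run/ceph/watchdog.fifo", O_WRONLY | O_NONBLOCK);
  if (fd >= 0) {
    char buf[64];
    int n = snprintf(buf, sizeof(buf), "%s\n", osd_id);
    // FIFO writes of <= PIPE_BUF bytes are atomic, so even if every OSD
    // notifies at the same time the IDs won't interleave.
    ssize_t r = write(fd, buf, n);
    (void)r;
    close(fd);
  }
  abort();
}

int main() {
  // In the real thing the ID would come from the OSD's own identity/argv.
  notify_watchdog_and_abort("osd.123");
}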

> 
> >
> > Or maybe both ways PR7740 + external ?
> >
> > Regards,
> > Igor.
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> --
> 
> 
> Best Regards,
> 
> Wheat

Regards,
Igor.


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24  7:00 Ceph watchdog-like thing to reduce IO block during process goes down by abort() Igor.Podoski
       [not found] ` <CACJqLyax9Nntz09qJF2E_jjtoVa2moJxzW5T42bHVFGLop31Ug@mail.gmail.com>
@ 2016-03-24 19:53 ` Ilya Dryomov
  2016-03-24 20:25   ` Gregory Farnum
  1 sibling, 1 reply; 18+ messages in thread
From: Ilya Dryomov @ 2016-03-24 19:53 UTC (permalink / raw)
  To: Igor.Podoski; +Cc: ceph-devel

On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
<Igor.Podoski@ts.fujitsu.com> wrote:
> Hi Cephers!
>
> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
>
> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
>
> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
>
> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
>
> Let's have a brain storm about it!
>
> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.

>
> External ceph-watchdog:
> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
>
> Or maybe both ways PR7740 + external ?

I'm not involved in any of this, but since you asked for a brain
storm... ;)

Is it worth bothering with the corrupted data structures case at all?
Trying to handle it from within the aborting ceph-osd process is not a
very easy thing to do ("raw way, not through Messenger", signals, etc)
and if you do it wrong, you'd mask the original stack trace.  An
external ceph-watchdog is yet another entity which has to be set up,
maintained and accounted for.

Why not just distinguish legitimate/expected errors which we check for
but currently handle with assert(0) from the actual assert failures?  In
the vast majority of cases that fall into the former bucket all of the
internal data structures, including the messenger, will be in order and
so we can send a MarkMeDown message and fail gracefully.  Implementing
it is just a matter of identifying those sites, but that's not a bad
exercise to do even on its own.

The actual assert failures can abort() as they do now.  Any such
failure is a serious bug, and hopefully there aren't enough of them to be
worth worrying about shrinking the timeout to a minimum, unless there are hard
numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
you deserve to wait for MONs to catch up.  Am I missing any use cases
here?

Thanks,

                Ilya


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 19:53 ` Ilya Dryomov
@ 2016-03-24 20:25   ` Gregory Farnum
  2016-03-24 20:46     ` Ilya Dryomov
  2016-03-25 15:08     ` Milosz Tanski
  0 siblings, 2 replies; 18+ messages in thread
From: Gregory Farnum @ 2016-03-24 20:25 UTC (permalink / raw)
  To: Ilya Dryomov, Sage Weil; +Cc: Igor.Podoski, ceph-devel

On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
> <Igor.Podoski@ts.fujitsu.com> wrote:
>> Hi Cephers!
>>
>> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
>>
>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
>>
>> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
>>
>> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
>>
>> Let's have a brain storm about it!
>>
>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
>> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
>> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.
>
>>
>> External ceph-watchdog:
>> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
>>
>> Or maybe both ways PR7740 + external ?
>
> I'm not involved in any of this, but since you asked for a brain
> storm... ;)
>
> Is it worth bothering with the corrupted data structures case at all?
> Trying to handle it from within the aborting ceph-osd process is not a
> very easy thing to do ("raw way, not through Messenger", signals, etc)
> and if you do it wrong, you'd mask the original stack trace.  An
> external ceph-watchdog is yet another entity which has to be set up,
> maintained and accounted for.
>
> Why not just distinguish legitimate/expected errors which we check for
> but currently handle with assert(0) and the actual assert failures?  In
> the vast majority of cases that fall into the former bucket all of the
> internal data structures, including the messenger, will be in order and
> so we can send a MarkMeDown message and fail gracefully.  Implementing
> it is just a matter of identifying those sites, but that's not a bad
> exercise to do even on its own.
>
> The actual assert failures can abort() as they do now.  Any such
> failure is a serious bug and there's hopefully not too many of them to
> worry about shrinking the timeout to a minimum, unless there are hard
> numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
> you deserve to wait for MONs to catch up.  Am I missing any use cases
> here?

This is something Sam and I have talked about in the past, but
apparently Sage didn't like that idea in
https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
instead?
Personally I tend towards building that kind of functionality into the
daemon, although he's right it will never be quite as good at catching
all cases as an external manager. The upside is that we don't have to
worry about the failure cases between the two of them. ;)
-Greg


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 20:25   ` Gregory Farnum
@ 2016-03-24 20:46     ` Ilya Dryomov
  2016-03-24 20:47       ` Gregory Farnum
  2016-03-25 15:08     ` Milosz Tanski
  1 sibling, 1 reply; 18+ messages in thread
From: Ilya Dryomov @ 2016-03-24 20:46 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Igor.Podoski, ceph-devel

On Thu, Mar 24, 2016 at 9:25 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
>> <Igor.Podoski@ts.fujitsu.com> wrote:
>>> Hi Cephers!
>>>
>>> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
>>>
>>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
>>>
>>> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
>>>
>>> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
>>>
>>> Let's have a brain storm about it!
>>>
>>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
>>> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
>>> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.
>>
>>>
>>> External ceph-watchdog:
>>> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
>>>
>>> Or maybe both ways PR7740 + external ?
>>
>> I'm not involved in any of this, but since you asked for a brain
>> storm... ;)
>>
>> Is it worth bothering with the corrupted data structures case at all?
>> Trying to handle it from within the aborting ceph-osd process is not a
>> very easy thing to do ("raw way, not through Messenger", signals, etc)
>> and if you do it wrong, you'd mask the original stack trace.  An
>> external ceph-watchdog is yet another entity which has to be set up,
>> maintained and accounted for.
>>
>> Why not just distinguish legitimate/expected errors which we check for
>> but currently handle with assert(0) and the actual assert failures?  In
>> the vast majority of cases that fall into the former bucket all of the
>> internal data structures, including the messenger, will be in order and
>> so we can send a MarkMeDown message and fail gracefully.  Implementing
>> it is just a matter of identifying those sites, but that's not a bad
>> exercise to do even on its own.
>>
>> The actual assert failures can abort() as they do now.  Any such
>> failure is a serious bug and there's hopefully not too many of them to
>> worry about shrinking the timeout to a minimum, unless there are hard
>> numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
>> you deserve to wait for MONs to catch up.  Am I missing any use cases
>> here?
>
> This is something Sam and I have talked about in the past, but
> apparently Sage didn't like that idea in
> https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
> instead?
> Personally I tend towards building that kind of functionality into the
> daemon, although he's right it will never be quite as good at catching
> all cases as an external manager. The upside is that we don't have to
> worry about the failure cases between the two of them. ;)

Well, my argument is we don't need it to catch everything or be as
trustworthy as an external entity.  That diff indeed seems fragile,
although I think Sage's use of that word was directed at the overall
approach.

        derr << "FileJournal::do_write: pwrite(fd=" << fd
             << ", hbp.length=" << hbp.length() << ") failed :"
             << cpp_strerror(err) << dendl;
+       ceph_io_error_tidy_shutdown();
        ceph_abort();

or

+    if (m_filestore_fail_eio && r == -EIO) {
+      ceph_io_error_tidy_shutdown();
+    }
     assert(!m_filestore_fail_eio || r != -EIO);

I was thinking of having ceph_abort() and ceph_abort_gracefully().
Very clear, communicates intent and lets the actual asserts do what
they are supposed to do ;)

Thanks,

                Ilya


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 20:46     ` Ilya Dryomov
@ 2016-03-24 20:47       ` Gregory Farnum
  0 siblings, 0 replies; 18+ messages in thread
From: Gregory Farnum @ 2016-03-24 20:47 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Sage Weil, Igor.Podoski, ceph-devel

On Thu, Mar 24, 2016 at 1:46 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:25 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
>> On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
>>> <Igor.Podoski@ts.fujitsu.com> wrote:
>>>> Hi Cephers!
>>>>
>>>> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
>>>>
>>>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
>>>>
>>>> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
>>>>
>>>> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
>>>>
>>>> Let's have a brain storm about it!
>>>>
>>>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
>>>> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
>>>> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.
>>>
>>>>
>>>> External ceph-watchdog:
>>>> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
>>>>
>>>> Or maybe both ways PR7740 + external ?
>>>
>>> I'm not involved in any of this, but since you asked for a brain
>>> storm... ;)
>>>
>>> Is it worth bothering with the corrupted data structures case at all?
>>> Trying to handle it from within the aborting ceph-osd process is not a
>>> very easy thing to do ("raw way, not through Messenger", signals, etc)
>>> and if you do it wrong, you'd mask the original stack trace.  An
>>> external ceph-watchdog is yet another entity which has to be set up,
>>> maintained and accounted for.
>>>
>>> Why not just distinguish legitimate/expected errors which we check for
>>> but currently handle with assert(0) and the actual assert failures?  In
>>> the vast majority of cases that fall into the former bucket all of the
>>> internal data structures, including the messenger, will be in order and
>>> so we can send a MarkMeDown message and fail gracefully.  Implementing
>>> it is just a matter of identifying those sites, but that's not a bad
>>> exercise to do even on its own.
>>>
>>> The actual assert failures can abort() as they do now.  Any such
>>> failure is a serious bug and there's hopefully not too many of them to
>>> worry about shrinking the timeout to a minimum, unless there are hard
>>> numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
>>> you deserve to wait for MONs to catch up.  Am I missing any use cases
>>> here?
>>
>> This is something Sam and I have talked about in the past, but
>> apparently Sage didn't like that idea in
>> https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
>> instead?
>> Personally I tend towards building that kind of functionality into the
>> daemon, although he's right it will never be quite as good at catching
>> all cases as an external manager. The upside is that we don't have to
>> worry about the failure cases between the two of them. ;)
>
> Well, my argument is we don't need it to catch everything or be as
> trustworthy as an external entity.  That diff indeed seems fragile,
> although I think Sage's use of that word was directed at the overall
> approach.
>
>         derr << "FileJournal::do_write: pwrite(fd=" << fd
>              << ", hbp.length=" << hbp.length() << ") failed :"
>              << cpp_strerror(err) << dendl;
> +       ceph_io_error_tidy_shutdown();
>         ceph_abort();
>
> or
>
> +    if (m_filestore_fail_eio && r == -EIO) {
> +      ceph_io_error_tidy_shutdown();
> +    }
>      assert(!m_filestore_fail_eio || r != -EIO);
>
> I was thinking of having ceph_abort() and ceph_abort_gracefully().
> Very clear, communicates intent and lets the actual asserts do what
> they are supposed to do ;)

Yes, that definitely seems preferable and a good idea to me.
-Greg


* RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 11:47   ` Igor.Podoski
@ 2016-03-24 20:55     ` Sage Weil
  2016-03-24 21:15       ` Ilya Dryomov
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-03-24 20:55 UTC (permalink / raw)
  To: Igor.Podoski; +Cc: Haomai Wang, ceph-devel

On Thu, 24 Mar 2016, Igor.Podoski@ts.fujitsu.com wrote:
> Ok, so back to slightly modified Sage idea:
> 
> Osd before abort() could write its ID (from **argv) to ceph-watchdog 
> named pipe. Only one could be hazard here - case when all osd's want to 
> notify watchdog in the same time. As I wrote before it would not be a 
> 'watchdog' process, but 'process down notify', so question is do we need 
> watchdog like thing for some other stuff (in the feature) or process 
> down notify will be sufficient?

I was imagining something that works the other way around, where the 
watchdog is very simple:

 - osd (or any daemon) opens a unix domain socket and identifies 
itself. e.g. "I am osd.123 at 1.2.3.4:6823"
 - if the socket is closed, the watchdog notifies the mon that there was a 
failure
 - the osd (or other daemon) can optionally send a message over the socket
changing its identifier (e.g., if the osd rebinds to a new ip).

This way the watchdog doesn't *do* anything except wait for new 
connections or for connections to close.  No polling of PIDs or anything 
like that.
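
Roughly (a sketch only; the socket path is arbitrary, mark_down() stands in for
the mon notification, and error handling is omitted):

#include <poll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for reporting the failure to the mon.
void mark_down(const std::string &ident) {
  std::cout << "reporting failure of: " << ident << std::endl;
}

int main() {
  int lsock = socket(AF_UNIX, SOCK_STREAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, "/var/run/ceph/watchdog.sock", sizeof(addr.sun_path) - 1);
  unlink(addr.sun_path);
  bind(lsock, (sockaddr *)&addr, sizeof(addr));
  listen(lsock, 16);

  std::vector<pollfd> fds = { { lsock, POLLIN, 0 } };
  std::vector<std::string> idents = { "" };        // idents[i] goes with fds[i]

  for (;;) {
    poll(fds.data(), fds.size(), -1);              // nothing to do but wait
    for (size_t i = fds.size(); i-- > 1; ) {       // daemon connections
      if (!(fds[i].revents & (POLLIN | POLLHUP | POLLERR)))
        continue;
      char buf[256];
      ssize_t n = read(fds[i].fd, buf, sizeof(buf) - 1);
      if (n > 0) {
        buf[n] = '\0';
        idents[i] = buf;                           // "I am osd.123 at 1.2.3.4:6823"
      } else {                                     // EOF: the daemon went away
        if (!idents[i].empty())
          mark_down(idents[i]);
        close(fds[i].fd);
        fds.erase(fds.begin() + i);
        idents.erase(idents.begin() + i);
      }
    }
    if (fds[0].revents & POLLIN) {                 // new daemon identifying itself
      fds.push_back({ accept(lsock, nullptr, nullptr), POLLIN, 0 });
      idents.push_back("");
    }
  }
}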

We could figure out where the most common failures are (e.g., op thread 
timeout, or EIO), but I think in practice that will be hard--there are 
lots of places where we assert that return values are 0.  An external watchdog, 
OTOH, would capture *all* of those cases, and the bugs.

The main concern I have is that the model doesn't work well when you have 
one daemon per host (e.g., microserver on an HDD).  Well, it works, but 
you double the number of monitor sessions.  Maybe that's okay, 
though--it's just an open TCP connection to a mon.

sage


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 20:55     ` Sage Weil
@ 2016-03-24 21:15       ` Ilya Dryomov
  2016-03-24 21:20         ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Ilya Dryomov @ 2016-03-24 21:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor.Podoski, Haomai Wang, ceph-devel

On Thu, Mar 24, 2016 at 9:55 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 24 Mar 2016, Igor.Podoski@ts.fujitsu.com wrote:
>> Ok, so back to slightly modified Sage idea:
>>
>> Osd before abort() could write its ID (from **argv) to ceph-watchdog
>> named pipe. Only one could be hazard here - case when all osd's want to
>> notify watchdog in the same time. As I wrote before it would not be a
>> 'watchdog' process, but 'process down notify', so question is do we need
>> watchdog like thing for some other stuff (in the feature) or process
>> down notify will be sufficient?
>
> I was imagining something that works the other way around, where the
> watchdog is very simple:
>
>  - osd (or any daemon) opens a unix domain socket and identifies
> itself. e.g. "I am osd.123 at 1.2.3.4:6823"
>  - if the socket is closed, the watchdog notifies the mon that there was a
> failure
>  - the osd (or other daemon) can optionally send a message over the socket
> changing it's identifier (e.g, if the osd rebinds to a new ip).
>
> This way the watchdog doesn't *do* anything except wait for new
> connections or for connections to close.  No polling of PIDs or anything
> like that.
>
> We could figure out where the most common failures are (e.g., op thread
> timeout, or EIO), but I think in practice that will be hard--there are
> lots of places where as assert return values are 0.  An external watchdog,
> OTOH, would capture *all* of those cases, and the bugs.

What do you mean by a place where an assert return value is 0?
assert(!ret)?

My point is all of the asserts can be classified into two groups:
something (an error or a case) that isn't handled and an "oops" kind of
thing.  The actual condition doesn't matter.

Ultimately, this is about shrinking the time it takes for a MON to
notice the "oops".  Do we expect those things to be common and frequent
enough to justify an external daemon, however small and simple, on each
OSD node?

Thanks,

                Ilya


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 21:15       ` Ilya Dryomov
@ 2016-03-24 21:20         ` Gregory Farnum
  2016-03-25 12:54           ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-03-24 21:20 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Sage Weil, Igor.Podoski, Haomai Wang, ceph-devel

On Thu, Mar 24, 2016 at 2:15 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>
> Ultimately, this is about shrinking the time it takes for a MON to
> notice the "oops".  Do we expect those things to be common and frequent
> enough to justify an external daemon, however small and simple, on each
> OSD node?

Let's not forget that extra daemons aren't free, quite apart from
having to build them. There's a lot of user education that has to happen.
There's more stuff to install; we'll have extra cephx keys for them
that need to get placed; we need to update all our install and
management tools to set them up. We'll probably run into new kinds of
resource exhaustion, and we'll hit new errors around the local
communication setup. :/ I'm uneasy about creating *any* mechanism that
automatically marks down OSDs, but isn't directed by the OSD in
question.

Plus, I think there are other benefits of annotating our asserts more
carefully. They're kind of a mess right now and if we were able to do
more than crash on disk errors, it'd be nice when we move on to
gathering statistics and things...
-Greg


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 21:20         ` Gregory Farnum
@ 2016-03-25 12:54           ` Sage Weil
  2016-03-25 14:30             ` Ilya Dryomov
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-03-25 12:54 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ilya Dryomov, Igor.Podoski, Haomai Wang, ceph-devel

On Thu, 24 Mar 2016, Gregory Farnum wrote:
> On Thu, Mar 24, 2016 at 2:15 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > Ultimately, this is about shrinking the time it takes for a MON to
> > notice the "oops".  Do we expect those things to be common and frequent
> > enough to justify an external daemon, however small and simple, on each
> > OSD node?
> 
> Let's not forget that extra daemons aren't free quite apart from
> having to build them. There's a lot of user education to happen.
> There's more stuff to install; we'll have extra cephx keys for them
> that need to get placed; we need to update all our install and
> management tools to set them up. We'll probably run into new kinds of
> resource exhaustion, and we'll hit new errors around the local
> communication setup. :/ I'm uneasy about creating *any* mechanism that
> automatically marks down OSDs, but isn't directed by the OSD in
> question.
> 
> Plus, I think there are other benefits of annotating our asserts more
> carefully. They're kind of a mess right now and if we were able to do
> more than crash on disk errors, it'd be nice when we move on to
> gathering statistics and things...

Yep, I'm sold!  :)

Going back to Igor's PR...

	https://github.com/ceph/ceph/pull/7740

I think perhaps the first thing to do is to make a function like 
Ilya suggested that is

	ceph_abort_markmedown()

and then sort out where/when to call it (instead of tackling signal 
handlers immediately).  It seems like the semantics need to be something 
like

 - queue the markdown message for the mon
 - wait for N seconds (where N=5 or so?)
 - ceph_abort()
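
i.e. something with roughly this shape (the names and the grace period are 
placeholders, of course):

#include <chrono>
#include <cstdlib>
#include <thread>

// Placeholder: would queue the markdown message on the mon client.
void queue_markmedown_for_mon() {}

[[noreturn]] void ceph_abort_markmedown(int grace_secs = 5) {
  queue_markmedown_for_mon();                                      // 1. queue it
  std::this_thread::sleep_for(std::chrono::seconds(grace_secs));   // 2. let it go out
  abort();                                                         // 3. die as before
}

int main() { ceph_abort_markmedown(1); }   // demo: queue (no-op here), wait, abort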

There are maybe three call sites that come to mind that will probably 
catch most issues:

 - the do_transaction (or equivalent) error code checks on write
 - a new helper that wraps up the checks/asserts about getting EIO on read
 - the internal heartbeat that goes off when a thread pool gets stuck

What else?

We could also go for an OSD signal handler, but it would have to be a 
best-effort sort of thing (obviously won't work if the messenger is 
busted), and it worries me a bit: what happens if there is a segv in the 
memory allocator, we try to stay alive longer so that we can send 
MarkMeDown, and as a result continue processing some IO but in the 
meantime let something corrupt reach disk or clients or otherwise get 
worse and propagate?

sage


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-25 12:54           ` Sage Weil
@ 2016-03-25 14:30             ` Ilya Dryomov
  0 siblings, 0 replies; 18+ messages in thread
From: Ilya Dryomov @ 2016-03-25 14:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Igor.Podoski, Haomai Wang, ceph-devel

On Fri, Mar 25, 2016 at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 24 Mar 2016, Gregory Farnum wrote:
>> On Thu, Mar 24, 2016 at 2:15 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> >
>> > Ultimately, this is about shrinking the time it takes for a MON to
>> > notice the "oops".  Do we expect those things to be common and frequent
>> > enough to justify an external daemon, however small and simple, on each
>> > OSD node?
>>
>> Let's not forget that extra daemons aren't free quite apart from
>> having to build them. There's a lot of user education to happen.
>> There's more stuff to install; we'll have extra cephx keys for them
>> that need to get placed; we need to update all our install and
>> management tools to set them up. We'll probably run into new kinds of
>> resource exhaustion, and we'll hit new errors around the local
>> communication setup. :/ I'm uneasy about creating *any* mechanism that
>> automatically marks down OSDs, but isn't directed by the OSD in
>> question.
>>
>> Plus, I think there are other benefits of annotating our asserts more
>> carefully. They're kind of a mess right now and if we were able to do
>> more than crash on disk errors, it'd be nice when we move on to
>> gathering statistics and things...
>
> Yep, I'm sold!  :)
>
> Going back to Igor's PR...
>
>         https://github.com/ceph/ceph/pull/7740
>
> I think perhaps the first thing to do is to make a function like
> Ilya suggested that is
>
>         ceph_abort_markmedown()
>
> and then sort out where/when to call it (instead of tackling signal
> handlers immediately).  It seems like the semantics need to be something
> like
>
>  - queue the markdown message for the mon
>  - wait for N seconds (where N=5 or so?)
>  - ceph_abort()

Is it to wait for the message to go out?  If so, maybe request
a MarkMeDown ack and have an N second Cond timeout?  Modifying
OSD::dispatch() or wiring it up through the service abstraction
shouldn't be hard - an ack would take a lot less than a second.
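
i.e. roughly (sketched with std::condition_variable standing in for Cond; the
send/ack plumbing is hand-waved):

#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <mutex>
#include <thread>

std::mutex ack_lock;
std::condition_variable cond;
bool acked = false;

// Would be called from the dispatch path when the MarkMeDown ack arrives.
void handle_markmedown_ack() {
  std::lock_guard<std::mutex> l(ack_lock);
  acked = true;
  cond.notify_all();
}

[[noreturn]] void ceph_abort_markmedown() {
  // send_markmedown();  // placeholder: queue the message to the mon
  std::unique_lock<std::mutex> l(ack_lock);
  // Abort as soon as the ack arrives, or after N seconds if it never does.
  cond.wait_for(l, std::chrono::seconds(5), [] { return acked; });
  abort();
}

int main() {
  // Demo: pretend the ack arrives after one second; we abort right away then.
  std::thread([] {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    handle_markmedown_ack();
  }).detach();
  ceph_abort_markmedown();
}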

>
> There are maybe three call sites that come to mind that will probably
> catch most issues:
>
>  - the do_transaction (or equivalent) error code checks on write
>  - a new helper that wraps up the checks/asserts about getting EIO on read
>  - the internal heartbeat that goes off when a thread pool gets stuck
>
> What else?
>
> We could also go for an OSD signal handler, but it would have to be a
> best-effort sort of thing (obviuosly won't work if the messenger is
> busted), and it worries me a bit: what happens if there is a segv in the
> memory allocator, we try to stay alive longer so that we can send
> MarkMeDown, and as a result continue processing some IO but in the
> meantime let something corrupt reach disk or clients or otherwise get
> worse and propogate?

IMHO it's entirely unnecessary.  An "oops" assert should just abort() -
we are not the kernel, after all.

Thanks,

                Ilya


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-24 20:25   ` Gregory Farnum
  2016-03-24 20:46     ` Ilya Dryomov
@ 2016-03-25 15:08     ` Milosz Tanski
  2016-03-25 15:12       ` Sage Weil
  1 sibling, 1 reply; 18+ messages in thread
From: Milosz Tanski @ 2016-03-25 15:08 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ilya Dryomov, Sage Weil, Igor.Podoski, ceph-devel

On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
>> <Igor.Podoski@ts.fujitsu.com> wrote:
>>> Hi Cephers!
>>>
>>> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
>>>
>>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
>>>
>>> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
>>>
>>> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
>>>
>>> Let's have a brain storm about it!
>>>
>>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
>>> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
>>> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.
>>
>>>
>>> External ceph-watchdog:
>>> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
>>>
>>> Or maybe both ways PR7740 + external ?
>>
>> I'm not involved in any of this, but since you asked for a brain
>> storm... ;)
>>
>> Is it worth bothering with the corrupted data structures case at all?
>> Trying to handle it from within the aborting ceph-osd process is not a
>> very easy thing to do ("raw way, not through Messenger", signals, etc)
>> and if you do it wrong, you'd mask the original stack trace.  An
>> external ceph-watchdog is yet another entity which has to be set up,
>> maintained and accounted for.
>>
>> Why not just distinguish legitimate/expected errors which we check for
>> but currently handle with assert(0) and the actual assert failures?  In
>> the vast majority of cases that fall into the former bucket all of the
>> internal data structures, including the messenger, will be in order and
>> so we can send a MarkMeDown message and fail gracefully.  Implementing
>> it is just a matter of identifying those sites, but that's not a bad
>> exercise to do even on its own.
>>
>> The actual assert failures can abort() as they do now.  Any such
>> failure is a serious bug and there's hopefully not too many of them to
>> worry about shrinking the timeout to a minimum, unless there are hard
>> numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
>> you deserve to wait for MONs to catch up.  Am I missing any use cases
>> here?
>
> This is something Sam and I have talked about in the past, but
> apparently Sage didn't like that idea in
> https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
> instead?
> Personally I tend towards building that kind of functionality into the
> daemon, although he's right it will never be quite as good at catching
> all cases as an external manager. The upside is that we don't have to
> worry about the failure cases between the two of them. ;)
> -Greg

There's no reason the watcher process can't be a child that's kicked
off when the OSD starts up. If there's a pipe between the two, when the
parent goes away the child will get an EOF on reading from the pipe. On
Linux you can also do a cute trick to have the child notified when the
parent quits, using prctl(PR_SET_PDEATHSIG, SIG???).
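
Something like this for the pipe/EOF half (a sketch; notify_mon() is a stand-in,
and the prctl() route would additionally need a signal handler):

#include <unistd.h>
#include <iostream>

// Stand-in for telling the mon that the parent OSD is gone.
void notify_mon() { std::cout << "parent is gone, notifying mon" << std::endl; }

int main() {
  int pipefd[2];
  pipe(pipefd);                     // parent keeps the write end open for life

  if (fork() == 0) {                // child: the watcher
    close(pipefd[1]);
    char c;
    // read() returns 0 (EOF) once every write end is closed, i.e. the parent exited.
    while (read(pipefd[0], &c, 1) > 0)
      ;                             // the parent could also push heartbeats/identity here
    notify_mon();
    _exit(0);
  }

  close(pipefd[0]);                 // parent: go be an OSD
  sleep(2);                         // ... real work would happen here ...
  return 0;                         // exiting closes the write end -> child notices
}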

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-25 15:08     ` Milosz Tanski
@ 2016-03-25 15:12       ` Sage Weil
  2016-03-25 15:23         ` Milosz Tanski
                           ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Sage Weil @ 2016-03-25 15:12 UTC (permalink / raw)
  To: Milosz Tanski; +Cc: Gregory Farnum, Ilya Dryomov, Igor.Podoski, ceph-devel

On Fri, 25 Mar 2016, Milosz Tanski wrote:
> On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> > On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
> >> <Igor.Podoski@ts.fujitsu.com> wrote:
> >>> Hi Cephers!
> >>>
> >>> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
> >>>
> >>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
> >>>
> >>> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
> >>>
> >>> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
> >>>
> >>> Let's have a brain storm about it!
> >>>
> >>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
> >>> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
> >>> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.
> >>
> >>>
> >>> External ceph-watchdog:
> >>> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
> >>>
> >>> Or maybe both ways PR7740 + external ?
> >>
> >> I'm not involved in any of this, but since you asked for a brain
> >> storm... ;)
> >>
> >> Is it worth bothering with the corrupted data structures case at all?
> >> Trying to handle it from within the aborting ceph-osd process is not a
> >> very easy thing to do ("raw way, not through Messenger", signals, etc)
> >> and if you do it wrong, you'd mask the original stack trace.  An
> >> external ceph-watchdog is yet another entity which has to be set up,
> >> maintained and accounted for.
> >>
> >> Why not just distinguish legitimate/expected errors which we check for
> >> but currently handle with assert(0) and the actual assert failures?  In
> >> the vast majority of cases that fall into the former bucket all of the
> >> internal data structures, including the messenger, will be in order and
> >> so we can send a MarkMeDown message and fail gracefully.  Implementing
> >> it is just a matter of identifying those sites, but that's not a bad
> >> exercise to do even on its own.
> >>
> >> The actual assert failures can abort() as they do now.  Any such
> >> failure is a serious bug and there's hopefully not too many of them to
> >> worry about shrinking the timeout to a minimum, unless there are hard
> >> numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
> >> you deserve to wait for MONs to catch up.  Am I missing any use cases
> >> here?
> >
> > This is something Sam and I have talked about in the past, but
> > apparently Sage didn't like that idea in
> > https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
> > instead?
> > Personally I tend towards building that kind of functionality into the
> > daemon, although he's right it will never be quite as good at catching
> > all cases as an external manager. The upside is that we don't have to
> > worry about the failure cases between the two of them. ;)
> > -Greg
> 
> There's no reason the watcher process can't be a child that's kicked
> off when the OSD startups. If there's a pipe between the two, when the
> parent goes away the child will get a EOF on reading from the pipe. On
> Linux you can also do a cute trick to have the child notified when
> parent quits using prctl(PR_SET_PDEATHSIG, SIG???).

That does simplify the startup/management piece, but it means one watcher 
per OSD, and since we want the watcher to have an active mon session to 
make the notification quick, it doubles the mon session load.

Honestly I don't think the separate daemon is that much of an issue--it's 
a systemd unit file and a pretty simple watchdog process.  The key 
management and systemd enable/activate bit is the part that will be 
annoying.

sage


* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-25 15:12       ` Sage Weil
@ 2016-03-25 15:23         ` Milosz Tanski
  2016-03-25 15:35         ` Matt Benjamin
  2016-03-29  6:07         ` Igor.Podoski
  2 siblings, 0 replies; 18+ messages in thread
From: Milosz Tanski @ 2016-03-25 15:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Ilya Dryomov, Igor.Podoski, ceph-devel

On Fri, Mar 25, 2016 at 11:12 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 25 Mar 2016, Milosz Tanski wrote:
>> On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
>> > On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
>> >> <Igor.Podoski@ts.fujitsu.com> wrote:
>> >>> Hi Cephers!
>> >>>
>> >>> Currently when we had a disk failure, assert() and then abort() was triggered and process was killed (ABRT). Other osds will eventually mark dead one as down, but it depends of heartbeat settings and monitor settings (mon_osd_min_down_reporters/mon_osd_min_down_reports). During dead-not-marked-as-down osd you can see blocked IO during writes and reads.
>> >>>
>> >>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is about sending MakrMeDown msg to monitor just before osd is going bye-bye. It prevents blocked IO in above case, and any other assert that is not on message sending path, so I need messenger/pipes/connections working for this. I've made some test and it looks good, when I pull out drive from my cluster during rados bench, IO blocks for less than 1 second or not at all, previously it was > 10 sec (on my cluster settings).
>> >>>
>> >>> Sage pointed me that some time ago was similar PR https://github.com/ceph/ceph/pull/6514 and there was a thought about ceph-watchdog process, that could monitor osd's and send info directly to monitor when they disappear. This would prevent all assert() cases, and other ones like kill -9 or similar.
>> >>>
>> >>> I have a few ideas how such functionality could be implemented, so my question is - does any of you started already doing something similar?
>> >>>
>> >>> Let's have a brain storm about it!
>> >>>
>> >>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
>> >>> - I think, I could send message with MarkMeDown payload, but in a raw way, not through Messenger path. This could be as good as bad in this case.
>> >>> - I could poke osd-neighbor through signal and neighbor will send Mark(SignalSender)Down message (this won't work If whole hdd controller will be down, all osd will be dead in narrow time window). So it's like instant bad-health heartbeat message. Still depends of Messenger send path of osd-neighbor.
>> >>
>> >>>
>> >>> External ceph-watchdog:
>> >>> Just like Sage wrote https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or similar: each osd, during start passes its own PID to ceph-watchdog process through shared memory/socket/named pipe (whatever). Ceph-watchdog checks if current PID exists, by checking changes in /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or folder is changed(missing) it sends MarkThisOsdDown to monitor and that's all. But this won't be watchdog strict, rather process down notify.
>> >>>
>> >>> Or maybe both ways PR7740 + external ?
>> >>
>> >> I'm not involved in any of this, but since you asked for a brain
>> >> storm... ;)
>> >>
>> >> Is it worth bothering with the corrupted data structures case at all?
>> >> Trying to handle it from within the aborting ceph-osd process is not a
>> >> very easy thing to do ("raw way, not through Messenger", signals, etc)
>> >> and if you do it wrong, you'd mask the original stack trace.  An
>> >> external ceph-watchdog is yet another entity which has to be set up,
>> >> maintained and accounted for.
>> >>
>> >> Why not just distinguish legitimate/expected errors which we check for
>> >> but currently handle with assert(0) and the actual assert failures?  In
>> >> the vast majority of cases that fall into the former bucket all of the
>> >> internal data structures, including the messenger, will be in order and
>> >> so we can send a MarkMeDown message and fail gracefully.  Implementing
>> >> it is just a matter of identifying those sites, but that's not a bad
>> >> exercise to do even on its own.
>> >>
>> >> The actual assert failures can abort() as they do now.  Any such
>> >> failure is a serious bug and there's hopefully not too many of them to
>> >> worry about shrinking the timeout to a minimum, unless there are hard
>> >> numbers that prove otherwise, of course.  And if you kill -9 your OSDs,
>> >> you deserve to wait for MONs to catch up.  Am I missing any use cases
>> >> here?
>> >
>> > This is something Sam and I have talked about in the past, but
>> > apparently Sage didn't like that idea in
>> > https://github.com/ceph/ceph/pull/6514 and suggested a daemon watcher
>> > instead?
>> > Personally I tend towards building that kind of functionality into the
>> > daemon, although he's right it will never be quite as good at catching
>> > all cases as an external manager. The upside is that we don't have to
>> > worry about the failure cases between the two of them. ;)
>> > -Greg
>>
>> There's no reason the watcher process can't be a child that's kicked
>> off when the OSD startups. If there's a pipe between the two, when the
>> parent goes away the child will get a EOF on reading from the pipe. On
>> Linux you can also do a cute trick to have the child notified when
>> parent quits using prctl(PR_SET_PDEATHSIG, SIG???).
>
> That does simplify the startup/management piece, but it means one watcher
> per OSD, and since we want the watcher to have an active mon session to
> make the notification quick, it doubles the mon session load.

You could opt to not connect to the mon from the child until the
parent goes away. But it's just a suggestion so no bigs.
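
To make the pipe-EOF idea concrete, here is a minimal sketch of such a child watcher (Python, for illustration only; mark_parent_down() is a placeholder for the actual report to the monitors, which a real watcher would send over a mon session or librados):

import os

def mark_parent_down(osd_id):
    # Placeholder: a real watcher would send "osd down" for osd_id to the monitors here.
    print("parent osd.%d died, reporting it down" % osd_id)

def spawn_watcher(osd_id):
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                  # child: the watcher
        os.close(w)
        # On Linux, prctl(PR_SET_PDEATHSIG, ...) could be added here as well.
        os.read(r, 1)             # blocks until EOF, i.e. until the parent exits
        mark_parent_down(osd_id)
        os._exit(0)
    os.close(r)                   # parent keeps only the write end open
    return pid, w                 # while w stays open, the child keeps waiting

# The parent never writes to the pipe; when it dies (even via abort() or kill -9),
# the kernel closes the write end and the child wakes up immediately.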

>
> Honestly I don't think the separate daemon is that much of an issue--it's
> a systemd unit file and a pretty simple watchdog process.  The key
> management and systemd enable/activate bit is the part that will be
> annoying.
>
> sage



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-25 15:12       ` Sage Weil
  2016-03-25 15:23         ` Milosz Tanski
@ 2016-03-25 15:35         ` Matt Benjamin
  2016-03-29  6:07         ` Igor.Podoski
  2 siblings, 0 replies; 18+ messages in thread
From: Matt Benjamin @ 2016-03-25 15:35 UTC (permalink / raw)
  To: Sage Weil
  Cc: Milosz Tanski, Gregory Farnum, Ilya Dryomov, Igor Podoski, ceph-devel

Hi,

----- Original Message -----
> From: "Sage Weil" <sweil@redhat.com>
> To: "Milosz Tanski" <milosz@adfin.com>
> Cc: "Gregory Farnum" <gfarnum@redhat.com>, "Ilya Dryomov" <idryomov@gmail.com>, "Igor Podoski"
> <Igor.Podoski@ts.fujitsu.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Friday, March 25, 2016 11:12:34 AM
> Subject: Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()

> > There's no reason the watcher process can't be a child that's kicked
> > off when the OSD startups. If there's a pipe between the two, when the
> > parent goes away the child will get a EOF on reading from the pipe. On
> > Linux you can also do a cute trick to have the child notified when
> > parent quits using prctl(PR_SET_PDEATHSIG, SIG???).
> 
> That does simplify the startup/management piece, but it means one watcher
> per OSD, and since we want the watcher to have an active mon session to
> make the notification quick, it doubles the mon session load.

Not sure if it's helpful, but

a) the count of sessions drops back once logical OSDs are colocated?
b) AFS had the notion of the "basic overseer" that started all its other processes--so I think you had the pipe infrastructure set up to do this sort of thing more like in Milosz's model, but just the one overseer per host

(But I might be misreading the thread.)

> 
> Honestly I don't think the separate daemon is that much of an issue--it's
> a systemd unit file and a pretty simple watchdog process.  The key
> management and systemd enable/activate bit is the part that will be
> annoying.
> 
> sage
> --

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-25 15:12       ` Sage Weil
  2016-03-25 15:23         ` Milosz Tanski
  2016-03-25 15:35         ` Matt Benjamin
@ 2016-03-29  6:07         ` Igor.Podoski
  2016-03-29  6:21           ` Igor.Podoski
  2 siblings, 1 reply; 18+ messages in thread
From: Igor.Podoski @ 2016-03-29  6:07 UTC (permalink / raw)
  To: Sage Weil, Milosz Tanski; +Cc: Gregory Farnum, Ilya Dryomov, ceph-devel

> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, March 25, 2016 4:13 PM
> To: Milosz Tanski
> Cc: Gregory Farnum; Ilya Dryomov; Podoski, Igor; ceph-devel
> Subject: Re: Ceph watchdog-like thing to reduce IO block during process goes
> down by abort()
> 
> On Fri, 25 Mar 2016, Milosz Tanski wrote:
> > On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum <gfarnum@redhat.com>
> wrote:
> > > On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov <idryomov@gmail.com>
> wrote:
> > >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
> > >> <Igor.Podoski@ts.fujitsu.com> wrote:
> > >>> Hi Cephers!
> > >>>
> > >>> Currently when we had a disk failure, assert() and then abort() was
> triggered and process was killed (ABRT). Other osds will eventually mark
> dead one as down, but it depends of heartbeat settings and monitor settings
> (mon_osd_min_down_reporters/mon_osd_min_down_reports). During
> dead-not-marked-as-down osd you can see blocked IO during writes and
> reads.
> > >>>
> > >>> Recently I've made https://github.com/ceph/ceph/pull/7740 which is
> about sending MakrMeDown msg to monitor just before osd is going bye-
> bye. It prevents blocked IO in above case, and any other assert that is not on
> message sending path, so I need messenger/pipes/connections working for
> this. I've made some test and it looks good, when I pull out drive from my
> cluster during rados bench, IO blocks for less than 1 second or not at all,
> previously it was > 10 sec (on my cluster settings).
> > >>>
> > >>> Sage pointed me that some time ago was similar PR
> https://github.com/ceph/ceph/pull/6514 and there was a thought about
> ceph-watchdog process, that could monitor osd's and send info directly to
> monitor when they disappear. This would prevent all assert() cases, and
> other ones like kill -9 or similar.
> > >>>
> > >>> I have a few ideas how such functionality could be implemented, so
> my question is - does any of you started already doing something similar?
> > >>>
> > >>> Let's have a brain storm about it!
> > >>>
> > >>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
> > >>> - I think, I could send message with MarkMeDown payload, but in a
> raw way, not through Messenger path. This could be as good as bad in this
> case.
> > >>> - I could poke osd-neighbor through signal and neighbor will send
> Mark(SignalSender)Down message (this won't work If whole hdd controller
> will be down, all osd will be dead in narrow time window). So it's like instant
> bad-health heartbeat message. Still depends of Messenger send path of
> osd-neighbor.
> > >>
> > >>>
> > >>> External ceph-watchdog:
> > >>> Just like Sage wrote
> https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or
> similar: each osd, during start passes its own PID to ceph-watchdog process
> through shared memory/socket/named pipe (whatever). Ceph-watchdog
> checks if current PID exists, by checking changes in /proc/PID or
> /proc/PID/cmd directory/file (maybe Inotify could handle this). When file or
> folder is changed(missing) it sends MarkThisOsdDown to monitor and that's
> all. But this won't be watchdog strict, rather process down notify.
> > >>>
> > >>> Or maybe both ways PR7740 + external ?
> > >>
> > >> I'm not involved in any of this, but since you asked for a brain
> > >> storm... ;)
> > >>
> > >> Is it worth bothering with the corrupted data structures case at all?
> > >> Trying to handle it from within the aborting ceph-osd process is
> > >> not a very easy thing to do ("raw way, not through Messenger",
> > >> signals, etc) and if you do it wrong, you'd mask the original stack
> > >> trace.  An external ceph-watchdog is yet another entity which has
> > >> to be set up, maintained and accounted for.
> > >>
> > >> Why not just distinguish legitimate/expected errors which we check
> > >> for but currently handle with assert(0) and the actual assert
> > >> failures?  In the vast majority of cases that fall into the former
> > >> bucket all of the internal data structures, including the
> > >> messenger, will be in order and so we can send a MarkMeDown
> message
> > >> and fail gracefully.  Implementing it is just a matter of
> > >> identifying those sites, but that's not a bad exercise to do even on its
> own.
> > >>
> > >> The actual assert failures can abort() as they do now.  Any such
> > >> failure is a serious bug and there's hopefully not too many of them
> > >> to worry about shrinking the timeout to a minimum, unless there are
> > >> hard numbers that prove otherwise, of course.  And if you kill -9
> > >> your OSDs, you deserve to wait for MONs to catch up.  Am I missing
> > >> any use cases here?
> > >
> > > This is something Sam and I have talked about in the past, but
> > > apparently Sage didn't like that idea in
> > > https://github.com/ceph/ceph/pull/6514 and suggested a daemon
> > > watcher instead?
> > > Personally I tend towards building that kind of functionality into
> > > the daemon, although he's right it will never be quite as good at
> > > catching all cases as an external manager. The upside is that we
> > > don't have to worry about the failure cases between the two of them.
> > > ;) -Greg
> >
> > There's no reason the watcher process can't be a child that's kicked
> > off when the OSD startups. If there's a pipe between the two, when the
> > parent goes away the child will get a EOF on reading from the pipe. On
> > Linux you can also do a cute trick to have the child notified when
> > parent quits using prctl(PR_SET_PDEATHSIG, SIG???).
> 
> That does simplify the startup/management piece, but it means one watcher
> per OSD, and since we want the watcher to have an active mon session to
> make the notification quick, it doubles the mon session load.

We could also do it like this:

ceph-watchdog creates a named pipe in /var/lib/ceph

the osd, before abort(), will:
- open the file
- write its own id 0,1,2...
- close the file

ceph-watchdog:
- waits for osd ids on the named pipe
- issues a mon_command(), e.g. cmd=[{"prefix": "osd down", "ids": ["1"]}], to the monitor, just like ceph osd down; this can be done through librados from C/python. I already have a small PoC in python for this and it seems to work.
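
A rough sketch of what that watchdog loop could look like (Python with the rados bindings; the fifo path and the client.watchdog cephx user are assumptions, and error handling is stripped down - this is only an illustration of the PoC, not the actual code):

import json
import os
import rados

FIFO = "/var/lib/ceph/osd-watchdog.fifo"   # assumed location

def mark_osd_down(cluster, osd_id):
    # Same effect as "ceph osd down <id>", sent over the already-open mon session.
    cmd = json.dumps({"prefix": "osd down", "ids": [osd_id]})
    ret, outbuf, outs = cluster.mon_command(cmd, b"")
    return ret

def main():
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf", name="client.watchdog")
    cluster.connect()                       # keep the mon session open all the time
    if not os.path.exists(FIFO):
        os.mkfifo(FIFO)
    while True:
        with open(FIFO) as fifo:            # blocks until some OSD opens it for writing
            for line in fifo:
                osd_id = line.strip()
                if osd_id:
                    mark_osd_down(cluster, osd_id)

if __name__ == "__main__":
    main()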

With the above we get one watcher per host and no standing connections from osd -> watchdog, but looking at the downsides:
- we could hit the open files limit
- or anything else could go wrong in the open/write/close path
- multiple OSDs could write to the pipe at the same time (with small constant-size writes this may not be an issue; I'm currently checking this)

Regards,
Igor.

> Honestly I don't think the separate daemon is that much of an issue--it's a
> systemd unit file and a pretty simple watchdog process.  The key
> management and systemd enable/activate bit is the part that will be
> annoying.
> 
> sage


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-29  6:07         ` Igor.Podoski
@ 2016-03-29  6:21           ` Igor.Podoski
  2016-04-04 12:29             ` Igor.Podoski
  0 siblings, 1 reply; 18+ messages in thread
From: Igor.Podoski @ 2016-03-29  6:21 UTC (permalink / raw)
  To: Igor.Podoski, Sage Weil, Milosz Tanski
  Cc: Gregory Farnum, Ilya Dryomov, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Igor.Podoski@ts.fujitsu.com
> Sent: Tuesday, March 29, 2016 8:07 AM
> To: Sage Weil; Milosz Tanski
> Cc: Gregory Farnum; Ilya Dryomov; ceph-devel
> Subject: RE: Ceph watchdog-like thing to reduce IO block during process goes
> down by abort()
> 
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, March 25, 2016 4:13 PM
> > To: Milosz Tanski
> > Cc: Gregory Farnum; Ilya Dryomov; Podoski, Igor; ceph-devel
> > Subject: Re: Ceph watchdog-like thing to reduce IO block during
> > process goes down by abort()
> >
> > On Fri, 25 Mar 2016, Milosz Tanski wrote:
> > > On Thu, Mar 24, 2016 at 4:25 PM, Gregory Farnum
> <gfarnum@redhat.com>
> > wrote:
> > > > On Thu, Mar 24, 2016 at 12:53 PM, Ilya Dryomov
> > > > <idryomov@gmail.com>
> > wrote:
> > > >> On Thu, Mar 24, 2016 at 8:00 AM, Igor.Podoski@ts.fujitsu.com
> > > >> <Igor.Podoski@ts.fujitsu.com> wrote:
> > > >>> Hi Cephers!
> > > >>>
> > > >>> Currently when we had a disk failure, assert() and then abort()
> > > >>> was
> > triggered and process was killed (ABRT). Other osds will eventually
> > mark dead one as down, but it depends of heartbeat settings and
> > monitor settings
> > (mon_osd_min_down_reporters/mon_osd_min_down_reports). During
> > dead-not-marked-as-down osd you can see blocked IO during writes and
> reads.
> > > >>>
> > > >>> Recently I've made https://github.com/ceph/ceph/pull/7740 which
> > > >>> is
> > about sending MakrMeDown msg to monitor just before osd is going bye-
> > bye. It prevents blocked IO in above case, and any other assert that
> > is not on message sending path, so I need messenger/pipes/connections
> > working for this. I've made some test and it looks good, when I pull
> > out drive from my cluster during rados bench, IO blocks for less than
> > 1 second or not at all, previously it was > 10 sec (on my cluster settings).
> > > >>>
> > > >>> Sage pointed me that some time ago was similar PR
> > https://github.com/ceph/ceph/pull/6514 and there was a thought about
> > ceph-watchdog process, that could monitor osd's and send info directly
> > to monitor when they disappear. This would prevent all assert() cases,
> > and other ones like kill -9 or similar.
> > > >>>
> > > >>> I have a few ideas how such functionality could be implemented,
> > > >>> so
> > my question is - does any of you started already doing something similar?
> > > >>>
> > > >>> Let's have a brain storm about it!
> > > >>>
> > > >>> Ideas about improving 7740/6514 MarkMeDown internal mechanism:
> > > >>> - I think, I could send message with MarkMeDown payload, but in
> > > >>> a
> > raw way, not through Messenger path. This could be as good as bad in
> > this case.
> > > >>> - I could poke osd-neighbor through signal and neighbor will
> > > >>> send
> > Mark(SignalSender)Down message (this won't work If whole hdd
> > controller will be down, all osd will be dead in narrow time window).
> > So it's like instant bad-health heartbeat message. Still depends of
> > Messenger send path of osd-neighbor.
> > > >>
> > > >>>
> > > >>> External ceph-watchdog:
> > > >>> Just like Sage wrote
> > https://github.com/ceph/ceph/pull/6514#issuecomment-159372845 Or
> > similar: each osd, during start passes its own PID to ceph-watchdog
> > process through shared memory/socket/named pipe (whatever).
> > Ceph-watchdog checks if current PID exists, by checking changes in
> > /proc/PID or /proc/PID/cmd directory/file (maybe Inotify could handle
> > this). When file or folder is changed(missing) it sends
> > MarkThisOsdDown to monitor and that's all. But this won't be watchdog
> strict, rather process down notify.
> > > >>>
> > > >>> Or maybe both ways PR7740 + external ?
> > > >>
> > > >> I'm not involved in any of this, but since you asked for a brain
> > > >> storm... ;)
> > > >>
> > > >> Is it worth bothering with the corrupted data structures case at all?
> > > >> Trying to handle it from within the aborting ceph-osd process is
> > > >> not a very easy thing to do ("raw way, not through Messenger",
> > > >> signals, etc) and if you do it wrong, you'd mask the original
> > > >> stack trace.  An external ceph-watchdog is yet another entity
> > > >> which has to be set up, maintained and accounted for.
> > > >>
> > > >> Why not just distinguish legitimate/expected errors which we
> > > >> check for but currently handle with assert(0) and the actual
> > > >> assert failures?  In the vast majority of cases that fall into
> > > >> the former bucket all of the internal data structures, including
> > > >> the messenger, will be in order and so we can send a MarkMeDown
> > message
> > > >> and fail gracefully.  Implementing it is just a matter of
> > > >> identifying those sites, but that's not a bad exercise to do even
> > > >> on its
> > own.
> > > >>
> > > >> The actual assert failures can abort() as they do now.  Any such
> > > >> failure is a serious bug and there's hopefully not too many of
> > > >> them to worry about shrinking the timeout to a minimum, unless
> > > >> there are hard numbers that prove otherwise, of course.  And if
> > > >> you kill -9 your OSDs, you deserve to wait for MONs to catch up.
> > > >> Am I missing any use cases here?
> > > >
> > > > This is something Sam and I have talked about in the past, but
> > > > apparently Sage didn't like that idea in
> > > > https://github.com/ceph/ceph/pull/6514 and suggested a daemon
> > > > watcher instead?
> > > > Personally I tend towards building that kind of functionality into
> > > > the daemon, although he's right it will never be quite as good at
> > > > catching all cases as an external manager. The upside is that we
> > > > don't have to worry about the failure cases between the two of them.
> > > > ;) -Greg
> > >
> > > There's no reason the watcher process can't be a child that's kicked
> > > off when the OSD startups. If there's a pipe between the two, when
> > > the parent goes away the child will get a EOF on reading from the
> > > pipe. On Linux you can also do a cute trick to have the child
> > > notified when parent quits using prctl(PR_SET_PDEATHSIG, SIG???).
> >
> > That does simplify the startup/management piece, but it means one
> > watcher per OSD, and since we want the watcher to have an active mon
> > session to make the notification quick, it doubles the mon session load.
> 
> We could also do it like this:
> 
> ceph-watchdog creates named pipe in /var/lib/ceph
> 
> osd before abort will:
> - open file
> - wite its own id 0,1,2...
> - close file
> 
> ceph-watchdog:
> - waits for osd ids on named pipe
> - issues a mon_command()  e.g. cmd=[{"prefix": "osd down", "ids": ["1"]}]  to
> the monitor just like ceph osd down, this can be done by librados from
> C/python, already have small PoC in python for this, seems to work.

Of course, it keeps the connection to the monitor open/active all the time.

> Thanks to above we have one watcher per host, no connections from osd ->
> watchdog, but looking on the downsides:
> - we could hit open files limit
> - or anything else using open/write/close
> - multiple OSD's could write to pipe in the same time (maybe using small
> constant writes here won't be an issue, currently checking this)

An additional advantage of writing the OSD id to the fifo: we could easily add a backup mechanism in systemd for the case where open/write/close fails (see the sketch below).
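
For the OSD side, the write before abort() could be as small as the following (a sketch only - in the real OSD this would of course be C++, the fifo path is the same assumption as before, and the fallback is just a placeholder for whatever systemd-based backup we settle on):

import os

FIFO = "/var/lib/ceph/osd-watchdog.fifo"   # assumed, must match the watchdog side

def notify_watchdog(osd_id):
    try:
        # O_NONBLOCK so the abort path cannot hang if nobody is reading the fifo.
        fd = os.open(FIFO, os.O_WRONLY | os.O_NONBLOCK)
        try:
            os.write(fd, ("%d\n" % osd_id).encode())
        finally:
            os.close(fd)
        return True
    except OSError:
        # open/write failed: fall back to systemd (or the normal heartbeat timeout).
        return False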

> Regards,
> Igor.
> 
> > Honestly I don't think the separate daemon is that much of an
> > issue--it's a systemd unit file and a pretty simple watchdog process.
> > The key management and systemd enable/activate bit is the part that
> > will be annoying.
> >
> > sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: Ceph watchdog-like thing to reduce IO block during process goes down by abort()
  2016-03-29  6:21           ` Igor.Podoski
@ 2016-04-04 12:29             ` Igor.Podoski
  0 siblings, 0 replies; 18+ messages in thread
From: Igor.Podoski @ 2016-04-04 12:29 UTC (permalink / raw)
  To: ceph-devel

Hello,

First of all I wanted to thank you all for the discussion.

Now, to sum up pros and cons:

1. Separate watchdog process.
Pros:
- prevents IO blocking in every case where the OSD dies: abort(), kill -9, oom kill, whatever
- works both for a single hdd going down and for a whole controller
- no additional time holding the process in a "bad" state

Cons:
- another process needs to be maintained/tested/documented (cephx keys, files, etc.)
- additional startup scripts need to be created/maintained...
- new tests need to be written
- needs another connection to the monitor (not good for small nodes with one hdd/osd)
- an additional open socket from every osd to the watchdog, or an open/write/close on the named pipe before assert()

2. ceph_abort_markmedown() before ceph_abort()
Pros:
- prevents IO blocking in most cases related to disk access
- done in existing code, nothing new to maintain
- assert cleanups; potentially a good place from which to gather statistics in the future
- works both for a single hdd going down and for a whole controller

Cons:
- tests need to be written (maybe extending existing ones?)
- can only be done in specific places, where the connection to the monitor has already been made and is working
- needs to be implemented mostly by hand
- some places could be missed
- some additional time holding the process in a "bad" state, waiting for the MarkMeDown ack
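
Conceptually, ceph_abort_markmedown() boils down to the control flow below (Python pseudocode just to show the ordering; the real thing would live in the OSD's C++ error paths and use the existing MarkMeDown machinery, and send_mark_me_down() is a placeholder):

import os
import threading

MARKMEDOWN_TIMEOUT = 1.0   # seconds the process is allowed to sit in a "bad" state

def ceph_abort_markmedown(send_mark_me_down):
    acked = threading.Event()
    try:
        # Placeholder: queue MOSDMarkMeDown on the still-working mon connection
        # and have the ack handler set the event.
        send_mark_me_down(acked)
        acked.wait(MARKMEDOWN_TIMEOUT)   # bounded wait; never hang the abort path
    except Exception:
        pass                             # if the messenger is already broken, just abort
    os.abort()                           # original behaviour is preserved either way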

If we go with ceph_abort_markmedown(), this is basically cxwshawn's past PR 6514, so I would feel bad just redoing it. He was first with this idea; maybe he could reopen it, or we'll somehow do it together. I would like to have a clear situation here.


Another idea:
3. Use the existing heartbeat to fast-mark a dead OSD neighbor.

The idea:
When a process dies, all of its sockets are closed, including the heartbeat ones. Then OSD::heartbeat_reset(..) is triggered (on the other OSDs) to close the old connections to the dead one and reopen them.

When the connection with a heartbeat peer is closed and that peer has the same address as us (the same node), we could send MOSDFailure() without waiting for the grace time. Combined with "mon osd min down reporters" set to > 1, this could let an OSD be marked down immediately by its neighbors.

So this would change/speed up heartbeat behavior only for local OSDs.
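
To make the proposed policy explicit, a small sketch of the decision (Python pseudocode again; in Ceph this would sit in the C++ heartbeat_reset() path, and report_failure() stands in for sending MOSDFailure to the monitor):

from collections import namedtuple

Peer = namedtuple("Peer", ["osd_id", "addr"])

def on_heartbeat_reset(my_addr, peer, report_failure):
    # Called when the heartbeat connection to 'peer' is torn down.
    # Fast-report only local neighbors (same node address); remote resets keep
    # the normal grace-time handling so network blips don't cause flapping.
    if peer.addr == my_addr:
        report_failure(peer.osd_id)   # i.e. MOSDFailure without waiting for grace
        return True
    return False

# e.g. on_heartbeat_reset("10.0.0.1", Peer(3, "10.0.0.1"),
#                         lambda i: print("report osd.%d down" % i))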

The question is: does it make sense? In what cases, other than abort()/kill/shutdown, would all heartbeat connections from one OSD be closed/restarted? Firewall reconfiguration, too many open files/sockets, an nf_conntrack problem?

If only one or some of those connections go down (but fewer than osd min down reporters) and are then recreated, the OSD will start sending new heartbeats and won't be marked down.

Pros:
- prevents IO blocking in every case where the OSD dies: abort(), kill -9, oom kill, whatever. BUT only with "mon osd min down reporters" set to a proper value and the other neighbor OSDs alive.
- done in one place in existing code
- no additional time holding the process in a "bad" state

Cons:
- tests need to be written, or maybe the existing heartbeat or "min down reporters" tests could cover this?
- "osd min down reporters" could be wrongly calculated
- works only when the connection to the monitor has already been made and is working
- messes with/changes the already working heartbeat infrastructure, which is tested and stable
- could miss a whole drive controller failure, since every OSD will go down within a small time window
- won't work in an environment with one OSD per node
- needs local neighbors to work
- if something goes wrong (caused by a bad implementation of this idea), an OSD could flap between down and up


So ... where do we go from here?

Looking at the most visible cons, 1 vs 2 is "maintenance" vs "process held in a bad state for some time".
Most visible pros: 1 vs 2 is "all OSD death cases covered" vs "better asserts and error tracking".


Regards,
Igor.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-04-04 12:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-24  7:00 Ceph watchdog-like thing to reduce IO block during process goes down by abort() Igor.Podoski
     [not found] ` <CACJqLyax9Nntz09qJF2E_jjtoVa2moJxzW5T42bHVFGLop31Ug@mail.gmail.com>
2016-03-24 11:47   ` Igor.Podoski
2016-03-24 20:55     ` Sage Weil
2016-03-24 21:15       ` Ilya Dryomov
2016-03-24 21:20         ` Gregory Farnum
2016-03-25 12:54           ` Sage Weil
2016-03-25 14:30             ` Ilya Dryomov
2016-03-24 19:53 ` Ilya Dryomov
2016-03-24 20:25   ` Gregory Farnum
2016-03-24 20:46     ` Ilya Dryomov
2016-03-24 20:47       ` Gregory Farnum
2016-03-25 15:08     ` Milosz Tanski
2016-03-25 15:12       ` Sage Weil
2016-03-25 15:23         ` Milosz Tanski
2016-03-25 15:35         ` Matt Benjamin
2016-03-29  6:07         ` Igor.Podoski
2016-03-29  6:21           ` Igor.Podoski
2016-04-04 12:29             ` Igor.Podoski
