All of lore.kernel.org
 help / color / mirror / Atom feed
* Fork and RDMA operations
@ 2016-08-11 18:51 Vu Pham
  2016-08-12 13:22 ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Vu Pham @ 2016-08-11 18:51 UTC (permalink / raw)
  To: ceph-devel; +Cc: Avner Ben Hanoch, Oren Duer

Hello all,

The background:
We have tested scaling with xio messenger and faced multiple *unknown* 
problems (hard to trace and reproduce). We recently find out that the 
daemonize/fork support isn't full in ibverbs. It assumes that the parent 
process will do the RDMA operations. Any child process try to do rdma 
operations will experience various unexpected problems.

ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating messengers.
Xio messenger will initialize accelio library and register RDMA memory 
in the 1st call to XioMessenger constructor.
This situation is very problematic where child process do rdma 
operations as described above

http://www.rdmamojo.com/2012/05/24/ibv_fork_init
http://www.spinics.net/lists/linux-rdma/msg03364.html
I create this PR which forces to daemonize/fork before creating 
messenger

https://github.com/ceph/ceph/pull/10600

I have tested this patch by bringing up a cluster with 4 nodes, 8 
osds/node, two monitors and run I/Os (4K - 4M block size) from 4 fio 
clients.

Is there any known problem to daemonize/fork before creating messenger?
Could you help to review and provide feedback?

thanks,
-vu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-11 18:51 Fork and RDMA operations Vu Pham
@ 2016-08-12 13:22 ` Sage Weil
  2016-08-12 15:22   ` Haomai Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2016-08-12 13:22 UTC (permalink / raw)
  To: Vu Pham; +Cc: ceph-devel, Avner Ben Hanoch, Oren Duer

On Thu, 11 Aug 2016, Vu Pham wrote:
> Hello all,
> 
> The background:
> We have tested scaling with xio messenger and faced multiple *unknown* 
> problems (hard to trace and reproduce). We recently find out that the 
> daemonize/fork support isn't full in ibverbs. It assumes that the parent 
> process will do the RDMA operations. Any child process try to do rdma 
> operations will experience various unexpected problems.
> 
> ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating messengers.
> Xio messenger will initialize accelio library and register RDMA memory 
> in the 1st call to XioMessenger constructor.
> This situation is very problematic where child process do rdma 
> operations as described above
> 
> http://www.rdmamojo.com/2012/05/24/ibv_fork_init
> http://www.spinics.net/lists/linux-rdma/msg03364.html
> I create this PR which forces to daemonize/fork before creating 
> messenger
> 
> https://github.com/ceph/ceph/pull/10600
> 
> I have tested this patch by bringing up a cluster with 4 nodes, 8 
> osds/node, two monitors and run I/Os (4K - 4M block size) from 4 fio 
> clients.
> 
> Is there any known problem to daemonize/fork before creating messenger?
> Could you help to review and provide feedback?

This more or less works.  The main issue is that we don't catch errors as 
early as we did and a daemon may appear to start and then immediately 
exit without printing an error.

There is a Preforker class in common that is meant to address this (it's 
used by ceph-fuse and ceph-mon already).  It does the fork early, when 
prefork() is called, and keeps stdout/stderr open for an interim period 
until you call preforker.exit() or .daemonize().  Any exit code gets 
passed back to the parent over a socket.  I'm guessing that the mon is 
already working fine since you're just moving the prefork.daemonize() line 
around (the actual fork happened way back at

	https://github.com/vuhuong/ceph-upstream/blob/5cd5546fb7ddb9ef69380476b0a80038ba74a405/src/ceph_mon.cc#L500

) and you just need to make ceph_osd.cc and ceph_mds.cc use Preforker in a 
similar way.

sage

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-12 13:22 ` Sage Weil
@ 2016-08-12 15:22   ` Haomai Wang
  2016-08-12 16:27     ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Haomai Wang @ 2016-08-12 15:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: Vu Pham, ceph-devel, Avner Ben Hanoch, Oren Duer

Hi Vu,

Actually you and me go into the same loop, in my async backend
pr(https://github.com/ceph/ceph/pull/10264) the
commit(https://github.com/ceph/ceph/pull/10264/commits/7055bafbcbf06425e71808cb2b089d1d04706728)
defines a interface for prefork/postfork things. It's much like
global_init_prefork_start/global_init_postfork_start but it's a
generic interface.

Refer to Kefu's comment why we need this:
=============
note to myself, w.r.t. the before/after daemonize hook

1. it's a natural way to do bind/rebind in the event thread
2. we do bind before daemon(2) now
3. the child process after daemon(2) is a single threaded process, and
all event threads are terminated, so no threads is taking care of
bind/rebind after daemon(2),

that's why we need to re-spawn the threads after daemon(2).
=============

So let's resolve alike problem like this


On Fri, Aug 12, 2016 at 9:22 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 11 Aug 2016, Vu Pham wrote:
>> Hello all,
>>
>> The background:
>> We have tested scaling with xio messenger and faced multiple *unknown*
>> problems (hard to trace and reproduce). We recently find out that the
>> daemonize/fork support isn't full in ibverbs. It assumes that the parent
>> process will do the RDMA operations. Any child process try to do rdma
>> operations will experience various unexpected problems.
>>
>> ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating messengers.
>> Xio messenger will initialize accelio library and register RDMA memory
>> in the 1st call to XioMessenger constructor.
>> This situation is very problematic where child process do rdma
>> operations as described above
>>
>> http://www.rdmamojo.com/2012/05/24/ibv_fork_init
>> http://www.spinics.net/lists/linux-rdma/msg03364.html
>> I create this PR which forces to daemonize/fork before creating
>> messenger
>>
>> https://github.com/ceph/ceph/pull/10600
>>
>> I have tested this patch by bringing up a cluster with 4 nodes, 8
>> osds/node, two monitors and run I/Os (4K - 4M block size) from 4 fio
>> clients.
>>
>> Is there any known problem to daemonize/fork before creating messenger?
>> Could you help to review and provide feedback?
>
> This more or less works.  The main issue is that we don't catch errors as
> early as we did and a daemon may appear to start and then immediately
> exit without printing an error.
>
> There is a Preforker class in common that is meant to address this (it's
> used by ceph-fuse and ceph-mon already).  It does the fork early, when
> prefork() is called, and keeps stdout/stderr open for an interim period
> until you call preforker.exit() or .daemonize().  Any exit code gets
> passed back to the parent over a socket.  I'm guessing that the mon is
> already working fine since you're just moving the prefork.daemonize() line
> around (the actual fork happened way back at
>
>         https://github.com/vuhuong/ceph-upstream/blob/5cd5546fb7ddb9ef69380476b0a80038ba74a405/src/ceph_mon.cc#L500
>
> ) and you just need to make ceph_osd.cc and ceph_mds.cc use Preforker in a
> similar way.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-12 15:22   ` Haomai Wang
@ 2016-08-12 16:27     ` Sage Weil
  2016-08-12 17:01       ` Haomai Wang
  2016-08-24 21:18       ` Casey Bodley
  0 siblings, 2 replies; 12+ messages in thread
From: Sage Weil @ 2016-08-12 16:27 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Vu Pham, ceph-devel, Avner Ben Hanoch, Oren Duer

On Fri, 12 Aug 2016, Haomai Wang wrote:
> Hi Vu,
> 
> Actually you and me go into the same loop, in my async backend
> pr(https://github.com/ceph/ceph/pull/10264) the
> commit(https://github.com/ceph/ceph/pull/10264/commits/7055bafbcbf06425e71808cb2b089d1d04706728)
> defines a interface for prefork/postfork things. It's much like
> global_init_prefork_start/global_init_postfork_start but it's a
> generic interface.
> 
> Refer to Kefu's comment why we need this:
> =============
> note to myself, w.r.t. the before/after daemonize hook
> 
> 1. it's a natural way to do bind/rebind in the event thread
> 2. we do bind before daemon(2) now
> 3. the child process after daemon(2) is a single threaded process, and
> all event threads are terminated, so no threads is taking care of
> bind/rebind after daemon(2),
> 
> that's why we need to re-spawn the threads after daemon(2).
> =============
> 
> So let's resolve alike problem like this

It seems like it would be simpler to push the fork before any important 
operations.  (And BTW with systemd and upstart we don't fork anyway; it's 
just there for sysvinit.)  The preforker thing is there to make it easy to 
fork early, but keep the parent waiting around so that you can do more 
intialization, print errors, and terminate with an error code if something 
(post-fork) goes wrong.  In theory, there's no reason why we couldn't make 
this almost the very first thing the daemon does so that *all* work is 
done in the child...

sage


> 
> 
> On Fri, Aug 12, 2016 at 9:22 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 11 Aug 2016, Vu Pham wrote:
> >> Hello all,
> >>
> >> The background:
> >> We have tested scaling with xio messenger and faced multiple *unknown*
> >> problems (hard to trace and reproduce). We recently find out that the
> >> daemonize/fork support isn't full in ibverbs. It assumes that the parent
> >> process will do the RDMA operations. Any child process try to do rdma
> >> operations will experience various unexpected problems.
> >>
> >> ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating messengers.
> >> Xio messenger will initialize accelio library and register RDMA memory
> >> in the 1st call to XioMessenger constructor.
> >> This situation is very problematic where child process do rdma
> >> operations as described above
> >>
> >> http://www.rdmamojo.com/2012/05/24/ibv_fork_init
> >> http://www.spinics.net/lists/linux-rdma/msg03364.html
> >> I create this PR which forces to daemonize/fork before creating
> >> messenger
> >>
> >> https://github.com/ceph/ceph/pull/10600
> >>
> >> I have tested this patch by bringing up a cluster with 4 nodes, 8
> >> osds/node, two monitors and run I/Os (4K - 4M block size) from 4 fio
> >> clients.
> >>
> >> Is there any known problem to daemonize/fork before creating messenger?
> >> Could you help to review and provide feedback?
> >
> > This more or less works.  The main issue is that we don't catch errors as
> > early as we did and a daemon may appear to start and then immediately
> > exit without printing an error.
> >
> > There is a Preforker class in common that is meant to address this (it's
> > used by ceph-fuse and ceph-mon already).  It does the fork early, when
> > prefork() is called, and keeps stdout/stderr open for an interim period
> > until you call preforker.exit() or .daemonize().  Any exit code gets
> > passed back to the parent over a socket.  I'm guessing that the mon is
> > already working fine since you're just moving the prefork.daemonize() line
> > around (the actual fork happened way back at
> >
> >         https://github.com/vuhuong/ceph-upstream/blob/5cd5546fb7ddb9ef69380476b0a80038ba74a405/src/ceph_mon.cc#L500
> >
> > ) and you just need to make ceph_osd.cc and ceph_mds.cc use Preforker in a
> > similar way.
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-12 16:27     ` Sage Weil
@ 2016-08-12 17:01       ` Haomai Wang
  2016-08-16 17:20         ` Vu Pham
  2016-08-24 21:18       ` Casey Bodley
  1 sibling, 1 reply; 12+ messages in thread
From: Haomai Wang @ 2016-08-12 17:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: Vu Pham, ceph-devel, Avner Ben Hanoch, Oren Duer

On Sat, Aug 13, 2016 at 12:27 AM, Sage Weil <sage@newdream.net> wrote:
> On Fri, 12 Aug 2016, Haomai Wang wrote:
>> Hi Vu,
>>
>> Actually you and me go into the same loop, in my async backend
>> pr(https://github.com/ceph/ceph/pull/10264) the
>> commit(https://github.com/ceph/ceph/pull/10264/commits/7055bafbcbf06425e71808cb2b089d1d04706728)
>> defines a interface for prefork/postfork things. It's much like
>> global_init_prefork_start/global_init_postfork_start but it's a
>> generic interface.
>>
>> Refer to Kefu's comment why we need this:
>> =============
>> note to myself, w.r.t. the before/after daemonize hook
>>
>> 1. it's a natural way to do bind/rebind in the event thread
>> 2. we do bind before daemon(2) now
>> 3. the child process after daemon(2) is a single threaded process, and
>> all event threads are terminated, so no threads is taking care of
>> bind/rebind after daemon(2),
>>
>> that's why we need to re-spawn the threads after daemon(2).
>> =============
>>
>> So let's resolve alike problem like this
>
> It seems like it would be simpler to push the fork before any important
> operations.  (And BTW with systemd and upstart we don't fork anyway; it's
> just there for sysvinit.)  The preforker thing is there to make it easy to
> fork early, but keep the parent waiting around so that you can do more
> intialization, print errors, and terminate with an error code if something
> (post-fork) goes wrong.  In theory, there's no reason why we couldn't make
> this almost the very first thing the daemon does so that *all* work is
> done in the child...

Yes, but I think it would be a big task now. A lot of works need to be
done next.

>
> sage
>
>
>>
>>
>> On Fri, Aug 12, 2016 at 9:22 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Thu, 11 Aug 2016, Vu Pham wrote:
>> >> Hello all,
>> >>
>> >> The background:
>> >> We have tested scaling with xio messenger and faced multiple *unknown*
>> >> problems (hard to trace and reproduce). We recently find out that the
>> >> daemonize/fork support isn't full in ibverbs. It assumes that the parent
>> >> process will do the RDMA operations. Any child process try to do rdma
>> >> operations will experience various unexpected problems.
>> >>
>> >> ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating messengers.
>> >> Xio messenger will initialize accelio library and register RDMA memory
>> >> in the 1st call to XioMessenger constructor.
>> >> This situation is very problematic where child process do rdma
>> >> operations as described above
>> >>
>> >> http://www.rdmamojo.com/2012/05/24/ibv_fork_init
>> >> http://www.spinics.net/lists/linux-rdma/msg03364.html
>> >> I create this PR which forces to daemonize/fork before creating
>> >> messenger
>> >>
>> >> https://github.com/ceph/ceph/pull/10600
>> >>
>> >> I have tested this patch by bringing up a cluster with 4 nodes, 8
>> >> osds/node, two monitors and run I/Os (4K - 4M block size) from 4 fio
>> >> clients.
>> >>
>> >> Is there any known problem to daemonize/fork before creating messenger?
>> >> Could you help to review and provide feedback?
>> >
>> > This more or less works.  The main issue is that we don't catch errors as
>> > early as we did and a daemon may appear to start and then immediately
>> > exit without printing an error.
>> >
>> > There is a Preforker class in common that is meant to address this (it's
>> > used by ceph-fuse and ceph-mon already).  It does the fork early, when
>> > prefork() is called, and keeps stdout/stderr open for an interim period
>> > until you call preforker.exit() or .daemonize().  Any exit code gets
>> > passed back to the parent over a socket.  I'm guessing that the mon is
>> > already working fine since you're just moving the prefork.daemonize() line
>> > around (the actual fork happened way back at
>> >
>> >         https://github.com/vuhuong/ceph-upstream/blob/5cd5546fb7ddb9ef69380476b0a80038ba74a405/src/ceph_mon.cc#L500
>> >
>> > ) and you just need to make ceph_osd.cc and ceph_mds.cc use Preforker in a
>> > similar way.
>> >
>> > sage
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-12 17:01       ` Haomai Wang
@ 2016-08-16 17:20         ` Vu Pham
  0 siblings, 0 replies; 12+ messages in thread
From: Vu Pham @ 2016-08-16 17:20 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: ceph-devel, Avner Ben Hanoch, Oren Duer

Thanks Sage & Haomai for your review and feedback

I'll rework the patch to use Preforker for ceph_osd and ceph_mds in the 
similar way as ceph_mon per Sage's recommendation

-vu

On 8/12/2016 10:01:08 AM, "Haomai Wang" <haomai@xsky.com> wrote:

>On Sat, Aug 13, 2016 at 12:27 AM, Sage Weil <sage@newdream.net> wrote:
>>  On Fri, 12 Aug 2016, Haomai Wang wrote:
>>>  Hi Vu,
>>>
>>>  Actually you and me go into the same loop, in my async backend
>>>  pr(https://github.com/ceph/ceph/pull/10264) the
>>>  
>>>commit(https://github.com/ceph/ceph/pull/10264/commits/7055bafbcbf06425e71808cb2b089d1d04706728)
>>>  defines a interface for prefork/postfork things. It's much like
>>>  global_init_prefork_start/global_init_postfork_start but it's a
>>>  generic interface.
>>>
>>>  Refer to Kefu's comment why we need this:
>>>  =============
>>>  note to myself, w.r.t. the before/after daemonize hook
>>>
>>>  1. it's a natural way to do bind/rebind in the event thread
>>>  2. we do bind before daemon(2) now
>>>  3. the child process after daemon(2) is a single threaded process, 
>>>and
>>>  all event threads are terminated, so no threads is taking care of
>>>  bind/rebind after daemon(2),
>>>
>>>  that's why we need to re-spawn the threads after daemon(2).
>>>  =============
>>>
>>>  So let's resolve alike problem like this
>>
>>  It seems like it would be simpler to push the fork before any 
>>important
>>  operations.  (And BTW with systemd and upstart we don't fork anyway; 
>>it's
>>  just there for sysvinit.)  The preforker thing is there to make it 
>>easy to
>>  fork early, but keep the parent waiting around so that you can do 
>>more
>>  intialization, print errors, and terminate with an error code if 
>>something
>>  (post-fork) goes wrong.  In theory, there's no reason why we couldn't 
>>make
>>  this almost the very first thing the daemon does so that *all* work 
>>is
>>  done in the child...
>
>Yes, but I think it would be a big task now. A lot of works need to be
>done next.
>
>>
>>  sage
>>
>>
>>>
>>>
>>>  On Fri, Aug 12, 2016 at 9:22 PM, Sage Weil <sage@newdream.net> 
>>>wrote:
>>>  > On Thu, 11 Aug 2016, Vu Pham wrote:
>>>  >> Hello all,
>>>  >>
>>>  >> The background:
>>>  >> We have tested scaling with xio messenger and faced multiple 
>>>*unknown*
>>>  >> problems (hard to trace and reproduce). We recently find out that 
>>>the
>>>  >> daemonize/fork support isn't full in ibverbs. It assumes that the 
>>>parent
>>>  >> process will do the RDMA operations. Any child process try to do 
>>>rdma
>>>  >> operations will experience various unexpected problems.
>>>  >>
>>>  >> ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating 
>>>messengers.
>>>  >> Xio messenger will initialize accelio library and register RDMA 
>>>memory
>>>  >> in the 1st call to XioMessenger constructor.
>>>  >> This situation is very problematic where child process do rdma
>>>  >> operations as described above
>>>  >>
>>>  >> http://www.rdmamojo.com/2012/05/24/ibv_fork_init
>>>  >> http://www.spinics.net/lists/linux-rdma/msg03364.html
>>>  >> I create this PR which forces to daemonize/fork before creating
>>>  >> messenger
>>>  >>
>>>  >> https://github.com/ceph/ceph/pull/10600
>>>  >>
>>>  >> I have tested this patch by bringing up a cluster with 4 nodes, 8
>>>  >> osds/node, two monitors and run I/Os (4K - 4M block size) from 4 
>>>fio
>>>  >> clients.
>>>  >>
>>>  >> Is there any known problem to daemonize/fork before creating 
>>>messenger?
>>>  >> Could you help to review and provide feedback?
>>>  >
>>>  > This more or less works.  The main issue is that we don't catch 
>>>errors as
>>>  > early as we did and a daemon may appear to start and then 
>>>immediately
>>>  > exit without printing an error.
>>>  >
>>>  > There is a Preforker class in common that is meant to address this 
>>>(it's
>>>  > used by ceph-fuse and ceph-mon already).  It does the fork early, 
>>>when
>>>  > prefork() is called, and keeps stdout/stderr open for an interim 
>>>period
>>>  > until you call preforker.exit() or .daemonize().  Any exit code 
>>>gets
>>>  > passed back to the parent over a socket.  I'm guessing that the 
>>>mon is
>>>  > already working fine since you're just moving the 
>>>prefork.daemonize() line
>>>  > around (the actual fork happened way back at
>>>  >
>>>  >         
>>>https://github.com/vuhuong/ceph-upstream/blob/5cd5546fb7ddb9ef69380476b0a80038ba74a405/src/ceph_mon.cc#L500
>>>  >
>>>  > ) and you just need to make ceph_osd.cc and ceph_mds.cc use 
>>>Preforker in a
>>>  > similar way.
>>>  >
>>>  > sage
>>>  > --
>>>  > To unsubscribe from this list: send the line "unsubscribe 
>>>ceph-devel" in
>>>  > the body of a message to majordomo@vger.kernel.org
>>>  > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>  --
>>>  To unsubscribe from this list: send the line "unsubscribe 
>>>ceph-devel" in
>>>  the body of a message to majordomo@vger.kernel.org
>>>  More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-12 16:27     ` Sage Weil
  2016-08-12 17:01       ` Haomai Wang
@ 2016-08-24 21:18       ` Casey Bodley
  2016-08-24 21:32         ` Sage Weil
  2016-08-24 23:21         ` Adam C. Emerson
  1 sibling, 2 replies; 12+ messages in thread
From: Casey Bodley @ 2016-08-24 21:18 UTC (permalink / raw)
  Cc: ceph-devel


On 08/12/2016 12:27 PM, Sage Weil wrote:
> It seems like it would be simpler to push the fork before any important
> operations.  (And BTW with systemd and upstart we don't fork anyway; it's
> just there for sysvinit.)  The preforker thing is there to make it easy to
> fork early, but keep the parent waiting around so that you can do more
> intialization, print errors, and terminate with an error code if something
> (post-fork) goes wrong.  In theory, there's no reason why we couldn't make
> this almost the very first thing the daemon does so that *all* work is
> done in the child...
>
> sage
>

I've recently run into issues related to fork as well (see my "memory 
leaks related to CephContext and global_init_daemonize()" email). Trying 
to manage resources across a fork is difficult and error-prone, so 
changing how we daemonize could eliminate a whole class of these bugs. 
And as Haomai points out, we're going to great lengths to make things 
work in the current model.

I'm a big fan of Sage's theory that all work could be done in the child 
process, and I'm willing to take on the project if we can reach a 
consensus on the design.

Whether or not we're interested in long-term support for SysV, the 
ability to daemonize is useful our for development workflow (vstart.sh 
in particular). To fill this role, some basic requirements for the 
parent process are:
* don't exit until initialization is finished
* return an error code if initialization failed

As Sage pointed out, this is exactly what Preforker is doing for 
ceph-mon. So we can start by changing the other daemons to use that 
instead of global_init_daemonize().

The next step is to prevent global_init()/common_init()/CephContext from 
doing any work in the parent process (esp. spawning threads for Log, 
CephContextServiceThread, and AdminSocket). Decoupling the config 
parsing from CephContext initialization seems like a natural way to 
accomplish that. So the parent would create and initialize the 
md_config_t object, then after fork, the child would pass that as an 
argument when creating the CephContext.

How does that sound for a start?

Casey

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-24 21:18       ` Casey Bodley
@ 2016-08-24 21:32         ` Sage Weil
  2016-08-25 12:43           ` Casey Bodley
  2016-08-24 23:21         ` Adam C. Emerson
  1 sibling, 1 reply; 12+ messages in thread
From: Sage Weil @ 2016-08-24 21:32 UTC (permalink / raw)
  To: Casey Bodley; +Cc: ceph-devel

On Wed, 24 Aug 2016, Casey Bodley wrote:
> On 08/12/2016 12:27 PM, Sage Weil wrote:
> > It seems like it would be simpler to push the fork before any important
> > operations.  (And BTW with systemd and upstart we don't fork anyway; it's
> > just there for sysvinit.)  The preforker thing is there to make it easy to
> > fork early, but keep the parent waiting around so that you can do more
> > intialization, print errors, and terminate with an error code if something
> > (post-fork) goes wrong.  In theory, there's no reason why we couldn't make
> > this almost the very first thing the daemon does so that *all* work is
> > done in the child...
> > 
> > sage
> > 
> 
> I've recently run into issues related to fork as well (see my "memory leaks
> related to CephContext and global_init_daemonize()" email). Trying to manage
> resources across a fork is difficult and error-prone, so changing how we
> daemonize could eliminate a whole class of these bugs. And as Haomai points
> out, we're going to great lengths to make things work in the current model.
> 
> I'm a big fan of Sage's theory that all work could be done in the child
> process, and I'm willing to take on the project if we can reach a consensus on
> the design.
> 
> Whether or not we're interested in long-term support for SysV, the ability to
> daemonize is useful our for development workflow (vstart.sh in particular). To
> fill this role, some basic requirements for the parent process are:
> * don't exit until initialization is finished
> * return an error code if initialization failed
> 
> As Sage pointed out, this is exactly what Preforker is doing for ceph-mon. So
> we can start by changing the other daemons to use that instead of
> global_init_daemonize().
> 
> The next step is to prevent global_init()/common_init()/CephContext from doing
> any work in the parent process (esp. spawning threads for Log,
> CephContextServiceThread, and AdminSocket). Decoupling the config parsing from
> CephContext initialization seems like a natural way to accomplish that. So the
> parent would create and initialize the md_config_t object, then after fork,
> the child would pass that as an argument when creating the CephContext.
> 
> How does that sound for a start?

Sounds good to me!  The last step (decoupling) sounds like it's not 
strictly necessary but is probably worthwhile.  It will get deep into a 
bunch of crufty code that isn't much fun to deal with, so I suspect you'll 
either run away screaming or come up with something pretty satisfying that 
removes a bunch of ugly code.

sage

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-24 21:18       ` Casey Bodley
  2016-08-24 21:32         ` Sage Weil
@ 2016-08-24 23:21         ` Adam C. Emerson
  1 sibling, 0 replies; 12+ messages in thread
From: Adam C. Emerson @ 2016-08-24 23:21 UTC (permalink / raw)
  To: Casey Bodley; +Cc: ceph-devel

On 24/08/2016, Casey Bodley wrote:
[snip]
> The next step is to prevent global_init()/common_init()/CephContext from
> doing any work in the parent process (esp. spawning threads for Log,
> CephContextServiceThread, and AdminSocket). Decoupling the config parsing
> from CephContext initialization seems like a natural way to accomplish that.
> So the parent would create and initialize the md_config_t object, then after
> fork, the child would pass that as an argument when creating the
> CephContext.
> 
> How does that sound for a start?

I'm very much in favor of the decoupling idea. If we did that we could
simply not create the CephContext at all until after the fork.

-- 
Senior Software Engineer           Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-24 21:32         ` Sage Weil
@ 2016-08-25 12:43           ` Casey Bodley
  2016-08-26  3:16             ` Brad Hubbard
  0 siblings, 1 reply; 12+ messages in thread
From: Casey Bodley @ 2016-08-25 12:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 08/24/2016 05:32 PM, Sage Weil wrote:
> On Wed, 24 Aug 2016, Casey Bodley wrote:
>> On 08/12/2016 12:27 PM, Sage Weil wrote:
>>> It seems like it would be simpler to push the fork before any important
>>> operations.  (And BTW with systemd and upstart we don't fork anyway; it's
>>> just there for sysvinit.)  The preforker thing is there to make it easy to
>>> fork early, but keep the parent waiting around so that you can do more
>>> intialization, print errors, and terminate with an error code if something
>>> (post-fork) goes wrong.  In theory, there's no reason why we couldn't make
>>> this almost the very first thing the daemon does so that *all* work is
>>> done in the child...
>>>
>>> sage
>>>
>> I've recently run into issues related to fork as well (see my "memory leaks
>> related to CephContext and global_init_daemonize()" email). Trying to manage
>> resources across a fork is difficult and error-prone, so changing how we
>> daemonize could eliminate a whole class of these bugs. And as Haomai points
>> out, we're going to great lengths to make things work in the current model.
>>
>> I'm a big fan of Sage's theory that all work could be done in the child
>> process, and I'm willing to take on the project if we can reach a consensus on
>> the design.
>>
>> Whether or not we're interested in long-term support for SysV, the ability to
>> daemonize is useful our for development workflow (vstart.sh in particular). To
>> fill this role, some basic requirements for the parent process are:
>> * don't exit until initialization is finished
>> * return an error code if initialization failed
>>
>> As Sage pointed out, this is exactly what Preforker is doing for ceph-mon. So
>> we can start by changing the other daemons to use that instead of
>> global_init_daemonize().
>>
>> The next step is to prevent global_init()/common_init()/CephContext from doing
>> any work in the parent process (esp. spawning threads for Log,
>> CephContextServiceThread, and AdminSocket). Decoupling the config parsing from
>> CephContext initialization seems like a natural way to accomplish that. So the
>> parent would create and initialize the md_config_t object, then after fork,
>> the child would pass that as an argument when creating the CephContext.
>>
>> How does that sound for a start?
> Sounds good to me!  The last step (decoupling) sounds like it's not
> strictly necessary but is probably worthwhile.  It will get deep into a
> bunch of crufty code that isn't much fun to deal with, so I suspect you'll
> either run away screaming or come up with something pretty satisfying that
> removes a bunch of ugly code.
>
> sage

Thanks Sage. I'll have to investigate some more to see how messy it will 
be, but I'm more worried about trying to construct a 'partial' 
CephContext for parsing, and exposing that to the application when most 
of its capabilities (logging, especially) are missing. The parent 
process won't be able to call any code with dout()s, so I think it's 
safest to avoid constructing a CephContext in the first place.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-25 12:43           ` Casey Bodley
@ 2016-08-26  3:16             ` Brad Hubbard
  2016-08-30  0:15               ` Brad Hubbard
  0 siblings, 1 reply; 12+ messages in thread
From: Brad Hubbard @ 2016-08-26  3:16 UTC (permalink / raw)
  To: Casey Bodley; +Cc: Sage Weil, ceph-devel, Adam Emerson



On Thu, Aug 25, 2016 at 10:43 PM, Casey Bodley <cbodley@redhat.com> wrote:
>
>
> On 08/24/2016 05:32 PM, Sage Weil wrote:
>>
>> On Wed, 24 Aug 2016, Casey Bodley wrote:
>>>
>>> On 08/12/2016 12:27 PM, Sage Weil wrote:
>>>>
>>>> It seems like it would be simpler to push the fork before any important
>>>> operations.  (And BTW with systemd and upstart we don't fork anyway;
>>>> it's
>>>> just there for sysvinit.)  The preforker thing is there to make it easy
>>>> to
>>>> fork early, but keep the parent waiting around so that you can do more
>>>> intialization, print errors, and terminate with an error code if
>>>> something
>>>> (post-fork) goes wrong.  In theory, there's no reason why we couldn't
>>>> make
>>>> this almost the very first thing the daemon does so that *all* work is
>>>> done in the child...
>>>>
>>>> sage
>>>>
>>> I've recently run into issues related to fork as well (see my "memory
>>> leaks
>>> related to CephContext and global_init_daemonize()" email). Trying to
>>> manage
>>> resources across a fork is difficult and error-prone, so changing how we
>>> daemonize could eliminate a whole class of these bugs. And as Haomai
>>> points
>>> out, we're going to great lengths to make things work in the current
>>> model.
>>>
>>> I'm a big fan of Sage's theory that all work could be done in the child
>>> process, and I'm willing to take on the project if we can reach a
>>> consensus on
>>> the design.
>>>
>>> Whether or not we're interested in long-term support for SysV, the
>>> ability to
>>> daemonize is useful our for development workflow (vstart.sh in
>>> particular). To
>>> fill this role, some basic requirements for the parent process are:
>>> * don't exit until initialization is finished
>>> * return an error code if initialization failed
>>>
>>> As Sage pointed out, this is exactly what Preforker is doing for
>>> ceph-mon. So
>>> we can start by changing the other daemons to use that instead of
>>> global_init_daemonize().
>>>
>>> The next step is to prevent global_init()/common_init()/CephContext from
>>> doing
>>> any work in the parent process (esp. spawning threads for Log,
>>> CephContextServiceThread, and AdminSocket). Decoupling the config parsing
>>> from
>>> CephContext initialization seems like a natural way to accomplish that.
>>> So the
>>> parent would create and initialize the md_config_t object, then after
>>> fork,
>>> the child would pass that as an argument when creating the CephContext.
>>>
>>> How does that sound for a start?
>>
>> Sounds good to me!  The last step (decoupling) sounds like it's not
>> strictly necessary but is probably worthwhile.  It will get deep into a
>> bunch of crufty code that isn't much fun to deal with, so I suspect you'll
>> either run away screaming or come up with something pretty satisfying that
>> removes a bunch of ugly code.
>>
>> sage
>
>
> Thanks Sage. I'll have to investigate some more to see how messy it will be,
> but I'm more worried about trying to construct a 'partial' CephContext for
> parsing, and exposing that to the application when most of its capabilities
> (logging, especially) are missing. The parent process won't be able to call
> any code with dout()s, so I think it's safest to avoid constructing a
> CephContext in the first place.

I'd like to add some thoughts here if I may?

Recently Kefu and I were discussing uses of fork in the ceph codebase at length
due to the issue seen in http://tracker.ceph.com/issues/16556 Over recent years
supporting multiple products, including glibc, I have seen many problems arising
from what I consider to be misuse of fork.

There is a saying, "The only safe function call to make after fork() is exec()"
and this is reflected in the fork man page.

"After a fork(2) in a multithreaded program, the child can safely call only
async-signal-safe functions (see signal(7)) until such time as it calls
execve(2)."

The list of async-signal-safe functions is not very extensive and does not
include things like malloc and free and I have seen on multiple occasions
deadlocks where the child is trying to acquire a malloc arena lock for example
that was locked when fork was called and hanging forever. Basically, if a lock
is locked when you fork the child will get a copy (COW) of the lock in the
locked state with no possibility of the lock being unlocked since the child is
single threaded. if the child then tries to acquire this lock it will hang
forever. The same can happen with other resources. Note that even if your code
is single threaded some libraries may use threads in the background and this can
also catch you out. Sometimes these bugs can be hideously difficult to pin down
and happen extremely rarely and only on certain machines, etc. due to timing.

These types of problems with fork are well documented.

https://bugs.chromium.org/p/chromium/issues/detail?id=35374#c51
http://www.evanjones.ca/fork-is-dangerous.html
http://thorstenball.com/blog/2014/10/13/why-threads-cant-fork/
https://github.com/gperftools/gperftools/issues/178 (tcmalloc and its use of
locks)

Google shows a preference for clone in this document, posix_spawn is another
alternative.

https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#death-tests-and-threads

There are partial solutions like pthread_atfork but it is not a comprehensive
solution, nor is it without its own problems.

Anyway, I just wanted to get this information out there for discussion so we can
make an educated decision going forward and so that everyone is aware of the
potential issues with uses of fork in ceph.

-- 
Cheers,
Brad

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fork and RDMA operations
  2016-08-26  3:16             ` Brad Hubbard
@ 2016-08-30  0:15               ` Brad Hubbard
  0 siblings, 0 replies; 12+ messages in thread
From: Brad Hubbard @ 2016-08-30  0:15 UTC (permalink / raw)
  To: Casey Bodley; +Cc: Sage Weil, ceph-devel, Adam Emerson

Could boost.process be an alternative here?

On Fri, Aug 26, 2016 at 1:16 PM, Brad Hubbard <bhubbard@redhat.com> wrote:
>
>
> On Thu, Aug 25, 2016 at 10:43 PM, Casey Bodley <cbodley@redhat.com> wrote:
>>
>>
>> On 08/24/2016 05:32 PM, Sage Weil wrote:
>>>
>>> On Wed, 24 Aug 2016, Casey Bodley wrote:
>>>>
>>>> On 08/12/2016 12:27 PM, Sage Weil wrote:
>>>>>
>>>>> It seems like it would be simpler to push the fork before any important
>>>>> operations.  (And BTW with systemd and upstart we don't fork anyway;
>>>>> it's
>>>>> just there for sysvinit.)  The preforker thing is there to make it easy
>>>>> to
>>>>> fork early, but keep the parent waiting around so that you can do more
>>>>> intialization, print errors, and terminate with an error code if
>>>>> something
>>>>> (post-fork) goes wrong.  In theory, there's no reason why we couldn't
>>>>> make
>>>>> this almost the very first thing the daemon does so that *all* work is
>>>>> done in the child...
>>>>>
>>>>> sage
>>>>>
>>>> I've recently run into issues related to fork as well (see my "memory
>>>> leaks
>>>> related to CephContext and global_init_daemonize()" email). Trying to
>>>> manage
>>>> resources across a fork is difficult and error-prone, so changing how we
>>>> daemonize could eliminate a whole class of these bugs. And as Haomai
>>>> points
>>>> out, we're going to great lengths to make things work in the current
>>>> model.
>>>>
>>>> I'm a big fan of Sage's theory that all work could be done in the child
>>>> process, and I'm willing to take on the project if we can reach a
>>>> consensus on
>>>> the design.
>>>>
>>>> Whether or not we're interested in long-term support for SysV, the
>>>> ability to
>>>> daemonize is useful our for development workflow (vstart.sh in
>>>> particular). To
>>>> fill this role, some basic requirements for the parent process are:
>>>> * don't exit until initialization is finished
>>>> * return an error code if initialization failed
>>>>
>>>> As Sage pointed out, this is exactly what Preforker is doing for
>>>> ceph-mon. So
>>>> we can start by changing the other daemons to use that instead of
>>>> global_init_daemonize().
>>>>
>>>> The next step is to prevent global_init()/common_init()/CephContext from
>>>> doing
>>>> any work in the parent process (esp. spawning threads for Log,
>>>> CephContextServiceThread, and AdminSocket). Decoupling the config parsing
>>>> from
>>>> CephContext initialization seems like a natural way to accomplish that.
>>>> So the
>>>> parent would create and initialize the md_config_t object, then after
>>>> fork,
>>>> the child would pass that as an argument when creating the CephContext.
>>>>
>>>> How does that sound for a start?
>>>
>>> Sounds good to me!  The last step (decoupling) sounds like it's not
>>> strictly necessary but is probably worthwhile.  It will get deep into a
>>> bunch of crufty code that isn't much fun to deal with, so I suspect you'll
>>> either run away screaming or come up with something pretty satisfying that
>>> removes a bunch of ugly code.
>>>
>>> sage
>>
>>
>> Thanks Sage. I'll have to investigate some more to see how messy it will be,
>> but I'm more worried about trying to construct a 'partial' CephContext for
>> parsing, and exposing that to the application when most of its capabilities
>> (logging, especially) are missing. The parent process won't be able to call
>> any code with dout()s, so I think it's safest to avoid constructing a
>> CephContext in the first place.
>
> I'd like to add some thoughts here if I may?
>
> Recently Kefu and I were discussing uses of fork in the ceph codebase at length
> due to the issue seen in http://tracker.ceph.com/issues/16556 Over recent years
> supporting multiple products, including glibc, I have seen many problems arising
> from what I consider to be misuse of fork.
>
> There is a saying, "The only safe function call to make after fork() is exec()"
> and this is reflected in the fork man page.
>
> "After a fork(2) in a multithreaded program, the child can safely call only
> async-signal-safe functions (see signal(7)) until such time as it calls
> execve(2)."
>
> The list of async-signal-safe functions is not very extensive and does not
> include things like malloc and free and I have seen on multiple occasions
> deadlocks where the child is trying to acquire a malloc arena lock for example
> that was locked when fork was called and hanging forever. Basically, if a lock
> is locked when you fork the child will get a copy (COW) of the lock in the
> locked state with no possibility of the lock being unlocked since the child is
> single threaded. if the child then tries to acquire this lock it will hang
> forever. The same can happen with other resources. Note that even if your code
> is single threaded some libraries may use threads in the background and this can
> also catch you out. Sometimes these bugs can be hideously difficult to pin down
> and happen extremely rarely and only on certain machines, etc. due to timing.
>
> These types of problems with fork are well documented.
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=35374#c51
> http://www.evanjones.ca/fork-is-dangerous.html
> http://thorstenball.com/blog/2014/10/13/why-threads-cant-fork/
> https://github.com/gperftools/gperftools/issues/178 (tcmalloc and its use of
> locks)
>
> Google shows a preference for clone in this document, posix_spawn is another
> alternative.
>
> https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#death-tests-and-threads
>
> There are partial solutions like pthread_atfork but it is not a comprehensive
> solution, nor is it without its own problems.
>
> Anyway, I just wanted to get this information out there for discussion so we can
> make an educated decision going forward and so that everyone is aware of the
> potential issues with uses of fork in ceph.
>
> --
> Cheers,
> Brad



-- 
Cheers,
Brad

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-08-30  0:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-11 18:51 Fork and RDMA operations Vu Pham
2016-08-12 13:22 ` Sage Weil
2016-08-12 15:22   ` Haomai Wang
2016-08-12 16:27     ` Sage Weil
2016-08-12 17:01       ` Haomai Wang
2016-08-16 17:20         ` Vu Pham
2016-08-24 21:18       ` Casey Bodley
2016-08-24 21:32         ` Sage Weil
2016-08-25 12:43           ` Casey Bodley
2016-08-26  3:16             ` Brad Hubbard
2016-08-30  0:15               ` Brad Hubbard
2016-08-24 23:21         ` Adam C. Emerson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.