linux-kernel.vger.kernel.org archive mirror
* [RFD] reboot / shutdown of a container
@ 2011-01-13 16:34 Daniel Lezcano
  2011-01-13 20:09 ` Bruno Prémont
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Lezcano @ 2011-01-13 16:34 UTC (permalink / raw)
  To: Linux Containers, Linux Kernel Mailing List


Hi all,

in the container implementation, we are facing the problem of a process
calling the sys_reboot syscall, which of course makes the host
power off or reboot.

If we drop the CAP_SYS_BOOT capability, sys_reboot fails and the
container reaches a shutdown state, but the init process stays there;
hence the container becomes stuck, waiting indefinitely for process '1'
to exit.

The current implementation to make the shutdown / reboot of the
container work is that we watch, from a process outside of the
container, the <rootfs>/var/run/utmp file and check the runlevel each
time the file changes. When the 'reboot' or 'shutdown' level is
detected, we wait until a single process remains in the container and
then we kill it.
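
A minimal sketch of the runlevel check described above, not the actual
code of any container tool. It assumes the sysvinit convention that the
RUN_LVL record keeps the current runlevel character in the low byte of
ut_pid; the helper names utmp_runlevel() and is_shutdown_level() are
invented for this example. A real watcher would call utmp_runlevel()
each time a change is reported on <rootfs>/var/run/utmp.

```c
#include <stdio.h>
#include <utmp.h>

/*
 * Hypothetical helper: return the runlevel character recorded in the
 * utmp file at `path`, 0 if no RUN_LVL record is present, or -1 if the
 * file cannot be opened.
 * Assumption: low byte of ut_pid holds the current runlevel (sysvinit).
 */
int utmp_runlevel(const char *path)
{
    struct utmp ut;
    int level = 0;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;

    /* Keep the last RUN_LVL record found in the file. */
    while (fread(&ut, sizeof(ut), 1, f) == 1) {
        if (ut.ut_type == RUN_LVL)
            level = ut.ut_pid % 256;
    }
    fclose(f);
    return level;
}

/* Runlevel '0' (halt) or '6' (reboot) means the container is going down. */
int is_shutdown_level(int level)
{
    return level == '0' || level == '6';
}
```

This also shows the weakness mentioned below: the check only works if
the utmp file is visible from outside the container.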

That works, but it is not efficient with a large number of
containers, as we have to watch a lot of utmp files. In addition,
the /var/run directory must *not* be mounted as tmpfs in the distro.
Unfortunately, that is the default setup on most distros and it is
becoming the norm. That implies the rootfs init scripts must be
modified for the container when we put its rootfs in place and, as
/var/run is supposed to be a tmpfs, most applications do not clean up
the directory, so we need to add extra services to wipe out the files.

More problems arise when we upgrade the distro inside the
container, because all the setup we made at creation time is lost.
The upgrade overwrites the scripts, the fstab and so on.

We did what was possible to solve the problem from userspace, but we
always hit a limit because there are different implementations of the
'init' process, and the init scripts differ from one distro to another
and from one version to the next.

We think this problem can only be solved from the kernel.

The idea was to send a SIGPWR signal to the parent of pid '1' of the
pid namespace when sys_reboot is called. Of course that won't occur
for the init pid namespace.

Does it make sense ?

Any idea is very welcome :)

   -- Daniel





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFD] reboot / shutdown of a container
  2011-01-13 16:34 [RFD] reboot / shutdown of a container Daniel Lezcano
@ 2011-01-13 20:09 ` Bruno Prémont
  2011-01-13 21:32   ` Daniel Lezcano
  0 siblings, 1 reply; 7+ messages in thread
From: Bruno Prémont @ 2011-01-13 20:09 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Linux Containers, Linux Kernel Mailing List

On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> We think this problem can only be solved from the kernel.
> 
> The idea was to send a signal SIGPWR to the parent of the pid '1' of the 
> pid namespace when the sys_reboot is called. Of course that won't occur 
> for the init pid namespace.

Wouldn't sending SIGKILL to the pid '1' process of the originating PID
namespace be sufficient? That would trigger a SIGCHLD for the parent
process in the outer PID namespace.
(As far as I remember, the PID namespace is killed when its 'init' exits;
if this is not the case, all other processes in the given namespace would
have to be killed as well.)

The only issue is how to differentiate the various reboot() modes (restart,
power-off/halt) from outside, though that issue also exists with the SIGPWR
signal.
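
A small sketch of what the outer parent would observe if the
namespace's init were killed with SIGKILL. It uses a plain fork()
rather than a real PID namespace (which needs privileges), and the
helper names spawn_idle_child() and wait_child_signal() are invented
for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the container's init: it just sleeps until killed. */
pid_t spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    return pid;
}

/*
 * Wait for `child` and return the signal that terminated it,
 * 0 if it exited normally, or -1 on error.
 */
int wait_child_signal(pid_t child)
{
    int status;

    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

The parent learns that init died from SIGKILL, but the wait status
alone carries no hint about whether a restart or a halt was requested.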

Bruno



* Re: [RFD] reboot / shutdown of a container
  2011-01-13 20:09 ` Bruno Prémont
@ 2011-01-13 21:32   ` Daniel Lezcano
  2011-01-13 21:50     ` Bruno Prémont
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Lezcano @ 2011-01-13 21:32 UTC (permalink / raw)
  To: Bruno Prémont; +Cc: Linux Containers, Linux Kernel Mailing List

On 01/13/2011 09:09 PM, Bruno Prémont wrote:
> Wouldn't sending SIGKILL to the pid '1' process of the originating PID
> namespace be sufficient (that would trigger a SIGCHLD for the parent
> process in the outer PID namespace.

This is already the case. The question is: when do we send this signal?
We have to wait for the container system shutdown before killing it.

> (as far as I remember the PID namespace is killed when its 'init' exits,
> if this is not the case all other processes in the given namespace would
> have to be killed as well)

Yes, absolutely, but this is not the point; reaping the container is not
a problem.

What we are trying to achieve is to shut down the container properly from
inside (from outside will be possible too with the setns syscall).

Assume the process '1234' creates a new process in a new set of
namespaces and waits for it.

The new process '1' will exec /sbin/init and the system will boot up.
But when the system is shut down or rebooted, after the down scripts are
executed, kill -15 -1 will be invoked, killing all the processes
except process '1' and the caller. The caller will then call
'sys_reboot' and exit. Hence we still have the init process idle and its
parent '1234' waiting for it to die.

If we are able to receive the information in process '1234' that
"sys_reboot was called in the child pid namespace", we can then kill
our child pid. If this information is raised via a signal sent by the
kernel with the proper information in the siginfo_t (e.g. si_code
contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ...), the
solution will be generic for the shutdown/reboot of any kind of
container and init version.
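
A sketch of how process '1234' might read such information, under the
assumption that si_code would carry the reboot command as suggested
above. No kernel reports these values; only the CLD_EXITED, CLD_KILLED
and CLD_DUMPED branches correspond to existing behaviour. The names
classify_si_code() and wait_child_event() are invented, and waitid()
is used here because it fills the same siginfo_t fields a signal
handler would see.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/reboot.h>

enum child_event {
    CHILD_EXITED,
    CHILD_KILLED,
    CHILD_RESTART,   /* hypothetical: reboot requested in the container */
    CHILD_HALT,      /* hypothetical: halt requested in the container */
    CHILD_UNKNOWN
};

/* Translate an si_code value into an event for the container parent. */
enum child_event classify_si_code(int si_code)
{
    if (si_code == CLD_EXITED)
        return CHILD_EXITED;
    if (si_code == CLD_KILLED || si_code == CLD_DUMPED)
        return CHILD_KILLED;
    /* Assumed encoding: si_code holds the raw reboot command. */
    if ((unsigned int)si_code == LINUX_REBOOT_CMD_RESTART)
        return CHILD_RESTART;
    if ((unsigned int)si_code == LINUX_REBOOT_CMD_HALT)
        return CHILD_HALT;
    return CHILD_UNKNOWN;
}

/* Wait for `child` to terminate and classify what happened to it. */
enum child_event wait_child_event(pid_t child)
{
    siginfo_t info;

    if (waitid(P_PID, (id_t)child, &info, WEXITED) < 0)
        return CHILD_UNKNOWN;
    return classify_si_code(info.si_code);
}
```

On CHILD_RESTART the parent would recreate the container; on
CHILD_HALT it would simply clean up.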

> Only issue is how to differentiate the various reboot() modes (restart,
> power-off/halt) from outside, though that one also exists with the SIGPWR
> signal.



* Re: [RFD] reboot / shutdown of a container
  2011-01-13 21:32   ` Daniel Lezcano
@ 2011-01-13 21:50     ` Bruno Prémont
  2011-01-13 22:09       ` Daniel Lezcano
  0 siblings, 1 reply; 7+ messages in thread
From: Bruno Prémont @ 2011-01-13 21:50 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Linux Containers, Linux Kernel Mailing List

On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:

> On 01/13/2011 09:09 PM, Bruno Prémont wrote:
> > Wouldn't sending SIGKILL to the pid '1' process of the originating PID
> > namespace be sufficient (that would trigger a SIGCHLD for the parent
> > process in the outer PID namespace.
> 
> This is already the case. The question is : when do we send this signal ?
> We have to wait for the container system shutdown before killing it.

I meant that sys_reboot() would kill the namespace's init if it's not
called from the boot namespace.

See below.

> > (as far as I remember the PID namespace is killed when its 'init' exits,
> > if this is not the case all other processes in the given namespace would
> > have to be killed as well)
> 
> Yes, absolutely but this is not the point, reaping the container is not 
> a problem.
> 
> What we are trying to achieve is to shutdown properly the container from 
> inside (from outside will be possible too with the setns syscall).
> 
> Assuming the process '1234' creates a new process in a new namespace set 
> and wait for it.
> 
> The new process '1' will exec /sbin/init and the system will boot up. 
> But, when the system is shutdown or rebooted, after the down scripts are 
> executed the kill -15 -1 will be invoked, killing all the processes 
> except the process '1' and the caller. This one will then call
> 'sys_reboot' and exit. Hence we still have the init process idle and its 
> parent '1234' waiting for it to die.

This call to sys_reboot() would kill "new process '1'" instead of trying to
operate on the HW box.
This also has the advantage that a container would not require an informed
parent "monitoring" it from outside (though, without such an informed outside
parent, it would not be restarted even if a restart was requested).

> If we are able to receive the information in the process '1234' : "the 
> sys_reboot was called in the child pid namespace", we can take then kill 
> our child pid.  If this information is raised via a signal sent by the 
> kernel with the proper information in the siginfo_t (eg. si_code 
> contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ... ), the 
> solution will be generic for all the shutdown/reboot of any kind of 
> container and init version.

Could this be passed with a SIGCHLD? (sent when the namespace is reaped, and
received by 1234 from the above example, assuming sys_reboot() kills the
"new process '1'")

Looks like yes, but with the need to define new values for si_code (reusing
LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).

> > Only issue is how to differentiate the various reboot() modes (restart,
> > power-off/halt) from outside, though that one also exists with the SIGPWR
> > signal.

Bruno


* Re: [RFD] reboot / shutdown of a container
  2011-01-13 21:50     ` Bruno Prémont
@ 2011-01-13 22:09       ` Daniel Lezcano
  2011-01-14 23:11         ` Bruno Prémont
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Lezcano @ 2011-01-13 22:09 UTC (permalink / raw)
  To: Bruno Prémont; +Cc: Linux Containers, Linux Kernel Mailing List

On 01/13/2011 10:50 PM, Bruno Prémont wrote:
> This call to sys_reboot() would kill "new process '1'" instead of trying to
> operate on the HW box.
> This also has the advantage that a container would not require an informed
> parent "monitoring" it from outside (though it would not be restarted even if
> requested without such informed outside parent).

Oh, ok. Sorry I misunderstood.

Yes, that could be better than crossing the namespace boundaries.

>> If we are able to receive the information in the process '1234' : "the
>> sys_reboot was called in the child pid namespace", we can take then kill
>> our child pid.  If this information is raised via a signal sent by the
>> kernel with the proper information in the siginfo_t (eg. si_code
>> contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ... ), the
>> solution will be generic for all the shutdown/reboot of any kind of
>> container and init version.
> Could this be passed for a SIGCHLD? (when namespace is reaped, and received
> by 1234 from above example assuming sys_reboot() kills the "new process '1'")

Yes, that sounds like a good idea.

> Looks like yes, but with the need to define new values for si_code (reusing
> LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).

CLD_REBOOT_CMD_RESTART
CLD_REBOOT_CMD_HALT
CLD_REBOOT_CMD_POWER_OFF
CLD_REBOOT_CMD_RESTART2 (what about the cmd buffer, shall we ignore it ?)
CLD_REBOOT_CMD_KEXEC (?)
CLD_REBOOT_CMD_SW_SUSPEND (useful for the future checkpoint/restart)

LINUX_REBOOT_CMD_CAD_ON and LINUX_REBOOT_CMD_CAD_OFF could be disabled 
for a non-init pid namespace, no ?




* Re: [RFD] reboot / shutdown of a container
  2011-01-13 22:09       ` Daniel Lezcano
@ 2011-01-14 23:11         ` Bruno Prémont
  2011-01-15  7:54           ` Daniel Lezcano
  0 siblings, 1 reply; 7+ messages in thread
From: Bruno Prémont @ 2011-01-14 23:11 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Linux Containers, Linux Kernel Mailing List

On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> > Looks like yes, but with the need to define new values for si_code (reusing
> > LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).
> 
> CLD_REBOOT_CMD_RESTART

> CLD_REBOOT_CMD_HALT
> CLD_REBOOT_CMD_POWER_OFF

I would just map both to the same thing...

> CLD_REBOOT_CMD_RESTART2 (what about the cmd buffer, shall we ignore it ?)

The cmd buffer could be passed via si_ptr if we want it, otherwise it would
be the same as for CLD_REBOOT_CMD_RESTART (which would have si_ptr set to NULL
in case no si_code differentiation is needed)

> CLD_REBOOT_CMD_KEXEC (?)

I don't think kexec makes any sense inside a container; such a sys_reboot()
call should probably fail or fall back to _RESTART.

> CLD_REBOOT_CMD_SW_SUSPEND (useful for the future checkpoint/restart)

Looks reasonable

> LINUX_REBOOT_CMD_CAD_ON and LINUX_REBOOT_CMD_CAD_OFF could be disabled 
> for a non-init pid namespace, no ?

I haven't looked at how/when the state set by these is checked, but it could
keep its meaning, and a CAD shortcut would act on the container to which the
active task on the given tty belongs (that is, as if the process which would
have gotten SIGINT had issued sys_reboot(LINUX_REBOOT_CMD_RESTART), permissions
set aside).
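
To summarise the mapping discussed in this message, here is a sketch of
how a reboot command could be translated into the proposed si_code
values. The CLD_REBOOT_* names come from this thread but exist in no
kernel header; their numeric values and the function name
map_reboot_cmd() are made up for the example. The sketch picks the
"fall back to _RESTART" option for kexec and treats the CAD commands as
unsupported, which is only one of the options mentioned.

```c
#include <linux/reboot.h>

/*
 * Hypothetical si_code values. The numbers are arbitrary; they are only
 * chosen to stay clear of the existing CLD_* codes (1..6).
 */
enum cld_reboot {
    CLD_REBOOT_UNSUPPORTED = -1,
    CLD_REBOOT_CMD_RESTART = 16,
    CLD_REBOOT_CMD_HALT,         /* shared by HALT and POWER_OFF */
    CLD_REBOOT_CMD_SW_SUSPEND
};

/* Map a reboot() command issued inside a container to an si_code. */
int map_reboot_cmd(unsigned int cmd)
{
    switch (cmd) {
    case LINUX_REBOOT_CMD_RESTART:
    case LINUX_REBOOT_CMD_RESTART2:  /* cmd buffer would travel via si_ptr */
    case LINUX_REBOOT_CMD_KEXEC:     /* assumed fallback to a plain restart */
        return CLD_REBOOT_CMD_RESTART;
    case LINUX_REBOOT_CMD_HALT:
    case LINUX_REBOOT_CMD_POWER_OFF:
        return CLD_REBOOT_CMD_HALT;
    case LINUX_REBOOT_CMD_SW_SUSPEND:
        return CLD_REBOOT_CMD_SW_SUSPEND;
    default:                         /* CAD_ON, CAD_OFF, unknown commands */
        return CLD_REBOOT_UNSUPPORTED;
    }
}
```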


Bruno


* Re: [RFD] reboot / shutdown of a container
  2011-01-14 23:11         ` Bruno Prémont
@ 2011-01-15  7:54           ` Daniel Lezcano
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Lezcano @ 2011-01-15  7:54 UTC (permalink / raw)
  To: Bruno Prémont; +Cc: Linux Containers, Linux Kernel Mailing List

On 01/15/2011 12:11 AM, Bruno Prémont wrote:
>>> Looks like yes, but with the need to define new values for si_code (reusing
>>> LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).
>> CLD_REBOOT_CMD_RESTART
>> CLD_REBOOT_CMD_HALT
>> CLD_REBOOT_CMD_POWER_OFF
> I would just map both to the same thing...
>
>> CLD_REBOOT_CMD_RESTART2 (what about the cmd buffer, shall we ignore it ?)
> The cmd buffer could be passed via si_ptr if we want it, otherwise it would
> be the same as for CLD_REBOOT_CMD_RESTART (which would have si_ptr set to NULL
> in case no si_code differentiation is needed)
>
>> CLD_REBOOT_CMD_KEXEC (?)
> I don't think kexec makes any sense inside a container, such a sys_reboot()
> call should probably fail or fallback to _RESTART
>
>> CLD_REBOOT_CMD_SW_SUSPEND (useful for the future checkpoint/restart)
> Looks reasonable
>
>> LINUX_REBOOT_CMD_CAD_ON and LINUX_REBOOT_CMD_CAD_OFF could be disabled
>> for a non-init pid namespace, no ?
> I haven't looked at how/when the state set by these is checked, but it could
> keep its meaning and a CAD shortcut would act on the container to which the
> active task on the given tty belongs. (so as if the process which would have
> gotten SIGINT had issued sys_reboot(LINUX_REBOOT_CMD_RESTART), permissions
> set aside)

That makes sense.

Thanks Bruno !

   -- Daniel (cooking a patch ... ;)

