* systemd kills mdmon if it was started manually by user
@ 2010-12-04  8:41 Andrey Borzenkov
  2010-12-04  9:12 ` Christian Parpart
  2011-01-07  0:38 ` Lennart Poettering
  0 siblings, 2 replies; 50+ messages in thread
From: Andrey Borzenkov @ 2010-12-04  8:41 UTC (permalink / raw)
  To: linux-raid, SystemD Devel

If user starts array manually (mdadm -A -s as example) from within
user session and array needs mdmon, mdmon becomes part of user session
control group:

├ user
│ └ root
│   └ 1
│     ├ 1916 login -- root
│     ├ 1930 -bash
│     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
│     └ 2062 mdmon md127


It is then killed by systemd during shutdown as part of user session.
It results in dirty array on next boot.

Is there any magic that allows daemon to be exempted from killing?

TIA

-andrey
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: systemd kills mdmon if it was started manually by user
  2010-12-04  8:41 systemd kills mdmon if it was started manually by user Andrey Borzenkov
@ 2010-12-04  9:12 ` Christian Parpart
  2010-12-04 12:08   ` Andrey Borzenkov
  2011-01-07  0:38 ` Lennart Poettering
  1 sibling, 1 reply; 50+ messages in thread
From: Christian Parpart @ 2010-12-04  9:12 UTC (permalink / raw)
  To: systemd-devel; +Cc: Andrey Borzenkov, linux-raid

On Saturday, December 04, 2010 09:41:26 am Andrey Borzenkov wrote:
> If user starts array manually (mdadm -A -s as example) from within
> user session and array needs mdmon, mdmon becomes part of user session
> control group:
> 
> ├ user
> │ └ root
> │   └ 1
> │     ├ 1916 login -- root
> │     ├ 1930 -bash
> │     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
> │     └ 2062 mdmon md127
> 
> 
> It is then killed by systemd during shutdown as part of user session.
> It results in dirty array on next boot.
> 
> Is there any magic that allows daemon to be exempted from killing?

While your raid should absolutely not be corrupted on next reboot 
when mdmon receives a SIGTERM, I suspect you're using pam_systemd.so 
(/etc/pam.d/system-auth), which automatically creates cgroups by login 
session, which in turn gets killed when the user has "completely logged out".
That is why your mdadm gets terminated, too.
You can avoid that by adding create-session=0 to it, like:

# grep pam_systemd /etc/pam.d/system-auth
session     optional    pam_systemd.so create-session=0

Which is the recommended way if you want processes (created by the "user") to
live on even when this user has fully logged out.

Regards,
Christian Parpart.

p.s.: see pam_systemd(8)
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


* Re: systemd kills mdmon if it was started manually by user
  2010-12-04  9:12 ` Christian Parpart
@ 2010-12-04 12:08   ` Andrey Borzenkov
  2010-12-12 13:20     ` [systemd-devel] " Luca Berra
                       ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Andrey Borzenkov @ 2010-12-04 12:08 UTC (permalink / raw)
  To: Christian Parpart; +Cc: linux-raid, systemd-devel

On Sat, Dec 4, 2010 at 12:12 PM, Christian Parpart <trapni@gentoo.org> wrote:
> On Saturday, December 04, 2010 09:41:26 am Andrey Borzenkov wrote:
>> If user starts array manually (mdadm -A -s as example) from within
>> user session and array needs mdmon, mdmon becomes part of user session
>> control group:
>>
>> ├ user
>> │ └ root
>> │   └ 1
>> │     ├ 1916 login -- root
>> │     ├ 1930 -bash
>> │     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
>> │     └ 2062 mdmon md127
>>
>>
>> It is then killed by systemd during shutdown as part of user session.
>> It results in dirty array on next boot.
>>
>> Is there any magic that allows daemon to be exempted from killing?
>
> While your raid should absolutely not be corrupted on next reboot
> when mdmon receives a SIGTERM,

The array won't be corrupted, but this will initiate a rebuild. I have
reports that such a rebuild may take hours, costing performance and
leaving the array without redundancy in the meantime.

>                                                       I suspect you're using pam_systemd.so

Yes

> (/etc/pam.d/system-auth), which automatically creates cgroups by login
> session, which in turn gets killed when the user has "completely logged out".
> That is why your mdadm gets terminated, too.

Sure.

> You can avoid that by adding create-session=0 to it, like:
>
> # grep pam_systemd /etc/pam.d/system-auth
> session     optional    pam_systemd.so create-session=0
>

But I do want user session to be created; and systemd was specifically
extended to properly terminate user sessions on shutdown. It is just
that mdmon does not belong to user session at all.

> Which is the recommended way if you want processes (created by the "user") to
> live on even when this user has fully logged out.
>

mdmon does not belong to user. User is not even aware that it is
started. And it is likely not the last case. So systemd does need some
framework which can move such processes out of user session. It
probably needs some sd_daemon API to notify systemd that it is system
level task even if it was started as result of user interaction.

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2010-12-04 12:08   ` Andrey Borzenkov
@ 2010-12-12 13:20     ` Luca Berra
  2011-01-07  0:40     ` Lennart Poettering
       [not found]     ` <20101204121413.GC11336@mother.pipebreaker.pl>
  2 siblings, 0 replies; 50+ messages in thread
From: Luca Berra @ 2010-12-12 13:20 UTC (permalink / raw)
  To: linux-raid

On Sat, Dec 04, 2010 at 03:08:05PM +0300, Andrey Borzenkov wrote:
>mdmon does not belong to user. User is not even aware that it is
>started. And it is likely not the last case. So systemd does need some
>framework which can move such processes out of user session. It
>probably needs some sd_daemon API to notify systemd that it is system
>level task even if it was started as result of user interaction.

what about running mdmon --all --takeover outside of user context at
shutdown, it should replace all mdmon processes with new ones that won't
be killed when user sessions are being closed?

Regards,
L.

-- 
Luca Berra -- bluca@comedia.it


* Re: systemd kills mdmon if it was started manually by user
  2010-12-04  8:41 systemd kills mdmon if it was started manually by user Andrey Borzenkov
  2010-12-04  9:12 ` Christian Parpart
@ 2011-01-07  0:38 ` Lennart Poettering
  2011-01-07  1:09   ` [systemd-devel] " Michael Biebl
  2011-01-07  1:16   ` NeilBrown
  1 sibling, 2 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-01-07  0:38 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: linux-raid, SystemD Devel

On Sat, 04.12.10 11:41, Andrey Borzenkov (arvidjaar@gmail.com) wrote:

> If user starts array manually (mdadm -A -s as example) from within
> user session and array needs mdmon, mdmon becomes part of user session
> control group:

Are you suggesting that mdadm forks off mdmon from within the user
session? This is horribly ugly and broken and they shouldn't do that.

> 
> ├ user
> │ └ root
> │   └ 1
> │     ├ 1916 login -- root
> │     ├ 1930 -bash
> │     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
> │     └ 2062 mdmon md127
> 
> 
> It is then killed by systemd during shutdown as part of user session.

Well, only if you enable that the user session is completely killed on
logout, which we currently don't do by default.

I wonder if it would make sense to add an option which kills user
sessions on log out only for uid != 0. This might help here, but only
half-way, since sudo would still break. But anyway, I'll add this to the
todo list.

> It results in dirty array on next boot.

Hmm, that shouldn't happen.

> Is there any magic that allows daemon to be exempted from killing?

Well, I have been discussing this with Kay and we'll most likely add
something like DontKillOnShutdown=yes or so, which if added to a unit
file will exempt it from killing during the normal service shutdown
phase, and the first killing spree (but not the second, post-umount
killing spree). But that of course would require mdmon to be started
like any other daemon, and not forked off mdadm.

That should mostly fix the problem, but then again I do believe that the
whole idea of mdmon is just borked, since it will necessarily pin pages
from the root fs into memory which will create all kinds of problems,
for example after upgrades (i.e. mdmon maps libc into memory, libc gets
updated, the old libc deleted, which cannot be written to disk as long
as mdmon stays running pinning it, which will disallow the ultimate
unmounting/remounting of the fs).

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

* Re: systemd kills mdmon if it was started manually by user
  2010-12-04 12:08   ` Andrey Borzenkov
  2010-12-12 13:20     ` [systemd-devel] " Luca Berra
@ 2011-01-07  0:40     ` Lennart Poettering
       [not found]     ` <20101204121413.GC11336@mother.pipebreaker.pl>
  2 siblings, 0 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-01-07  0:40 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: linux-raid, systemd-devel

On Sat, 04.12.10 15:08, Andrey Borzenkov (arvidjaar@gmail.com) wrote:

> >> It is then killed by systemd during shutdown as part of user session.
> >> It results in dirty array on next boot.
> >>
> >> Is there any magic that allows daemon to be exempted from killing?
> >
> > While your raid should absolutely not be corrupted on next reboot
> > when mdmon receives a SIGTERM,
> 
> The array won't be corrupted, but this will initiate a rebuild. I have
> reports that such a rebuild may take hours, costing performance and
> leaving the array without redundancy in the meantime.

Well, eventually we need to be able to kill mdmon. Otherwise we might
not be able to remount the root dir r/o. How exactly is mdmon supposed
to behave on shutdown?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-01-07  0:38 ` Lennart Poettering
@ 2011-01-07  1:09   ` Michael Biebl
  2011-01-07  1:17     ` Roman Mamedov
  2011-01-07  1:16   ` NeilBrown
  1 sibling, 1 reply; 50+ messages in thread
From: Michael Biebl @ 2011-01-07  1:09 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Andrey Borzenkov, linux-raid, SystemD Devel

2011/1/7 Lennart Poettering <lennart@poettering.net>:
>
> Well, I have been discussing this with Kay and we'll most likely add
> something like DontKillOnShutdown=yes or so, which if added to a unit

Make that KillOnShutdown=no, please.


-- 
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-01-07  0:38 ` Lennart Poettering
  2011-01-07  1:09   ` [systemd-devel] " Michael Biebl
@ 2011-01-07  1:16   ` NeilBrown
  2011-01-07  1:42     ` Lennart Poettering
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2011-01-07  1:16 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Andrey Borzenkov, linux-raid, SystemD Devel

On Fri, 7 Jan 2011 01:38:27 +0100 Lennart Poettering <lennart@poettering.net>
wrote:

> On Sat, 04.12.10 11:41, Andrey Borzenkov (arvidjaar@gmail.com) wrote:
> 
> > If user starts array manually (mdadm -A -s as example) from within
> > user session and array needs mdmon, mdmon becomes part of user session
> > control group:
> 
> Are you suggesting that mdadm forks off mdmon from within the user
> session? This is horribly ugly and broken and they shouldn't do that.

What alternative would you suggest?

A daemon needs to be running while certain md arrays are running and writable.

NeilBrown


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-01-07  1:09   ` [systemd-devel] " Michael Biebl
@ 2011-01-07  1:17     ` Roman Mamedov
  0 siblings, 0 replies; 50+ messages in thread
From: Roman Mamedov @ 2011-01-07  1:17 UTC (permalink / raw)
  To: Michael Biebl
  Cc: Lennart Poettering, Andrey Borzenkov, linux-raid, SystemD Devel


On Fri, 7 Jan 2011 02:09:32 +0100
Michael Biebl <mbiebl@gmail.com> wrote:

> 2011/1/7 Lennart Poettering <lennart@poettering.net>:
> >
> > Well, I have been discussing this with Kay and we'll most likely add
> > something like DontKillOnShutdown=yes or so, which if added to a unit
> 
> Make that KillOnShutdown=no, please.

Agreed :) That reminds me of "hal-disable-polling --enable-polling"
( http://ur1.ca/2rmis )

-- 
With respect,
Roman


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-01-07  1:16   ` NeilBrown
@ 2011-01-07  1:42     ` Lennart Poettering
  0 siblings, 0 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-01-07  1:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrey Borzenkov, linux-raid, SystemD Devel

On Fri, 07.01.11 12:16, NeilBrown (neilb@suse.de) wrote:

> 
> On Fri, 7 Jan 2011 01:38:27 +0100 Lennart Poettering <lennart@poettering.net>
> wrote:
> 
> > On Sat, 04.12.10 11:41, Andrey Borzenkov (arvidjaar@gmail.com) wrote:
> > 
> > > If user starts array manually (mdadm -A -s as example) from within
> > > user session and array needs mdmon, mdmon becomes part of user session
> > > control group:
> > 
> > Are you suggesting that mdadm forks off mdmon from within the user
> > session? This is horribly ugly and broken and they shouldn't do that.
> 
> What alternative would you suggest?

Start it as a normal service like any other. But if you fork off the
daemon from the user session then the daemon will run in a very broken
context: the resource limits of the user apply, the audit trail will
point to the user (i.e. /proc/self/loginuid), the cgroup will be of the
user, the daemon cannot be supervised as every other daemon. Also, the
daemon will inherit all the other process properties from the user,
which is almost definitely wrong. i.e. the env block and so
on, the sig mask. gazillions of small little properties. Of course, a
big bunch of them you can reset in your code, but that's a race you
cannot win: the kernel adds new process properties all the time, and
you'd have to reset them manually.
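
To make the list above concrete, here is a minimal sketch (Python, Linux only) that prints a few of the properties a forked daemon would inherit from its parent. The function name inherited_context and the selection of properties are illustrative assumptions, not anything mdadm or systemd ships; /proc/self/loginuid is the file mentioned above and may be absent on kernels without audit support.

```python
import os
import resource
import signal


def inherited_context():
    """Collect a few per-process properties that a forked child inherits."""

    def read_proc(name):
        # Some files (e.g. loginuid) only exist with the right kernel options.
        try:
            with open(f"/proc/self/{name}") as fh:
                return fh.read().strip()
        except OSError:
            return None

    return {
        "loginuid": read_proc("loginuid"),      # audit trail owner
        "cgroup": read_proc("cgroup"),          # control group membership
        "nofile_limit": resource.getrlimit(resource.RLIMIT_NOFILE),
        # Blocking an empty set changes nothing and returns the current mask.
        "blocked_signals": signal.pthread_sigmask(signal.SIG_BLOCK, []),
        "env_vars": len(os.environ),            # size of the env block
    }


if __name__ == "__main__":
    for key, value in inherited_context().items():
        print(f"{key}: {value}")
```

Running this once from a login shell and once from a process started by the init system should show different cgroup and loginuid values, which is the difference being argued about here.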

It is really essential that daemons are started from a clean process
environment, and are detached from the user session. SysV kinda provides
that, for everything started on boot and in a limited way for stuff
started via /sbin/service. systemd provides that too, and much more
correctly. But just forking things off like that is not a good
solution.

A thinkable, relatively simple solution in a systemd world is to pull in
the mdmon service from the udev device. The udev device would do all the
necessary matching to figure out whether mdmon is needed or not. If you
care about non-systemd environments something like this of course
becomes a lot more complex.

> A daemon needs to be running while certain md arrays are running and writable.

Well, but auto-spawning it from the user session is not really a usable solution.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: systemd kills mdmon if it was started manually by user
       [not found]             ` <20110125042814.GA9727@tango.0pointer.de>
@ 2011-02-04 19:55               ` Andrey Borzenkov
  2011-02-08  9:48                 ` [systemd-devel] " Lennart Poettering
  0 siblings, 1 reply; 50+ messages in thread
From: Andrey Borzenkov @ 2011-02-04 19:55 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: linux-raid, systemd-devel

On Tue, Jan 25, 2011 at 7:28 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Tue, 25.01.11 06:58, Andrey Borzenkov (arvidjaar@mail.ru) wrote:
>
>> > systemd supports instantiated services, for example to deal with the
>> > gettys (e.g. "getty@tty5.service"). It should be trivial to use the same
>> > for mdmon (e.g. "mdmon@md3.service").
>> >
>> That's right, but the names are not known in advance and can change
>> between reboots. This means such units have to be generated
>> dynamically, exist until reboot (ramfs?) and be removed when array is
>> destroyed. Not sure it is really manageable.
>
> Hmm? It should be sufficient to just write the service template properly
> ("mdmon@.service") and then instantiate it when needed with "systemctl
> start mdmon@xyz.service" or something equivalent. it's a matter of
> issuing a single dbus call.
>
>> And which instance should generate them? mdadm?
>
> i think it is much nicer to spawn the necessary mdadm service instance
> from a udev rule,

Yes, this can be done relatively easily; as proof of concept:

SUBSYSTEM!="block", GOTO="systemd_md_end"
ACTION!="change", GOTO="systemd_md_end"
KERNEL!="md*", GOTO="systemd_md_end"
ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
LABEL="systemd_md_end"

where mdmon@.service is


[Unit]
Description=mdmon service
BindTo=dev-%i.device
After=dev-%i.device

[Service]
Type=forking
PIDFile=/dev/.mdadm/%i.pid
ExecStart=/sbin/mdmon --takeover %i


With the result

[root@localhost ~]# systemctl status mdmon@md127.service
mdmon@md127.service - mdmon service
          Loaded: loaded (/etc/systemd/system/mdmon@.service)
          Active: active (running) since Tue, 08 Feb 2011 09:43:30 -0500; 5min ago
         Process: 1467 ExecStart=/sbin/mdmon --takeover %i (code=exited, status=0/SUCCESS)
        Main PID: 1468 (mdmon)
          CGroup: name=systemd:/system/mdmon@.service/md127
                  └ 1468 /sbin/mdmon --takeover md127
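
How the template name and the %i specifier turn into the instance shown above can be sketched in a few lines of Python. This is only an illustration: the helper names instantiate and expand_instance are made up here, and real systemd additionally escapes instance names and supports many more specifiers, none of which is modelled.

```python
def instantiate(template: str, instance: str) -> str:
    """Build an instance unit name from a template such as 'mdmon@.service'."""
    prefix, sep, suffix = template.partition("@")
    if not sep or not suffix.startswith("."):
        raise ValueError(f"not a template unit name: {template!r}")
    return f"{prefix}@{instance}{suffix}"


def expand_instance(line: str, instance: str) -> str:
    """Replace %i in a unit-file line (simplified: no escaping, no other specifiers)."""
    return line.replace("%i", instance)


if __name__ == "__main__":
    print(instantiate("mdmon@.service", "md127"))
    print(expand_instance("ExecStart=/sbin/mdmon --takeover %i", "md127"))
    print(expand_instance("PIDFile=/dev/.mdadm/%i.pid", "md127"))
```

For plain kernel names like md127 no escaping is needed, which is why the udev rule can pass %k straight through.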

Setting SYSTEMD_WANTS would be more elegant solution, but it does not
work with current systemd implementation. It is capable of starting
requested units only on "add" event (effectively the very first time
device becomes plugged), while mdmon must be started on "change"
event, as only then we know whether mdmon is required at all.

Running mdmon via systemd in this way opens up interesting
possibility. E.g. service could be declared "immortal" and be exempt
from usual shutdown sequence ... or is it possible to do already?

Actually it can be implemented even without mdadm patches; apparently
it is possible to suppress normal starting of mdmon by setting
MDADM_NO_MDMON=1

>                           but you could even run it from mdadm by invoking one
> dbus call from it.
>

It may turn out to be necessary still. If container needs mdmon,
arrays it contains won't become read-write until mdmon is started. If
mdmon is started asynchronously by udev, there is window where someone
may try to use array before it is rw. As trivial example, mount unit
which depends on md device unit.

I do not think mdadm maintainer will be happy to add D-Bus dependency
to something that is likely to be included in initrd though :) But
maybe we could simply try execl("/bin/systemctl", "start", ...) before
current execl("/sbin/mdmon",... )?
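
The "try systemctl first, then fall back" idea can be sketched as follows. This is Python rather than the C that mdadm is written in, and it is not mdadm's actual code: the function start_mdmon, its return values and the injected run callable are assumptions made for illustration. Unlike two back-to-back execl() calls, this version also falls back when systemctl runs but exits non-zero. The fallback argument list is supplied by the caller because the arguments of the existing execl() call are elided above.

```python
import subprocess


def start_mdmon(devname, fallback_argv, run=subprocess.call):
    """Ask systemd to start the mdmon instance; otherwise launch it directly.

    Returns "systemd" or "direct" to indicate which path was taken.
    """
    unit = f"mdmon@{devname}.service"
    try:
        if run(["/bin/systemctl", "start", unit]) == 0:
            return "systemd"
    except OSError:
        # systemctl is not installed, e.g. in a minimal initrd.
        pass
    run(fallback_argv)
    return "direct"


if __name__ == "__main__":
    # Dry run with a stand-in runner so nothing is actually executed.
    def show(argv):
        print("would run:", " ".join(argv))
        return 0

    # The fallback argv here is a placeholder, not mdadm's real invocation.
    print(start_mdmon("md127", ["/sbin/mdmon", "md127"], run=show))
```

Injecting the runner keeps the decision logic testable without touching a real array.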

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-02-04 19:55               ` Andrey Borzenkov
@ 2011-02-08  9:48                 ` Lennart Poettering
  2011-02-08 10:52                   ` Andrey Borzenkov
  0 siblings, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-02-08  9:48 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: Tomasz Torcz, systemd-devel, linux-raid

On Fri, 04.02.11 22:55, Andrey Borzenkov (arvidjaar@mail.ru) wrote:

> >> That's right, but the names are not known in advance and can change
> >> between reboots. This means such units have to be generated
> >> dynamically, exist until reboot (ramfs?) and be removed when array is
> >> destroyed. Not sure it is really manageable.
> >
> > Hmm? It should be sufficient to just write the service template properly
> > ("mdmon@.service") and then instantiate it when needed with "systemctl
> > start mdmon@xyz.service" or something equivalent. it's a matter of
> > issuing a single dbus call.
> >
> >> And which instance should generate them? mdadm?
> >
> > i think it is much nicer to spawn the necessary mdadm service instance
> > from a udev rule,
> 
> Yes, this can be done relatively easily; as proof of concept:
> 
> SUBSYSTEM!="block", GOTO="systemd_md_end"
> ACTION!="change", GOTO="systemd_md_end"
> KERNEL!="md*", GOTO="systemd_md_end"
> ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
> LABEL="systemd_md_end"

Nah, it's much better to simply use the SYSTEMD_WANTS var on the device.

Something like this:

...., ENV{SYSTEMD_WANTS}="mdmon@%k.service"

That way the device unit will simply have a wants dep on the service
unit, and this is perfectly discoverable.

> Setting SYSTEMD_WANTS would be more elegant solution, but it does not
> work with current systemd implementation. It is capable of starting
> requested units only on "add" event (effectively the very first time
> device becomes plugged), while mdmon must be started on "change"
> event, as only then we know whether mdmon is required at all.

Oha, so you are actually aware of SYSTEMD_WANTS. Hmm. I need to think
about this. Why does md employ the change event? Is this really
necessary, smells a bit foul.

> Running mdmon via systemd in this way opens up interesting
> possibility. E.g. service could be declared "immortal" and be exempt
> from usual shutdown sequence ... or is it possible to do already?

A service needs to conflict with shutdown.target to be shut down when we
go down normally. If your service does not conflict with shutdown.target
then it will stay around and be killed only after systemd is gone and
PID1 is systemd-shutdown which then kills all processes remaining
(independent of any idea of "service") and then unmounts all file
systems. Normally all services conflict with shutdown.target implicitly,
which you can turn off by setting DefaultDependencies=no.

> Actually it can be implemented even without mdadm patches; apparently
> it is possible to suppress normal starting of mdmon by setting
> MDADM_NO_MDMON=1

At this point mdmon is simply broken: if glibc or mdmon itself (or any
lib it is using) is upgraded, then mdmon will keep referencing the old
.so or binary as long as it is running. This means that the fs these
files are on cannot be remounted r/o. However mdmon insists on being
shutdown only after all fs got remounted ro. So you have a cyclic
ordering loop here: mdmon wants to be shut down after the remount, but
we need to shut it down before the remount. 

This is unfixable unless a) mdmon learns reexecution of itself without
losing state (like most init systems do), or b) mdmon would stop
insisting on being shutdown only after the remount.

In my eyes b) is very much preferable: It should be possible to shut
down mdmon like any other service. And if then some md related code
still needs to be run on late shutdown this should be done from a new
process. I would be willing to add some hooks for this, so that we can
execute arbitrary drop-in processes as part of the final shutdown loop.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: systemd kills mdmon if it was started manually by user
  2011-02-08  9:48                 ` [systemd-devel] " Lennart Poettering
@ 2011-02-08 10:52                   ` Andrey Borzenkov
  2011-02-08 11:07                     ` Lennart Poettering
  0 siblings, 1 reply; 50+ messages in thread
From: Andrey Borzenkov @ 2011-02-08 10:52 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: linux-raid, systemd-devel

On Tue, Feb 8, 2011 at 12:48 PM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Fri, 04.02.11 22:55, Andrey Borzenkov (arvidjaar@mail.ru) wrote:
>
>> >> That's right, but the names are not known in advance and can change
>> >> between reboots. This means such units have to be generated
>> >> dynamically, exist until reboot (ramfs?) and be removed when array is
>> >> destroyed. Not sure it is really manageable.
>> >
>> > Hmm? It should be sufficient to just write the service template properly
>> > ("mdmon@.service") and then instantiate it when needed with "systemctl
>> > start mdmon@xyz.service" or something equivalent. it's a matter of
>> > issuing a single dbus call.
>> >
>> >> And which instance should generate them? mdadm?
>> >
>> > i think it is much nicer to spawn the necessary mdadm service instance
>> > from a udev rule,
>>
>> Yes, this can be done relatively easily; as proof of concept:
>>
>> SUBSYSTEM!="block", GOTO="systemd_md_end"
>> ACTION!="change", GOTO="systemd_md_end"
>> KERNEL!="md*", GOTO="systemd_md_end"
>> ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
>> LABEL="systemd_md_end"
>
> Nah, it's much better to simply use the SYSTEMD_WANTS var on the device.
>
> Something like this:
>
> ...., ENV{SYSTEMD_WANTS}="mdmon@%k.service"
>
> That way the device unit will simply have a wants dep on the service
> unit, and this is perfectly discoverable.
>
>> Setting SYSTEMD_WANTS would be more elegant solution, but it does not
>> work with current systemd implementation. It is capable of starting
>> requested units only on "add" event (effectively the very first time
>> device becomes plugged), while mdmon must be started on "change"
>> event, as only then we know whether mdmon is required at all.
>
> Oha, so you are actually aware of SYSTEMD_WANTS. Hmm. I need to think
> about this. Why does md employ the change event? Is this really
> necessary, smells a bit foul.
>

I am probably the wrong one to ask, but here is what happens when
array is started (from udev perspective)

UDEV  [1297507039.109828] add      /devices/virtual/block/md127 (block)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/virtual/block/md127
SUBSYSTEM=block
DEVNAME=/dev/md127
DEVTYPE=disk
SEQNUM=1742
UDISKS_PRESENTATION_NOPOLICY=1
MAJOR=9
MINOR=127
TAGS=:systemd:

After this event device goes "plugged" and SYSTEMD_WANTS (if any) are
triggered. But at this point we have zero information about array to
decide anything.

UDEV  [1297507039.211940] change   /devices/virtual/block/md127 (block)
UDEV_LOG=3
ACTION=change
DEVPATH=/devices/virtual/block/md127
SUBSYSTEM=block
DEVNAME=/dev/md127
DEVTYPE=disk
SEQNUM=1743
MD_LEVEL=container
MD_DEVICES=2
MD_METADATA=ddf
MD_UUID=f8362f39:0436b20f:cf338104:afec436e
MD_DEVNAME=ddf0
UDISKS_PRESENTATION_NOPOLICY=1
MAJOR=9
MINOR=127
DEVLINKS=/dev/disk/by-id/md-uuid-f8362f39:0436b20f:cf338104:afec436e /dev/md/ddf0
TAGS=:systemd:

At this point we know it is container, know that it has external
metadata and know that we need external metadata handler (mdmon). But
it is too late for systemd.
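
The difference between the two events can be sketched as a small decision function in Python. It is only an illustration of the reasoning above: parse_event and needs_mdmon are hypothetical helpers, and treating MD_LEVEL=container on a "change" event as the signal is an assumption drawn from this one sample (the proof-of-concept rule earlier in the thread matched the sysfs attribute md/metadata_version instead).

```python
def parse_event(text):
    """Turn the KEY=VALUE lines of a udev event dump into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        # The header line ("UDEV  [...] add ...") has no '=' and is skipped.
        if sep and key and " " not in key:
            props[key] = value
    return props


def needs_mdmon(props):
    """Assumed heuristic: only a 'change' event carries enough MD_* data."""
    return props.get("ACTION") == "change" and props.get("MD_LEVEL") == "container"


if __name__ == "__main__":
    add_event = "ACTION=add\nDEVNAME=/dev/md127\nDEVTYPE=disk\n"
    change_event = (
        "ACTION=change\nDEVNAME=/dev/md127\n"
        "MD_LEVEL=container\nMD_METADATA=ddf\n"
    )
    for name, text in (("add", add_event), ("change", change_event)):
        print(name, needs_mdmon(parse_event(text)))
```

On the "add" event there is simply no MD_* property to test, so any rule evaluated at that point cannot tell a container from a plain array.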

>
>> Actually it can be implemented even without mdadm patches; apparently
>> it is possible to suppress normal starting of mdmon by setting
>> MDADM_NO_MDMON=1
>
> At this point mdmon is simply broken: if glibc or mdmon itself (or any
> lib it is using) is upgraded, then mdmon will keep referencing the old
> .so or binary as long as it is running. This means that the fs these
> files are on cannot be remounted r/o. However mdmon insists on being
> shutdown only after all fs got remounted ro. So you have a cyclic
> ordering loop here: mdmon wants to be shut down after the remount, but
> we need to shut it down before the remount.
>

Ehh ...

a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)

b) having binary launched off some fs should not prevent this fs to be
remounted ro - binaries are not opened rw

> This is unfixable unless a) mdmon learns reexecution of itself without
> losing state (like most init systems do), or b) mdmon would stop
> insisting on being shutdown only after the remount.
>

As far as I can tell, both are true today; but remounting is not
enough, unfortunately.

> In my eyes b) is very much preferable: It should be possible to shut
> down mdmon like any other service. And if then some md related code
> still needs to be run on late shutdown this should be done from a new
> process. I would be willing to add some hooks for this, so that we can
> execute arbitrary drop-in processes as part of the final shutdown loop.
>

mdmon is needed to ensure metadata were correctly updated. So it needs
to exist as long as metadata *may* be updated. For practical purposes
it means - until file system is unmounted and flushed to disks. I am
not sure that remounting ro stops all activity (at least, mounting ro
definitely *writes* to device using some filesystems).


* Re: systemd kills mdmon if it was started manually by user
  2011-02-08 10:52                   ` Andrey Borzenkov
@ 2011-02-08 11:07                     ` Lennart Poettering
  2011-02-08 13:54                       ` Andrey Borzenkov
  2011-02-09 14:01                       ` Lennart Poettering
  0 siblings, 2 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-02-08 11:07 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: linux-raid, systemd-devel

On Tue, 08.02.11 13:52, Andrey Borzenkov (arvidjaar@mail.ru) wrote:

> I am probably the wrong one to ask, but here is what happens when
> array is started (from udev perspective)

[...]

> After this event device goes "plugged" and SYSTEMD_WANTS (if any) are
> triggered. But at this point we have zero information about array to
> decide anything.

[...]

> At this point we know it is container, know that it has external
> metadata and know that we need external metadata handler (mdmon). But
> it is too late for systemd.

Kay, do you know why this "change" event is used here? Any chance we can
get rid of it?

> 
> >
> >> Actually it can be implemented even without mdadm patches; apparently
> >> it is possible to suppress normal starting of mdmon by setting
> >> MDADM_NO_MDMON=1
> >
> > At this point mdmon is simply broken: if glibc or mdmon itself (or any
> > lib it is using) is upgraded, then mdmon will keep referencing the old
> > .so or binary as long as it is running. This means that the fs these
> > files are on cannot be remounted r/o. However mdmon insists on being
> > shutdown only after all fs got remounted ro. So you have a cyclic
> > ordering loop here: mdmon wants to be shut down after the remount, but
> > we need to shut it down before the remount.
> >
> 
> Ehh ...
> 
> a) mdmon is perfectly capable of restarting, it is already used to
> take over mdmon launched in initrd. The problem is to know when to
> restart - i.e. when respective libraries are changed. This is a job
> for package management in distribution. It is already employed for
> glibc, systemd and some others and can just as well be employed for
> mdmon. And this is totally unrelated to systemd :)

Really, you are saying there is a synchronous way to make mdmon reexec
itself? How does that work?

> b) having binary launched off some fs should not prevent this fs to be
> remounted ro - binaries are not opened rw

If you run a binary and the package manager then replaces it, the
running instance will still refer to the old copy, which has the
effect that the file isn't actually deleted until the process
exits/execs. And because that is the way it is, the kernel will refuse
unmounting of the fs until you have terminated/reexeced your process.
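
One way to see which files a running process still pins after an upgrade is to look for mappings that the kernel marks "(deleted)" in /proc/<pid>/maps. The following is only a minimal Python sketch of that check, assuming the usual six-column maps layout; the helper name deleted_mappings and the sample paths are made up for illustration:

```python
def deleted_mappings(maps_text):
    """Return paths from a /proc/<pid>/maps dump that are marked '(deleted)'.

    Assumes the usual six-column layout:
    address perms offset dev inode pathname
    """
    marker = " (deleted)"
    paths = []
    for line in maps_text.splitlines():
        fields = line.rstrip().split(None, 5)
        # Anonymous mappings have no pathname column and are skipped.
        if len(fields) == 6 and fields[5].endswith(marker):
            path = fields[5][: -len(marker)]
            if path not in paths:
                paths.append(path)
    return paths


# Illustrative input only; a real dump would come from /proc/<pid>/maps.
SAMPLE = (
    "00400000-0040b000 r-xp 00000000 08:01 42 /sbin/mdmon\n"
    "7f0000000000-7f0000100000 r-xp 00000000 08:01 131 /lib64/libc-2.13.so (deleted)\n"
    "7f0000100000-7f0000200000 rw-p 00000000 00:00 0\n"
)
print(deleted_mappings(SAMPLE))
```

Any path reported this way belongs to a file that has been unlinked but is still held open by the process, which is the situation described above.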

> > This is unfixable unless a) mdmon learns reexecution of itself without
> > losing state (like most init systems do), or b) mdmon would stop
> > insisting on being shutdown only after the remount.
> 
> As far as I can tell, both is true today; but remounting is not
> enough, unfortunately.

So, you are saying we can shut down mdmon without ill effects early?

> > In my eyes b) is very much preferable: It should be possible to shut
> > down mdmon like any other service. And if then some md related code
> > still needs to be run on late shutdown this should be done from a new
> > process. I would be willing to add some hooks for this, so that we can
> > execute arbitrary drop-in processes as part of the final shutdown loop.
> 
> mdmon is needed to ensure metadata were correctly updated. So it needs
> to exist as long as metadata *may* be updated. For practical purposes
> it means - until file system is unmounted and flushed to disks. I am
> not sure that remounting ro stops all activity (at least, mounting ro
> definitely *writes* to device using some filesystems).

Well, the root file systems cannot be unmounted, only remounted.

So, if there is a way to invoke mdmon so that it flushes all metadata
changes to disk and then immediately terminates, this should be all we
need for a clean solution. We'd then shut down the normal instances of
mdmon like any other daemon and simply invoke this metadata
flushing command as part of late shutdown.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: systemd kills mdmon if it was started manually by user
  2011-02-08 11:07                     ` Lennart Poettering
@ 2011-02-08 13:54                       ` Andrey Borzenkov
  2011-02-08 17:28                         ` [systemd-devel] " Lennart Poettering
  2011-02-09 14:01                       ` Lennart Poettering
  1 sibling, 1 reply; 50+ messages in thread
From: Andrey Borzenkov @ 2011-02-08 13:54 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: linux-raid, systemd-devel

On Tue, Feb 8, 2011 at 2:07 PM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Tue, 08.02.11 13:52, Andrey Borzenkov (arvidjaar@mail.ru) wrote:
>
>> I am probably the wrong one to ask, but here is what happens when
>> array is started (from udev perspective)
>
> [...]
>
>> After this event device goes "plugged" and SYSTEMD_WANTS (if any) are
>> triggered. But at this point we have zero information about array to
>> decide anything.
>
> [...]
>
>> At this point we know it is container, know that it has external
>> metadata and know that we need external metadata handler (mdmon). But
>> it is too late for systemd.
>
> Kay, do you know why this "change" event is used here? Any chance we can
> get rid of it?
>
>>
>> >
>> >> Actually it can be implemented even without mdadm patches; apparently
>> >> it is possible to suppress normal starting of mdmon by setting
>> >> MDADM_NO_MDMON=1
>> >
>> > At this point mdmon is simply broken: if glibc or mdmon itself (or any
>> > lib it is using) is upgraded, then mdmon will keep referencing the old
>> > .so or binary as long as it is running. This means that the fs these
>> > files are on cannot be remounted r/o. However mdmon insists on being
>> > shutdown only after all fs got remounted ro. So you have a cyclic
>> > ordering loop here: mdmon wants to be shut down after the remount, but
>> > we need to shut it down before the remount.
>> >
>>
>> Ehh ...
>>
>> a) mdmon is perfectly capable of restarting, it is already used to
>> take over mdmon launched in initrd. The problem is to know when to
>> restart - i.e. when respective libraries are changed. This is a job
>> for package management in distribution. It is already employed for
>> glibc, systemd and some others and can just as well be employed for
>> mdmon. And this is totally unrelated to systemd :)
>
> Really, you are saying there is a synchronous way to make mdmon reexec
> itself? How does that work?
>

I am not sure whether it qualifies as synchronous, but "mdmon
--takeover" will kill any existing mdmon for this and start monitoring
itself.

>> b) having binary launched off some fs should not prevent this fs to be
>> remounted ro - binaries are not opened rw
>
> If you run a binary and then the package manager replaces it then the
> running instance will still refer to the old copy and this will have the
> effect that the file isn't actually deleted until the process
> exits/execs. And because that is the way it is the kernel will refuse
> unmounting of the fs until you terminated/reexeced your process.
>
>> > This is unfixable unless a) mdmon learns reexecution of itself without
>> > losing state (like most init systems do), or b) mdmon would stop
>> > insisting on being shutdown only after the remount.
>>
>> As far as I can tell, both is true today; but remounting is not
>> enough, unfortunately.
>
> So, you are saying we can shut down mdmon without ill effects early?
>

At least that's what I see. You can shutdown mdmon and continue to
work with file system, even if it is mounted rw. Under some conditions
mount will hang; i.e.

start array
kill mdmon
try to mount

mount will hang. If you start mdmon, it is mounted. But if you now

umount
kill mdmon
mount

it is mounted just fine.

>> > In my eyes b) is very much preferable: It should be possible to shut
>> > down mdmon like any other service. And if then some md related code
>> > still needs to be run on late shutdown this should be done from a new
>> > process. I would be willing to add some hooks for this, so that we can
>> > execute arbitrary drop-in processes as part of the final shutdown loop.
>>
>> mdmon is needed to ensure metadata were correctly updated. So it needs
>> to exist as long as metadata *may* be updated. For practical purposes
>> it means - until file system is unmounted and flushed to disks. I am
>> not sure that remounting ro stops all activity (at least, mounting ro
>> definitely *writes* to device using some filesystems).
>
> Well, the root file systems cannot be unmounted, only remounted.
>
> So, if there is a way to invoke mdmon so that it flushes all metadata
> changes to disk and then immediately terminates, this should be all we
> need for a clean solution. We'd then shut down the normal instances of
> mdmon like any other daemon and simply invoke this metadata
> flushing command as part of late shutdown.


Hmm ... it looks like you just need to

start mdmon
do mdadm --wait-clean

After this you can kill mdmon again (assuming the device is no longer in use).
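
The two steps above can be written down as a small command plan. This is only a sketch in Python: the container name md127 is the one from the first message in this thread, "mdadm --wait-clean --scan" is one way to wait for all arrays, and the helper names flush_metadata_plan and run_plan are made up. It assumes mdmon detaches into the background when started this way, and it leaves killing mdmon afterwards to the caller.

```python
import subprocess


def flush_metadata_plan(container):
    """Return the command lines for the steps above; nothing is executed."""
    return [
        # Start a monitor for the container (assumed to background itself).
        ["mdmon", container],
        # Ask mdadm to wait until the arrays are marked clean.
        ["mdadm", "--wait-clean", "--scan"],
    ]


def run_plan(plan, runner=subprocess.check_call):
    """Run each command in order; 'runner' can be replaced for a dry run."""
    for argv in plan:
        runner(argv)


# Dry run: print the commands instead of executing them.
run_plan(flush_metadata_plan("md127"), runner=lambda argv: print(" ".join(argv)))
```

Passing a different runner keeps the sketch usable on a machine without any md arrays.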

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-02-08 13:54                       ` Andrey Borzenkov
@ 2011-02-08 17:28                         ` Lennart Poettering
  2011-10-23  8:00                           ` Dan Williams
  0 siblings, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-02-08 17:28 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: Tomasz Torcz, systemd-devel, linux-raid

On Tue, 08.02.11 16:54, Andrey Borzenkov (arvidjaar@mail.ru) wrote:

> >> a) mdmon is perfectly capable of restarting, it is already used to
> >> take over mdmon launched in initrd. The problem is to know when to
> >> restart - i.e. when respective libraries are changed. This is a job
> >> for package management in distribution. It is already employed for
> >> glibc, systemd and some others and can just as well be employed for
> >> mdmon. And this is totally unrelated to systemd :)
> >
> > Really, you are saying there is a synchronous way to make mdmon reexec
> > itself? How does that work?
> >
> 
> I am not sure whether it qualifies as synchronous, but "mdmon
> --takeover" will kill any existing mdmon for this and start monitoring
> itself.

I wonder if this is really fully synchronous, i.e. that a) there is no
point in time where mdmon is not running during this restart and b) the
mdmon --takeover command returns when the new daemon is fully up, and
not right away.

> > Well, the root file systems cannot be unmounted, only remounted.
> >
> > So, if there is a way to invoke mdmon so that it flushes all metadata
> > changes to disk and then immediately terminates, this should be all we
> > need for a clean solution. We'd then shut down the normal instances of
> > mdmon like any other daemon and simply invoke this metadata
> > flushing command as part of late shutdown.
> 
> 
> Hmm ... it looks like you just need to
> 
> start mdmon
> do mdadm --wait-clean
> 
> After this you can kill mdmon again (assuming the device is no longer in
> use).


Well, it would be nice if the md utils would offer something doing this
without spawning multiple processes and killing them again. 

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-02-08 11:07                     ` Lennart Poettering
  2011-02-08 13:54                       ` Andrey Borzenkov
@ 2011-02-09 14:01                       ` Lennart Poettering
  1 sibling, 0 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-02-09 14:01 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: Tomasz Torcz, systemd-devel, linux-raid

On Tue, 08.02.11 12:07, Lennart Poettering (lennart@poettering.net) wrote:

> > At this point we know it is container, know that it has external
> > metadata and know that we need external metadata handler (mdmon). But
> > it is too late for systemd.
> 
> Kay, do you know why this "change" event is used here? Any chance we can
> get rid of it?

So, it seems that the "change" event does make some sense here. I have
now added a new property to systemd: if you set SYSTEMD_READY=0 on a
udev device then systemd will consider it unplugged even if it shows up
in the udev tree. If this property is not set for a device, or is set to
1, we will consider the device plugged.

To make this md stuff compatible with systemd we hence just need to set
SYSTEMD_READY=0 during the "new" event and drop it when the device is
fully set up. 

Andrey, since you are playing around with this, do you happen to know
which attribute we should check to set SYSTEMD_READY=0 properly? It
would be cool if we could come up with a default rule for inclusion in
our systemd rules file that will ensure the device only shows up when it
is ready.
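
The rule described above (a device counts as unplugged only when SYSTEMD_READY=0 is set, and as plugged when the property is missing or set to 1) can be modelled in a few lines. This is a Python sketch of the decision only, not systemd's code; the function name is_plugged and the dictionary representation of udev properties are assumptions, as is the treatment of any value other than "0" as plugged.

```python
def is_plugged(udev_properties):
    """Model of the SYSTEMD_READY rule described in this message.

    udev_properties: dict of property name -> string value for one device.
    Only an explicit "0" marks the device as unplugged (an assumption for
    values other than "0" and "1", which the message does not cover).
    """
    return udev_properties.get("SYSTEMD_READY", "1") != "0"


# Property absent: the device is treated as plugged.
print(is_plugged({"DEVNAME": "/dev/md127"}))
# Set to 0 during the first event: the device is treated as unplugged.
print(is_plugged({"DEVNAME": "/dev/md127", "SYSTEMD_READY": "0"}))
```

For md, the idea is that the property would be 0 on the first event and dropped (or set to 1) once the array is fully set up.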

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-02-08 17:28                         ` [systemd-devel] " Lennart Poettering
@ 2011-10-23  8:00                           ` Dan Williams
  2011-10-24  8:04                             ` Thomas Jarosch
                                               ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Dan Williams @ 2011-10-23  8:00 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid, NeilBrown

On Tue, Feb 8, 2011 at 9:28 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Tue, 08.02.11 16:54, Andrey Borzenkov (arvidjaar@mail.ru) wrote:
>
>> >> a) mdmon is perfectly capable of restarting, it is already used to
>> >> take over mdmon launched in initrd. The problem is to know when to
>> >> restart - i.e. when respective libraries are changed. This is a job
>> >> for package management in distribution. It is already employed for
>> >> glibc, systemd and some others and can just as well be employed for
>> >> mdmon. And this is totally unrelated to systemd :)
>> >
>> > Really, you are saying there is a synchronous way to make mdmon reexec
>> > itself? How does that work?
>> >
>>
>> I am not sure whether it qualifies as synchronous, but "mdmon
>> --takeover" will kill any existing mdmon for this and start monitoring
>> itself.
>
> I wonder if this is really fully synchronous, i.e. that a) there is no
> point in time where mdmon is not running during this restart and b) the
> mdmon --takeover command returns when the new daemon is fully up, and
> not right away.
>
>> > Well, the root file systems cannot be unmounted, only remounted.
>> >
>> > So, if there is a way to invoke mdmon so that it flushes all metadata
>> > changes to disk and then immediately terminates, this should be all we
>> > need for a clean solution. We'd then shut down the normal instances of
>> > mdmon like any other daemon and simply invoke this metadata
>> > flushing command as part of late shutdown.
>>
>>
>> Hmm ... it looks like you just need to
>>
>> start mdmon
>> do mdadm --wait-clean
>>
>> After this you can kill mdmon again (assuming the device is no longer in
>> use).
>
>
> Well, it would be nice if the md utils would offer something doing this
> without spawning multiple processes and killing them again.
>

/me wonders why his raid5 resyncs every boot on Fedora 15 and has
found this old thread.

I'm tempted to:

1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)
2/ arrange for mdadm --wait-clean --scan to be called after all
filesystems have been mounted read-only

...but a few things strike me.  This does not seem to be what was
being proposed above.  Systemd does not treat dm devices like a
service and takes care to shut them down explicitly (but in that case
there is an api that it can call).  Is it time for a libmd.so,  so
systemd can invoke the "--wait-clean --scan" process itself?  Probably
simpler to just SIGTERM mdmon and wait for it.

--
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-10-23  8:00                           ` Dan Williams
@ 2011-10-24  8:04                             ` Thomas Jarosch
  2011-10-25  1:40                             ` NeilBrown
  2011-10-31 11:06                             ` Lennart Poettering
  2 siblings, 0 replies; 50+ messages in thread
From: Thomas Jarosch @ 2011-10-24  8:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Lennart Poettering, Andrey Borzenkov, Tomasz Torcz,
	systemd-devel, linux-raid, NeilBrown

On Sunday, 23. October 2011 10:00:36 Dan Williams wrote:
> Is it time for a libmd.so, so systemd can invoke the "--wait-clean --scan"
> process itself?  Probably simpler to just SIGTERM mdmon and wait for it.

The mdadm code makes good use of non-reentrant functions like ctime(), 
readdir() and others. Luckily systemd is single threaded.

If we provide a "public" interface, that would need fixing though.

Cheers,
Thomas

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-10-23  8:00                           ` Dan Williams
  2011-10-24  8:04                             ` Thomas Jarosch
@ 2011-10-25  1:40                             ` NeilBrown
  2011-10-31 11:06                             ` Lennart Poettering
  2 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2011-10-25  1:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Lennart Poettering, Andrey Borzenkov, Tomasz Torcz,
	systemd-devel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3919 bytes --]

On Sun, 23 Oct 2011 01:00:36 -0700 Dan Williams <dan.j.williams@intel.com>
wrote:

> On Tue, Feb 8, 2011 at 9:28 AM, Lennart Poettering
> <lennart@poettering.net> wrote:
> > On Tue, 08.02.11 16:54, Andrey Borzenkov (arvidjaar@mail.ru) wrote:
> >
> >> >> a) mdmon is perfectly capable of restarting, it is already used to
> >> >> take over mdmon launched in initrd. The problem is to know when to
> >> >> restart - i.e. when respective libraries are changed. This is a job
> >> >> for package management in distribution. It is already employed for
> >> >> glibc, systemd and some others and can just as well be employed for
> >> >> mdmon. And this is totally unrelated to systemd :)
> >> >
> >> > Really, you are saying there is a synchronous way to make mdmon reexec
> >> > itself? How does that work?
> >> >
> >>
> >> I am not sure whether it qualifies as synchronous, but "mdmon
> >> --takeover" will kill any existing mdmon for this and start monitoring
> >> itself.
> >
> > I wonder if this is really fully synchronous, i.e. that a) there is no
> > point in time where mdmon is not running during this restart and b) the
> > mdmon --takeover command returns when the new daemon is fully up, and
> > not right away.
> >
> >> > Well, the root file systems cannot be unmounted, only remounted.
> >> >
> >> > So, if there is a way to invoke mdmon so that it flushes all metadata
> >> > changes to disk and then immediately terminates, this should be all we
> >> > need for a clean solution. We'd then shut down the normal instances of
> >> > mdmon like any other daemon and simply invoke this metadata
> >> > flushing command as part of late shutdown.
> >>
> >>
> >> Hmm ... it looks like you just need to
> >>
> >> start mdmon
> >> do mdadm --wait-clean
> >>
> >> After this you can kill mdmon again (assuming the device is no longer in
> >> use).
> >
> >
> > Well, it would be nice if the md utils would offer something doing this
> > without spawning multiple processes and killing them again.
> >
> 
> /me wonders why his raid5 resyncs every boot on Fedora 15 and has
> found this old thread.
> 
> I'm tempted to:
> 
> 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)
> 2/ arrange for mdadm --wait-clean --scan to be called after all
> filesystems have been mounted read-only
> 
> ...but a few things strike me.  This does not seem to be what was
> being proposed above.  Systemd does not treat dm devices like a
> service and takes care to shut them down explicitly (but in that case
> there is an api that it can call).  Is it time for a libmd.so,  so
> systemd can invoke the "--wait-clean --scan" process itself?  Probably
> simpler to just SIGTERM mdmon and wait for it.
> 
> --
> Dan

Hi Dan,
  could you please explain in a bit more detail exactly what you think it is
  that is going wrong for you?

  I don't think it is anything like the original problem, as I don't think
  you are starting the array manually.

  I think your problem is that 'mdmon' is being killed too early at shutdown.
  Clearly we need to get whatever-kills-user-processes to skip mdmon - maybe by
  writing the pid to some magic file that 'ignore_proc' already knows about?

  Ultimately we probably want to get udev to start mdmon for us and have
  mdadm notice and not start it itself.
  We also need to get udev to notice arrays that are being reshaped and to
  start the mdadm which monitors the reshape so that mdadm doesn't have to
  fork it itself.

  That should fix the original problem, but I don't think it addresses your
  problem at all.

  I don't have a Fedora install so I cannot hunt around to see what is
  happening.

  I don't like the idea for a 'libmd.so' at all - certainly not until the
  problem is properly understood and other solutions (like running
  scripts) prove ineffective.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: systemd kills mdmon if it was started manually by user
  2011-10-23  8:00                           ` Dan Williams
  2011-10-24  8:04                             ` Thomas Jarosch
  2011-10-25  1:40                             ` NeilBrown
@ 2011-10-31 11:06                             ` Lennart Poettering
  2011-10-31 11:15                               ` [systemd-devel] " Lennart Poettering
  2011-11-02  0:44                               ` NeilBrown
  2 siblings, 2 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-10-31 11:06 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, NeilBrown, Andrey Borzenkov, systemd-devel

On Sun, 23.10.11 01:00, Dan Williams (dan.j.williams@intel.com) wrote:

> > Well, it would be nice if the md utils would offer something doing this
> > without spawning multiple processes and killing them again.
> >
> 
> /me wonders why his raid5 resyncs every boot on Fedora 15 and has
> found this old thread.
> 
> I'm tempted to:
> 
> 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on
> Fedora)

This will not help you.

We nowadays jump back into the initrd when we shut down, so that the
initrd disassembles everything it assembled at boot time. This for the
first time enables us to ensure that all layers of our stack are in a
sane state (i.e. fully offline) when we shut down, regardless in which
way you stack it.

However, just excluding mdmon from being killed will not make this work,
simply because jumping into initrd only works sensibly if we can drop
all references to all previous mounts which requires us to have only one
process running at that time, and one process only.

It always boils down to the same thing: mdmon must be something we can
shutdown cleanly like every other process. Excluding it from that will
just move the problem around, but not fix it.

> 2/ arrange for mdadm --wait-clean --scan to be called after all
> filesystems have been mounted read-only

Won't help you really either, since we have to kill all processes before
we jump into the initrd to fully disassemble mounts and storage.

There'll always be this chicken and egg problem: we cannot disassemble
all storage until all processes are gone and we are back in the
initrd. But mdmon wants to stay running after we 

> ...but a few things strike me.  This does not seem to be what was
> being proposed above.  Systemd does not treat dm devices like a
> service and takes care to shut them down explicitly (but in that case
> there is an api that it can call).  Is it time for a libmd.so,  so
> systemd can invoke the "--wait-clean --scan" process itself?  Probably
> simpler to just SIGTERM mdmon and wait for it.

We actually try to disassemble md already, i.e. we call the
DM_DEV_REMOVE ioctl for all left-over devices. I am not really
interested to link against libdm itself.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-10-31 11:06                             ` Lennart Poettering
@ 2011-10-31 11:15                               ` Lennart Poettering
  2011-11-02  0:44                               ` NeilBrown
  1 sibling, 0 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-10-31 11:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid, NeilBrown

On Mon, 31.10.11 12:06, Lennart Poettering (lennart@poettering.net) wrote:

> We actually try to disassemble md already, i.e. we call the
> DM_DEV_REMOVE ioctl for all left-over devices. I am not really
> interested to link against libdm itself.

Sorry, wasn't fully woken up yet and mixed up dm and md here. Ignore
this sentence...

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-10-31 11:06                             ` Lennart Poettering
  2011-10-31 11:15                               ` [systemd-devel] " Lennart Poettering
@ 2011-11-02  0:44                               ` NeilBrown
  2011-11-02  1:16                                 ` Lennart Poettering
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2011-11-02  0:44 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Dan Williams, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3508 bytes --]

On Mon, 31 Oct 2011 12:06:13 +0100 Lennart Poettering
<lennart@poettering.net> wrote:

> On Sun, 23.10.11 01:00, Dan Williams (dan.j.williams@intel.com) wrote:
> 
> > > Well, it would be nice if the md utils would offer something doing this
> > > without spawning multiple processes and killing them again.
> > >
> > 
> > /me wonders why his raid5 resyncs every boot on Fedora 15 and has
> > found this old thread.
> > 
> > I'm tempted to:
> > 
> > 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on
> > Fedora)
> 
> This will not help you.
> 
> We nowadays jump back into the initrd when we shut down, so that the
> initrd disassembles everything it assembled at boot time. This for the
> first time enables us to ensure that all layers of our stack are in a
> sane state (i.e. fully offline) when we shut down, regardless in which
> way you stack it.

This sounds particularly elegant.
Is there some part of the filesystem, that survives through the whole process
- from before / is mounted until after it is unmounted?
Presumably this would be /run if anything.

mdmon must be running from the time that / becomes writable until after it
becomes readonly.
If we can have it from before it is mounted until after it is unmounted, that
might be even better.
(It is possible to start a new one which replaces the old one but if that was
only used for version upgrades, that would be best).

So if mdmon has a 'cwd' and all open files in /run (and the executable
elsewhere in the same filesystem), could it easily survive the 'kill all
processes before unmounting /' thing?

> 
> However, just excluding mdmon from being killed will not make this work,
> simply because jumping into initrd only works sensibly if we can drop
> all references to all previous mounts which requires us to have only one
> process running at that time, and one process only.
> 
> It always boils down to the same thing: mdmon must be something we can
> shutdown cleanly like every other process. Excluding it from that will
> just move the problem around, but not fix it.

My ideal would be that you just ignore mdmon.
After unmounting '/', you shutdown md arrays with "mdadm -Ss" and then mdmon
will spontaneously disappear.


> 
> > 2/ arrange for mdadm --wait-clean --scan to be called after all
> > filesystems have been mounted read-only
> 
> Won't help you really either, since we have to kill all processes before
> we jump into the initrd to fully disassemble mounts and storage.
> 
> There'll always be this chicken and egg problem: we cannot disassemble
> all storage until all processes are gone and we are back in the
> initrd. But mdmon wants to stay running after we 
> 
> > ...but a few things strike me.  This does not seem to be what was
> > being proposed above.  Systemd does not treat dm devices like a
> > service and takes care to shut them down explicitly (but in that case
> > there is an api that it can call).  Is it time for a libmd.so,  so
> > systemd can invoke the "--wait-clean --scan" process itself?  Probably
> > simpler to just SIGTERM mdmon and wait for it.
> 
> We actually try to disassemble md already, i.e. we call the
> DM_DEV_REMOVE ioctl for all left-over devices. I am not really
> interested to link against libdm itself.

:-)
I get used to this .. people confusing md and dm, people confusing nfs-client
with nfs-server, people confusing me with some other Mr Brown :-)

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02  0:44                               ` NeilBrown
@ 2011-11-02  1:16                                 ` Lennart Poettering
  2011-11-02  2:03                                   ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-11-02  1:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Williams, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Wed, 02.11.11 11:44, NeilBrown (neilb@suse.de) wrote:

> > We nowadays jump back into the initrd when we shut down, so that the
> > initrd disassembles everything it assembled at boot time. This for the
> > first time enables us to ensure that all layers of our stack are in a
> > sane state (i.e. fully offline) when we shut down, regardless in which
> > way you stack it.
> 
> This sounds particularly elegant.
> Is there some part of the filesystem, that survives through the whole process
> - from before / is mounted until after it is unmounted?
>
> Presumably this would be /run if anything.

Yes. /run is usually mounted by the initrd these days, and the initrd
itself places its binaries in /run/initramfs/ which systemd then
pivot_root()s into at shutdown.

> mdmon must be running from the time that / becomes writable until after it
> becomes readonly.

I'd really prefer if we could somehow make it something that isn't
special and that we could just shut down.

> If we can have it from before it is mounted until after it is unmounted, that
> might be even better.

Well, that could work if mdmon is invoked in the initrd only. If mdmon
is always running from the initrd this would solve the issue that it
keeps files on the real root referenced thus making unmounting of /
impossible.

However, there might be complexities here: what happens if the user
creates an MD device during normal operation, so that mdmon is started
at runtime, and not from the initrd?

That said I definitely prefer that if mdmon really wants to avoid
systemd and live independent of it that it does so by being invoked from
the initrd, so that it runs completely independently from all systemd
bookkeeping.

If this is what you want, then we could come up with a simple scheme
like "a process owned by root who has +t set on /proc/$PID/stat" is
excluded from systemd's killing.
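
As a rough illustration of what such a check might look like, here is a Python sketch. It is only a model of the scheme proposed above, not anything systemd implements; the function names is_exempt and exempt_pids are made up, and whether a process can actually set the sticky bit on its own /proc/$PID/stat is not established in this thread.

```python
import os
import stat


def is_exempt(st):
    """True if a stat result is owned by root and has the sticky bit (+t)."""
    return st.st_uid == 0 and bool(st.st_mode & stat.S_ISVTX)


def exempt_pids(proc_root="/proc"):
    """Return PIDs whose <proc_root>/<pid>/stat passes is_exempt()."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            st = os.stat(os.path.join(proc_root, name, "stat"))
        except OSError:
            continue  # process exited or is not accessible
        if is_exempt(st):
            pids.append(int(name))
    return sorted(pids)
```

A shutdown loop following this scheme would skip every PID returned by exempt_pids() and kill the rest.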

But again, I really think that mdmon should just be fixed to become a
daemon that can be shut down at any time.

> (It is possible to start a new one which replaces the old one but if that was
> only used for version upgrades, that would be best).

If you do upgrades like that then you end up with a version of mdmon
running that is still referencing the root dir. That means the initrd
disassembling will break.

> So if mdmon has a 'cwd' and all open files in /run (and the executable
> elsewhere in the same filesystem), could it easily survive the 'kill all
> processes before unmounting /' thing?

Right now, no. But if the +t scheme would work for you we could add
that. But you'd need a good story for how to handle upgrades and arrays that
are assembled during runtime (i.e. after the initrd).

> > However, just excluding mdmon from being killed will not make this work,
> > simply because jumping into initrd only works sensibly if we can drop
> > all references to all previous mounts which requires us to have only one
> > process running at that time, and one process only.
> > 
> > It always boils down to the same thing: mdmon must be something we can
> > shutdown cleanly like every other process. Excluding it from that will
> > just move the problem around, but not fix it.
> 
> My ideal would be that you just ignore mdmon.
> After unmounting '/', you shutdown md arrays with "mdadm -Ss" and then mdmon
> will spontaneously disappear.

That's still a chicken and egg problem. We cannot unmount / until all
references to files on / are dropped. For that we need all processes
running from it terminated. That means mdmon needs to go first, and only
then we can unmount /.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02  1:16                                 ` Lennart Poettering
@ 2011-11-02  2:03                                   ` NeilBrown
  2011-11-02 13:32                                     ` Lennart Poettering
  0 siblings, 1 reply; 50+ messages in thread
From: NeilBrown @ 2011-11-02  2:03 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Dan Williams, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 5683 bytes --]

On Wed, 2 Nov 2011 02:16:15 +0100 Lennart Poettering <lennart@poettering.net>
wrote:

> On Wed, 02.11.11 11:44, NeilBrown (neilb@suse.de) wrote:
> 
> > > We nowadays jump back into the initrd when we shut down, so that the
> > > initrd disassembles everything it assembled at boot time. This for the
> > > first time enables us to ensure that all layers of our stack are in a
> > > sane state (i.e. fully offline) when we shut down, regardless in which
> > > way you stack it.
> > 
> > This sounds particularly elegant.
> > Is there some part of the filesystem, that survives through the whole process
> > - from before / is mounted until after it is unmounted?
> >
> > Presumably this would be /run if anything.
> 
> Yes. /run is usually mounted by the initrd these days, and the initrd
> itself places its binaries in /run/initramfs/ which systemd then
> pivot_root()s into at shutdown.
> 
> > mdmon must be running from the time that / becomes writable until after it
> > becomes readonly.
> 
> I'd really prefer if we could somehow make it something that isn't
> special and we could just shutdown

It must remain running until the array that it manages is read-only and will
never be written to again.  Then it can be shutdown gracefully.
It may be awkward to shut it down gracefully at the moment - I'm not sure.  I
can certainly fix that.


> 
> > If we can have it from before it is mounted until after it is unmounted, that
> > might be even better.
> 
> Well, that could work if mdmon is invoked in the initrd only. If mdmon
> is always running from the initrd this would solve the issue that it
> keeps files on the real root referenced thus making unmounting of /
> impossible.
> 
> However, there might be complexities here: what happens if the user
> creates an MD device during normal operation, so that mdmon is started
> at runtime, and not from the initrd?

Each instance of mdmon manages a set of arrays and must remain running
until all of those arrays are readonly (or shut down).  This allows it to
record that all writes have completed and mark the array as 'clean' so a
resync isn't needed at next boot.

If a user creates an array while the system is running, it will not have the
root filesystem on it.  So between unmounting the last non-root filesystem
and unmounting root it is perfectly OK to stop that mdmon.


> 
> That said I definitely prefer that if mdmon really wants to avoid
> systemd and live independent of it that it does so by being invoked from
> the initrd, so that it runs completely independently from all systemd
> book keeping. 
> 
> If this is what you want, then we could come up with a simple scheme
> like "a process owned by root who has +t set on /proc/$PID/stat" is
> excluded from systemd's killing.

You couldn't just do the equivalent of
  fuser -k /some/filesystem
  umount /some/filesystem

iterating over filesystems with '/' last?

Then anything that only uses the /run filesystem will survive.
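
To illustrate only the ordering part of that idea (not how fuser or systemd
actually work), here is a minimal C sketch. It assumes the mount points are
already available as normalized absolute path strings; the names
cmp_unmount_order and sort_for_unmount are made up for this example. Sorting
by descending path length puts nested mounts before their parents and leaves
'/' for last, since no absolute path is shorter than "/".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical comparator: longer paths first, strcmp() as a tie-break. */
int cmp_unmount_order(const void *a, const void *b)
{
    const char *pa = *(const char *const *)a;
    const char *pb = *(const char *const *)b;
    size_t la = strlen(pa);
    size_t lb = strlen(pb);

    if (la != lb)
        return la > lb ? -1 : 1;
    return strcmp(pa, pb);
}

/* Reorder mount points in place so that '/' ends up last. */
void sort_for_unmount(const char **mounts, size_t n)
{
    assert(mounts != NULL || n == 0);
    if (n == 0)
        return;
    qsort(mounts, n, sizeof *mounts, cmp_unmount_order);
}
```

A caller would then walk the sorted list and do the kill/umount step for each
entry in turn.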


> 
> But again, I really think that mdmon should just be fixed to become a
> daemon that can be shut down at any time.
> 
> > (It is possible to start a new one which replaces the old one but if that was
> > only used for version upgrades, that would be best).
> 
> If you do upgrades like that then you end up with a version of mdmon
> running that is still referencing the root dir. That means the initrd
> disassembling will break.

True.  A version upgrade would need to stash the binary in /run.
It might be better to go the 'remount-readonly - then stop mdmon' route.

> 
> > So if mdmon has a 'cwd' and all open files in /run (and the executable
> > elsewhere in the same filesystem), could it easily survive the 'kill all
> > processes before unmounting /' thing?
> 
> Right now, no. But if the +t scheme would work for you we could add
> that. But you'd need a good story for how to handle upgrades and arrays
> that are assembled during runtime (i.e. after the initrd).
> 
> > > However, just excluding mdmon from being killed will not make this work,
> > > simply because jumping into initrd only works sensibly if we can drop
> > > all references to all previous mounts which requires us to have only one
> > > process running at that time, and one process only.
> > > 
> > > It always boils down to the same thing: mdmon must be something we can
> > > shutdown cleanly like every other process. Excluding it from that will
> > > just move the problem around, but not fix it.
> > 
> > My ideal would be that you just ignore mdmon.
> > After unmounting '/', you shutdown md arrays with "mdadm -Ss" and then mdmon
> > will spontaneously disappear.
> 
> That's still a chicken and egg problem. We cannot unmount / until all
> references to files on / are dropped. For that we need all processes
> running from it terminated. That means mdmon needs to go first, and only
> then we can unmount /.
> 
> Lennart
> 

Does, or can, systemd remount '/' readonly before trying to unmount it and
allow some task to run at that point?

I guess it still needs to be able to differentiate processes that are holding
write-access to the filesystem and so need to be killed, from processes that
are only holding read-access and so can be permitted to remain.

Probably easiest for mdmon to just register itself as "Leave this until / is
readonly" - maybe by putting its pid file in
    /run/preserve-until-readonly/mdmon-devname.pid
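
A minimal C sketch of how such a pid file name could be built, purely as an
illustration: the directory is the one suggested above, but the function name
preserve_pidfile_path and the "reject names containing '/'" rule are
assumptions of this example, not anything mdmon or systemd implement.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PRESERVE_DIR "/run/preserve-until-readonly"

/*
 * Hypothetical helper: write the pid file path for one mdmon instance
 * into buf. Returns 0 on success, -1 if devname is empty, contains a
 * '/', or the result does not fit into buf.
 */
int preserve_pidfile_path(char *buf, size_t size, const char *devname)
{
    int n;

    assert(buf != NULL && devname != NULL);

    /* Keep the file inside PRESERVE_DIR: no empty or nested names. */
    if (devname[0] == '\0' || strchr(devname, '/') != NULL)
        return -1;

    n = snprintf(buf, size, "%s/mdmon-%s.pid", PRESERVE_DIR, devname);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

The shutdown code would then only need to read the pids found in that one
directory to know which processes to leave alone until / is readonly.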

I don't quite get your "+t on /proc/$PID/stat" suggestion:

# chmod +t /proc/self/stat
chmod: changing permissions of `/proc/self/stat': Operation not permitted


NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02  2:03                                   ` NeilBrown
@ 2011-11-02 13:32                                     ` Lennart Poettering
  2011-11-02 14:33                                       ` Kay Sievers
                                                         ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-11-02 13:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Williams, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Wed, 02.11.11 13:03, NeilBrown (neilb@suse.de) wrote:

> > I'd really prefer if we could somehow make it something that isn't
> > special and we could just shutdown
> 
> It must remain running until the array that it manages is read-only and will
> never be written to again.  Then it can be shutdown gracefully.
> It may be awkward to shut it down gracefully at the moment - I'm not sure.  I
> can certainly fix that.

The big thing is that if things are done that way you'll always have the
chicken and egg problem: you really need to shut down mdmon before
unmounting root, but currently you require us to do it in the other
order too.

> > > If we can have it from before it is mounted until after it is unmounted, that
> > > might be even better.
> > 
> > Well, that could work if mdmon is invoked in the initrd only. If mdmon
> > is always running from the initrd this would solve the issue that it
> > keeps files on the real root referenced thus making unmounting of /
> > impossible.
> > 
> > However, there might be complexities here: what happens if the user
> > creates an MD device during normal operation, so that mdmon is started
> > at runtime, and not from the initrd?
> 
> Each instance of mdmon manages a set of arrays and must remain running
> until all of those arrays are readonly (or shut down).  This allows it to
> record that all writes have completed and mark the array as 'clean' so a
> resync isn't needed at next boot.

Why doesn't the kernel do that on its own?

> If a user creates an array while the system is running, it will not have the
> root filesystem on it.  So between unmounting the last non-root filesystem
> and unmounting root it is perfectly OK to stop that mdmon.

Well, that complicates things quite a bit, since that way the shutdown
logic has two very different paths.

> > That said I definitely prefer that if mdmon really wants to avoid
> > systemd and live independent of it that it does so by being invoked from
> > the initrd, so that it runs completely independently from all systemd
> > book keeping. 
> > 
> > If this is what you want, then we could come up with a simple scheme
> > like "a process owned by root who has +t set on /proc/$PID/stat" is
> > excluded from systemd's killing.
> 
> You couldn't just do the equivalent of
>   fuser -k /some/filesystem
>   umount /some/filesystem
> 
> iterating over filesystems with '/' last?
>
> Then anything that only uses the /run filesystem will survive.

What we do right now is this:

kill_all_processes();
do {
     umount_all_file_systems_we_can();
     read_only_mount_all_remaining_file_systems();
} while (we_had_some_success_with_that());
jump_into_initrd();

As long as mdmon references a file from the root disk we cannot umount
it, so the loop wouldn't be effective.

> > > (It is possible to start a new one which replaces the old one but if that was
> > > only used for version upgrades, that would be best).
> > 
> > If you do upgrades like that then you end up with a version of mdmon
> > running that is still referencing the root dir. That means the initrd
> > disassembling will break.
> 
> True.  A version upgrade would need to stash the binary in /run.
> It might be better to go the 'remount-readonly - then stop mdmon'
> route.

It is not sufficient to stash the binary in /run, you'd also need to
include your own libc and in fact every single other library or file you
use.

Why? If a system is upgraded, library files are deleted and replaced by
new ones. If a process stays running with the original libraries mapped,
the file system cannot be remounted read-only: the file has only been
unlinked, and it still needs to be deleted on disk, which can only
happen once the file is no longer referenced. Hence, if the user
upgrades *any* of the files mdmon has open, we will not be able to
remount the file system these files are on read-only.
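
One way to see this situation on a running system is /proc/$PID/maps, where
the kernel appends " (deleted)" to the path of a mapped file that has since
been unlinked. The following is a minimal C sketch of that check for a single
maps line; the function name maps_line_is_deleted is made up for this
example, and it does not try to work out which file system the file was on.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if one line of /proc/$PID/maps
 * refers to a mapping whose backing file has been unlinked.
 */
bool maps_line_is_deleted(const char *line)
{
    static const char suffix[] = " (deleted)";
    const size_t slen = sizeof suffix - 1;
    size_t len;

    assert(line != NULL);

    /* Ignore the trailing newline that fgets() would leave in place. */
    len = strlen(line);
    while (len > 0 && line[len - 1] == '\n')
        len--;

    if (len < slen)
        return false;
    return memcmp(line + len - slen, suffix, slen) == 0;
}
```

A long-running process that reports such a mapping for a file on / after an
upgrade is exactly the case described above.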

> > That's still a chicken and egg problem. We cannot unmount / until all
> > references to files on / are dropped. For that we need all processes
> > running from it terminated. That means mdmon needs to go first, and only
> > then we can unmount /.
> > 
> > Lennart
> > 
> 
> Does, or can, systemd remount '/' readonly before trying to unmount it and
> allow some task to run at that point?

Well, we try that as last resort.

> I guess it still needs to be able to differentiate processes that are holding
> write-access to the filesystem and so need to be killed, from processes that
> are only holding read-access and so can be permitted to remain.

Basically what I am saying here is that it's a really bad idea that mdmon
insists on staying around until after the file system is unmounted, even
though it itself is running from it. And the fact that mdmon doesn't
have any of those files open for writing doesn't help you very much
here, due to the upgrade/delete issue.

> I don't quite get your "+t on /proc/$PID/stat" suggestion:
> 
> # chmod +t /proc/self/stat
> chmod: changing permissions of `/proc/self/stat': Operation not permitted

Uh oh, I was sure that one could actually change the access mode of
files in /proc. Seems I was wrong. An alternative solution might be to
do argv[0][0]='!' in your code, to tell systemd to exclude your process
from killing. This would be inspired by shells changing the first char
of argv[0] to "-" for login shells.
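
A minimal sketch of what the daemon side of that could look like in C. The
function name mark_argv0 is made up for this example and this is not code
from mdmon. On Linux, /proc/$PID/cmdline is read from the process's own
argument memory, so an in-place write like this should be visible there.
Note that it overwrites the first character of the name rather than
prepending to it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: overwrite the first character of argv[0] with
 * a marker character. Returns 1 if the marker was written, 0 if
 * argv[0] is missing or empty.
 */
int mark_argv0(char **argv, char marker)
{
    assert(marker != '\0');

    if (argv == NULL || argv[0] == NULL || argv[0][0] == '\0')
        return 0;

    argv[0][0] = marker;
    return 1;
}
```

It would be called early in main() with the argv that main() received, e.g.
mark_argv0(argv, '!').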

But again, I believe the right solution is to fix mdmon to make it
something that can be shut down normally at any time. That might mean
that some of its code has to move to the kernel, but otherwise you'll
always have this chicken and egg problem, and you cannot fix it properly.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 13:32                                     ` Lennart Poettering
@ 2011-11-02 14:33                                       ` Kay Sievers
  2011-11-02 15:17                                         ` Lennart Poettering
  2011-11-02 18:16                                         ` Williams, Dan J
  2011-11-07  2:52                                       ` NeilBrown
  2011-11-08  0:11                                       ` Michal Soltys
  2 siblings, 2 replies; 50+ messages in thread
From: Kay Sievers @ 2011-11-02 14:33 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, linux-raid, Dan Williams, Andrey Borzenkov, systemd-devel

On Wed, Nov 2, 2011 at 14:32, Lennart Poettering <lennart@poettering.net> wrote:
> On Wed, 02.11.11 13:03, NeilBrown (neilb@suse.de) wrote:
>
>> > I'd really prefer if we could somehow make it something that isn't
>> > special and we could just shutdown
>>
>> It must remain running until the array that it manages is read-only and will
>> never be written to again.  Then it can be shutdown gracefully.
>> It may be awkward to shut it down gracefully at the moment - I'm not sure.  I
>> can certainly fix that.
>
> The big thing is that if things are done that way you'll always have the
> chicken and egg problem: you really need to shut down mdmon before
> unmounting root, but currently you require us to do it in the other
> order too.

Yeah, that's just madness.

I talked to Harald, and the currently preferred idea is the version
where we start mdmon in the initramfs and never touch it again, and it
runs until the initramfs unmounts the rootfs and shuts down the box.

In that picture, the mdmon process is conceptually more like a kernel
thread than a userspace process. It can not be updated, can not be
restarted. The only way to update it is to rebuild initramfs and
reboot the box.

>> Each instance of mdmon manages a set of arrays and must remain running
>> until all of those arrays are readonly (or shut down).  This allows it to
>> record that all writes have completed and mark the array as 'clean' so a
>> resync isn't needed at next boot.
>
> Why doesn't the kernel do that on its own?

Because somebody was naive enough to think that userspace can tear
down the base it lives on, which in reality is just a total mess in
the real world. :)

>> True.  A version upgrade would need to stash the binary in /run.
>> It might be better to go the 'remount-readonly - then stop mdmon'
>> route.
>
> It is not sufficient to stash the binary in /run, you'd also need to
> include your own libc and in fact every single other library or file you
> use.

I don't think any of these update games in a running system make much
sense in the end.

>> I guess it still needs to be able to differentiate processes that are holding
>> write-access to the filesystem and so need to be killed, from processes that
>> are only holding read-access and so can be permitted to remain.

> But again, I believe the right solution is to fix mdmon to make it
> something that can be shut down normally at any time. That might mean
> that some of its code has to move to the kernel, but otherwise you'll
> always have this chicken and egg problem, and you cannot fix it properly.

That would be the ideal solution. Having the rootfs depend on a
tool that runs off the rootfs just asks for serious trouble. If all
that can't move to the kernel, the initramfs-only solution, with the
constraints mentioned above, seems like the best option.

People who like to put their rootfs on a userspace managed raid device
just get what they asked for. :)

Kay

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 14:33                                       ` Kay Sievers
@ 2011-11-02 15:17                                         ` Lennart Poettering
  2011-11-02 15:21                                           ` Kay Sievers
  2011-11-02 17:21                                           ` Williams, Dan J
  2011-11-02 18:16                                         ` Williams, Dan J
  1 sibling, 2 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-11-02 15:17 UTC (permalink / raw)
  To: Kay Sievers
  Cc: NeilBrown, linux-raid, Dan Williams, Andrey Borzenkov, systemd-devel

On Wed, 02.11.11 15:33, Kay Sievers (kay.sievers@vrfy.org) wrote:

> > The big thing is that if things are done that way you'll always have the
> > chicken and egg problem: you really need to shut down mdmon before
> > unmounting root, but currently you require us to do it in the other
> > order too.
> 
> Yeah, that's just madness.
> 
> I talked to Harald, and the currently preferred idea is the version
> where we start mdmon in the initramfs and never touch it again, and it
> runs until the initramfs unmounts the rootfs and shuts down the box.
> 
> In that picture, the mdmon process is conceptually more like a kernel
> thread than a userspace process. It can not be updated, can not be
> restarted. The only way to update it is to rebuild initramfs and
> reboot the box.

OK, I guess that means we'll need to define a way how we can recognize
the process then, to avoid killing it by systemd, similar to how we
exclude kernel threads from killing.

Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0] here, to
detect whether something is a process to avoid killing. Question is
which char to choose for that. I am tempted to use '@'. 

That means we'd:

a) patch systemd to check whether argv[0][0] of a process is '@' and the
process is owned by root, and if so exclude it from killing on shutdown.

b) patch mdmon to set argv[0][0] of itself to '@' iff it is running from
the initrd. If it is run from the main system it should not set that and
just be shut down like any other service.

c) make sure that mdmon run from the initrd is never upgraded during
normal operation, only via dracut rebuild and reboot.

If this is acceptable I will cook up the patch for a).
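
For illustration, the check in a) could look roughly like the following C
sketch. This is not the actual patch: the enum and the function name
classify_process are invented here, and the caller is assumed to have
already read the raw contents of /proc/$PID/cmdline and looked up the
owning UID.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical result type; these names do not come from systemd. */
enum shutdown_action {
    SKIP_KERNEL_THREAD,  /* empty cmdline: treated as a kernel thread */
    SKIP_MARKED_BY_ROOT, /* argv[0][0] is '@' and the owner is root */
    KILL_PROCESS         /* everything else is killed at shutdown */
};

/*
 * cmdline/len: raw contents of /proc/$PID/cmdline (NUL-separated argv).
 * owner:       UID that owns the process.
 */
enum shutdown_action classify_process(const char *cmdline, size_t len,
                                      uid_t owner)
{
    assert(cmdline != NULL || len == 0);

    if (len == 0)
        return SKIP_KERNEL_THREAD;
    if (cmdline[0] == '@' && owner == 0)
        return SKIP_MARKED_BY_ROOT;
    return KILL_PROCESS;
}
```

The root ownership test matters here: without it any unprivileged process
could rename itself to escape the final kill.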

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: systemd kills mdmon if it was started manually by user
  2011-11-02 15:17                                         ` Lennart Poettering
@ 2011-11-02 15:21                                           ` Kay Sievers
  2011-11-02 15:29                                             ` [systemd-devel] " Lennart Poettering
  2011-11-02 17:21                                           ` Williams, Dan J
  1 sibling, 1 reply; 50+ messages in thread
From: Kay Sievers @ 2011-11-02 15:21 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, linux-raid, Dan Williams, Andrey Borzenkov, systemd-devel

On Wed, Nov 2, 2011 at 16:17, Lennart Poettering <lennart@poettering.net> wrote:
> Kernel threads we detect by checking whether /proc/$PID/cmdline is
> empty, hence I'd suggest we use the first char of argv[0][0] here, to
> detect whether something is a process to avoid killing. Question is
> which char to choose for that. I am tempted to use '@'.

Maybe introduce a 'initramfs' cgroup and move the pids there?

Kay

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 15:21                                           ` Kay Sievers
@ 2011-11-02 15:29                                             ` Lennart Poettering
  2011-11-02 22:18                                               ` Williams, Dan J
  0 siblings, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-11-02 15:29 UTC (permalink / raw)
  To: Kay Sievers
  Cc: NeilBrown, linux-raid, Dan Williams, Andrey Borzenkov, systemd-devel

On Wed, 02.11.11 16:21, Kay Sievers (kay.sievers@vrfy.org) wrote:

> 
> On Wed, Nov 2, 2011 at 16:17, Lennart Poettering <lennart@poettering.net> wrote:
> > Kernel threads we detect by checking whether /proc/$PID/cmdline is
> > empty, hence I'd suggest we use the first char of argv[0][0] here, to
> > detect whether something is a process to avoid killing. Question is
> > which char to choose for that. I am tempted to use '@'.
> 
> Maybe introduce a 'initramfs' cgroup and move the pids there?

Well, in which hierarchy? I am a bit concerned about having other
subsystems muck with the systemd cgroup hierarchy, before systemd has
set it up.

I can see some elegance in having all code from the initrd that remains
running during boot in a cgroup of its own, but that's probably
orthogonal to finding a way to recognize processes not to kill at
shutdown. Why? Because there's stuff like Plymouth which also stays
around from the initramfs, but actually is something we *do* want to
kill on shutdown.


Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 15:17                                         ` Lennart Poettering
  2011-11-02 15:21                                           ` Kay Sievers
@ 2011-11-02 17:21                                           ` Williams, Dan J
  2011-11-02 23:35                                             ` Lennart Poettering
  1 sibling, 1 reply; 50+ messages in thread
From: Williams, Dan J @ 2011-11-02 17:21 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kay Sievers, NeilBrown, linux-raid, Andrey Borzenkov, systemd-devel

On Wed, Nov 2, 2011 at 8:17 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Wed, 02.11.11 15:33, Kay Sievers (kay.sievers@vrfy.org) wrote:
>
>> > The big thing is that if things are done that way you'll always have the
>> > chicken and egg problem: you really need to shut down mdmon before
>> > unmounting root, but currently you require us to do it in the other
>> > order too.
>>
>> Yeah, that's just madness.
>>
>> I talked to Harald, and the currently preferred idea is the version
>> where we start mdmon in the initramfs and never touch it again, and it
>> runs until the initramfs unmounts the rootfs and shuts down the box.
>>
>> In that picture, the mdmon process is conceptually more like a kernel
>> thread than a userspace process. It can not be updated, can not be
>> restarted. The only way to update it is to rebuild initramfs and
>> reboot the box.
>
> OK, I guess that means we'll need to define a way how we can recognize
> the process then, to avoid killing it by systemd, similar to how we
> exclude kernel threads from killing.
>
> Kernel threads we detect by checking whether /proc/$PID/cmdline is
> empty, hence I'd suggest we use the first char of argv[0][0] here, to
> detect whether something is a process to avoid killing. Question is
> which char to choose for that. I am tempted to use '@'.
>
> That means we'd:
>
> a) patch systemd to check whether argv[0][0] of a process is '@' and
> owned by root and exclude it from killing on shutdown.
>
> b) patch mdmon to set argv[0][0] of itself to '@' iff it is running from
> the initrd. If it is run from the main system it should not set that and
> just be shut down like any other service.

Well, there are two cases to consider:

1/ user starts the array manually and stops it with mdadm -Ss (mdmon
automatically shuts down).  No need for a service; mdmon just follows
the lifespan of the array.

2/ user starts the array but then expects it to be around until system shutdown

In the latter case let the initramfs-mdmon take over all arrays with
"mdmon --takeover --all".  But if all arrays may eventually be
re-parented to an mdmon instance from /run, why not always start mdmon
from there?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 14:33                                       ` Kay Sievers
  2011-11-02 15:17                                         ` Lennart Poettering
@ 2011-11-02 18:16                                         ` Williams, Dan J
  2011-11-02 18:49                                           ` Kay Sievers
  1 sibling, 1 reply; 50+ messages in thread
From: Williams, Dan J @ 2011-11-02 18:16 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, NeilBrown, linux-raid, Andrey Borzenkov,
	systemd-devel

On Wed, Nov 2, 2011 at 7:33 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
> People who like to put their rootfs on a userspace managed raid device
> just get what they asked for. :)

Proper care and feeding of mdmon and userspace managed block devices /
filesystems is a solvable problem.  To me the ":)" runs the risk of
implying we don't think we can get this right.

--
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 18:16                                         ` Williams, Dan J
@ 2011-11-02 18:49                                           ` Kay Sievers
  2011-11-02 19:31                                             ` Williams, Dan J
  0 siblings, 1 reply; 50+ messages in thread
From: Kay Sievers @ 2011-11-02 18:49 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Lennart Poettering, NeilBrown, linux-raid, Andrey Borzenkov,
	systemd-devel

On Wed, Nov 2, 2011 at 19:16, Williams, Dan J <dan.j.williams@intel.com> wrote:
> On Wed, Nov 2, 2011 at 7:33 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>> People who like to put their rootfs on a userspace managed raid device
>> just get what they asked for. :)
>
> Proper care and feeding of mdmon and userspace managed block devices /
> filesystems is a solvable problem.  To me the ":)" runs the risk of
> implying we don't think we can get this right.

It implied that I think it is totally insane what you guys try to
accomplish. Managing the rootfs blockdev with tools contained in the
rootfs itself is just crazy. No smiley this time.

Kay

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 18:49                                           ` Kay Sievers
@ 2011-11-02 19:31                                             ` Williams, Dan J
  2011-11-02 19:51                                               ` Kay Sievers
  0 siblings, 1 reply; 50+ messages in thread
From: Williams, Dan J @ 2011-11-02 19:31 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, NeilBrown, linux-raid, Andrey Borzenkov,
	systemd-devel

On Wed, Nov 2, 2011 at 11:49 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
> On Wed, Nov 2, 2011 at 19:16, Williams, Dan J <dan.j.williams@intel.com> wrote:
>> On Wed, Nov 2, 2011 at 7:33 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>>> People who like to put their rootfs on a userspace managed raid device
>>> just get what they asked for. :)
>>
>> Proper care and feeding of mdmon and userspace managed block devices /
>> filesystems is a solvable problem.  To me the ":)" runs the risk of
>> implying we don't think we can get this right.
>
> It implied that I think it is totally insane what you guys try to
> accomplish. Managing the rootfs blockdev with tools contained in the
> rootfs itself is just crazy. No smiley this time.
>

Yes, much clearer.  Which is why the "never let mdmon run from an fs
it is managing" is better than the current dance that was implemented
to address the need to drop initramfs memory and get around a lack of
having a filesystem (like /run) that persisted from early boot.  But
we now run back into the problem of pinning initramfs memory.  Does
systemd already expect that the full initramfs sticks around to handle
shutdown?  If so then we have come full circle and don't really need
the "mdmon --takeover" functionality versus just letting the
initramfs-mdmon handle the entire lifetime of the rootfs blockdev.

--
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 19:31                                             ` Williams, Dan J
@ 2011-11-02 19:51                                               ` Kay Sievers
  0 siblings, 0 replies; 50+ messages in thread
From: Kay Sievers @ 2011-11-02 19:51 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Lennart Poettering, NeilBrown, linux-raid, Andrey Borzenkov,
	systemd-devel

On Wed, Nov 2, 2011 at 20:31, Williams, Dan J <dan.j.williams@intel.com> wrote:
> On Wed, Nov 2, 2011 at 11:49 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>> On Wed, Nov 2, 2011 at 19:16, Williams, Dan J <dan.j.williams@intel.com> wrote:
>>> On Wed, Nov 2, 2011 at 7:33 AM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>>>> People who like to put their rootfs on a userspace managed raid device
>>>> just get what they asked for. :)
>>>
>>> Proper care and feeding of mdmon and userspace managed block devices /
>>> filesystems is a solvable problem.  To me the ":)" runs the risk of
>>> implying we don't think we can get this right.
>>
>> It implied that I think it is totally insane what you guys try to
>> accomplish. Managing the rootfs blockdev with tools contained in the
>> rootfs itself is just crazy. No smiley this time.
>>
>
> Yes, much clearer.  Which is why the "never let mdmon run from an fs
> it is managing" is better than the current dance that was implemented
> to address the need to drop initramfs memory and get around a lack of
> having a filesystem (like /run) that persisted from early boot.  But
> we now run back into the problem of pinning initramfs memory.  Does
> systemd already expect that the full initramfs sticks around to handle
> shutdown?  If so then we have come full circle and don't really need
> the "mdmon --takeover" functionality versus just letting the
> initramfs-mdmon handle their entire lifetime of the rootfs blockdev.

It all depends on the initramfs implementation. Systemd is not
involved here and has no knowledge about what was left behind, it just
checks if there is a binary in /run provided by initramfs, and then it
calls this binary instead of just bringing down the box itself.

So far only dracut implements this shutdown logic, which simply goes
back to the initramfs and disassembles/shuts down everything that was
assembled before the initramfs started the real init.

I wouldn't be surprised if we see more of these use cases from
subsystems which put their rootfs on something that needs to be
managed from userspace.

The pinned memory for the tools in initramfs that stay around in tmpfs
is probably the price to pay for these kinds of setups of the rootfs,
when subsystems want to avoid adding the needed logic to the kernel to
safely shut down the rootfs.

Kay

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 15:29                                             ` [systemd-devel] " Lennart Poettering
@ 2011-11-02 22:18                                               ` Williams, Dan J
  2011-11-02 23:39                                                 ` Lennart Poettering
  0 siblings, 1 reply; 50+ messages in thread
From: Williams, Dan J @ 2011-11-02 22:18 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kay Sievers, NeilBrown, linux-raid, Andrey Borzenkov, systemd-devel

On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Wed, 02.11.11 16:21, Kay Sievers (kay.sievers@vrfy.org) wrote:
>
>>
>> On Wed, Nov 2, 2011 at 16:17, Lennart Poettering <lennart@poettering.net> wrote:
>> > Kernel threads we detect by checking whether /proc/$PID/cmdline is
>> > empty, hence I'd suggest we use the first char of argv[0][0] here, to
>> > detect whether something is a process to avoid killing. Question is
>> > which char to choose for that. I am tempted to use '@'.
>>
>> Maybe introduce a 'initramfs' cgroup and move the pids there?
>
> Well, in which hierarchy? I am a bit concerned about having other
> subsystems muck with the systemd cgroup hierarchy, before systemd has
> set it up.
>
> I can see some elegance in having all code from the initrd that remains
> running during boot in a cgroup of its own, but that's probably
> orthogonal to finding a way to recognize processes not to kill at
> shutdown. Why? Because there's stuff like Plymouth which also stays
> around from the initramfs, but actually is something we *do* want to
> kill on shutdown.

So how about, rather than binaries marking themselves as "please
don't kill me" via argv[][], shutdown just avoids processes whose
/proc/$PID/cmdline starts with /run/initramfs?  Then it's up to the
initramfs, through where it runs the binary from, to determine which
instances it wants provenance over versus leaving to the init system.

For manually started arrays maybe we should arrange for an
initramfs-started-mdmon to spawn new instances for user started
containers, rather than using the local /sbin/mdmon.  Then the "mdadm
-Ss" initiated by /run/initramfs/shutdown can reliably stop any md
device regardless of how it was started.

--
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 17:21                                           ` Williams, Dan J
@ 2011-11-02 23:35                                             ` Lennart Poettering
  0 siblings, 0 replies; 50+ messages in thread
From: Lennart Poettering @ 2011-11-02 23:35 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Kay Sievers, NeilBrown, linux-raid, Andrey Borzenkov, systemd-devel

On Wed, 02.11.11 10:21, Williams, Dan J (dan.j.williams@intel.com) wrote:

> > That means we'd:
> >
> > a) patch systemd to check whether argv[0][0] of a process is '@' and
> > owned by root and exclude it from killing on shutdown.
> >
> > b) patch mdmon to set argv[0][0] of itself to '@' iff it is running from
> > the initrd. If it is run from the main system it should not set that and
> > just be shut down like any other service.
> 
> Well, there are two cases to consider:
> 
> 1/ user starts the array manually and stops it with mdadm -Ss (mdmon
> automatically shuts down).  No need for a service mdmon just follows
> the lifespan of the array.
> 
> 2/ user starts the array but then expects it to be around until system shutdown
> 
> In the latter case let the initramfs-mdmon takeover all arrays with
> "mdmon --takeover --all".  But if all arrays may eventually be
> re-parented to an mdmon instance from /run, why not always start mdmon
> from there?

Well I am not sure how mdmon works, but let's say you booted up with an
initrd lacking mdmon. Then, while the machine is up you set up some md
device, and start mdmon for that. At this point it will be independent
of the initrd. But that should be OK since at shutdown time it can be
detached cleanly without any special magic, too, since mdmon is not
stored on that md device. So if you boot from md you need mdmon in the
initrd. If you just use md outside of the root disk, then running mdmon
as a normal service (i.e. one that is shut down like any other) should
be perfectly fine.

This is why I suggested that only mdmon run from the initrd should set
argv[0][0] = '@', because only that one needs the special handling, as
it cannot be terminated properly on shutdown. The one running from the
normal system can be treated like a standard systemd service.
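
To make the proposal concrete, here is a minimal sketch of the decision
a shutdown-time killer would make per process. It is not systemd code;
the function names and the three return values are invented for this
example. The inputs are the raw /proc/$PID/cmdline content and the
owning UID.

```python
import os


def classify_for_shutdown(cmdline: bytes, uid: int) -> str:
    """Decide what a shutdown-time killer would do with one process.

    `cmdline` is the raw content of /proc/$PID/cmdline (NUL-separated
    argv). Return values are labels invented for this sketch.
    """
    if not cmdline:
        # Empty cmdline: treated as a kernel thread, nothing to kill.
        return "kernel-thread"
    if uid == 0 and cmdline[:1] == b"@":
        # Root-owned and argv[0][0] == '@': excluded from killing.
        return "exempt"
    return "kill"


def classify_pid(pid: int) -> str:
    """Read both inputs from /proc for a live process (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        cmdline = f.read()
    # The owner of /proc/$PID is normally the process's effective UID.
    return classify_for_shutdown(cmdline, os.stat(f"/proc/{pid}").st_uid)
```

The root-ownership test is there so that an unprivileged user cannot
exempt arbitrary processes just by renaming argv[0].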

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 22:18                                               ` Williams, Dan J
@ 2011-11-02 23:39                                                 ` Lennart Poettering
  2011-11-03  0:28                                                   ` Williams, Dan J
  0 siblings, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-11-02 23:39 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Kay Sievers, NeilBrown, linux-raid, Andrey Borzenkov, systemd-devel

On Wed, 02.11.11 15:18, Williams, Dan J (dan.j.williams@intel.com) wrote:

> 
> On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering
> <lennart@poettering.net> wrote:
> > On Wed, 02.11.11 16:21, Kay Sievers (kay.sievers@vrfy.org) wrote:
> >
> >>
> >> On Wed, Nov 2, 2011 at 16:17, Lennart Poettering <lennart@poettering.net> wrote:
> >> > Kernel threads we detect by checking whether /proc/$PID/cmdline is
> >> > empty, hence I'd suggest we use the first char of argv[0][0] here, to
> >> > detect whether something is a process to avoid killing. Question is
> >> > which char to choose for that. I am tempted to use '@'.
> >>
> >> Maybe introduce a 'initramfs' cgroup and move the pids there?
> >
> > Well, in which hierarchy? I am a bit concerned about having other
> > subsystems muck with the systemd cgroup hierarchy, before systemd has
> > set it up.
> >
> > I can see some elegance in having all code from the initrd that remains
> > running during boot in a cgroup of its own, but that's probably
> > orthogonal to finding a way to recognize processes not to kill at
> > shutdown. Why? Because there's stuff like Plymouth which also stays
> > around from the initramfs, but actually is something we *do* want to
> > kill on shutdown.
> 
> So how about rather than binaries self modifying themselves as "please
> don't kill me" with argv[][], shutdown can just avoid process where
> /proc/$PID/cmdline starts with /run/initramfs?  Then it's up to  where
> the initramfs runs the binary to determine which instances it wants
> provenance over versus leaving to the init system.

Nope, whether something should be excluded from killing during shutdown
is orthogonal to being part of the initramfs. For example, Plymouth
(i.e. the graphical boot splash thingy) is started from the initrd too,
but we definitely want to kill it on shutdown.

I am a bit concerned about checks against paths since the initrd might
play some namespacing games and the paths might not appear to the main
system the way you'd expect.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 23:39                                                 ` Lennart Poettering
@ 2011-11-03  0:28                                                   ` Williams, Dan J
  0 siblings, 0 replies; 50+ messages in thread
From: Williams, Dan J @ 2011-11-03  0:28 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Kay Sievers, NeilBrown, linux-raid, Andrey Borzenkov, systemd-devel

On Wed, Nov 2, 2011 at 4:39 PM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Wed, 02.11.11 15:18, Williams, Dan J (dan.j.williams@intel.com) wrote:
>
>>
>> On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering
>> <lennart@poettering.net> wrote:
>> > On Wed, 02.11.11 16:21, Kay Sievers (kay.sievers@vrfy.org) wrote:
>> >
>> >>
>> >> On Wed, Nov 2, 2011 at 16:17, Lennart Poettering <lennart@poettering.net> wrote:
>> >> > Kernel threads we detect by checking whether /proc/$PID/cmdline is
>> >> > empty, hence I'd suggest we use the first char of argv[0][0] here, to
>> >> > detect whether something is a process to avoid killing. Question is
>> >> > which char to choose for that. I am tempted to use '@'.
>> >>
>> >> Maybe introduce a 'initramfs' cgroup and move the pids there?
>> >
>> > Well, in which hierarchy? I am a bit concerned about having other
>> > subsystems muck with the systemd cgroup hierarchy, before systemd has
>> > set it up.
>> >
>> > I can see some elegance in having all code from the initrd that remains
>> > running during boot in a cgroup of its own, but that's probably
>> > orthogonal to finding a way to recognize processes not to kill at
>> > shutdown. Why? Because there's stuff like Plymouth which also stays
>> > around from the initramfs, but actually is something we *do* want to
>> > kill on shutdown.
>>
>> So how about rather than binaries self modifying themselves as "please
>> don't kill me" with argv[][], shutdown can just avoid process where
>> /proc/$PID/cmdline starts with /run/initramfs?  Then it's up to  where
>> the initramfs runs the binary to determine which instances it wants
>> provenance over versus leaving to the init system.
>
> Nope, whether something should be excluded of killing during shutdown is
> orthogonal to being part of the initramfs. For example, Plymouth
> (i.e. the graphical boot splash thingy) is started form initrd too, but
> we definitely want to kill it on shut down.

In the plymouth case the path would be /bin/plymouth, the initramfs
would need to take special care to run mdmon from /run/initramfs to
identify it as needing the initramfs environment to carry out its
shutdown.

> I am a bit concerned about checks against paths since initrd might play
> some namespacing games and the paths might not appear to the main system
> they way you'd expect.

The initramfs needs to be modified to either tell mdmon it is running
from the initramfs or arrange for /proc/$MDMON_PID/cwd to appear to be
from /run/initramfs.  I only like the latter because it works with
existing mdmon binaries, but we may need shutdown to always leave
mdmon alone...

For user started md arrays the shutdown sequence still goes:

killall --> umount

...and we would need to express:

killall (but mdmon) --> umount --> mdadm -Ss (stops all not in use arrays)

So maybe we do the argv "@" tagging in all cases and systemd never
kills mdmon but arranges for all (stoppable) md devices to be stopped,
then rely on /run/initramfs/shutdown to handle the rootfs blockdev.

--
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 13:32                                     ` Lennart Poettering
  2011-11-02 14:33                                       ` Kay Sievers
@ 2011-11-07  2:52                                       ` NeilBrown
  2011-11-07  3:42                                         ` Kay Sievers
  2011-11-07 12:00                                         ` Lennart Poettering
  2011-11-08  0:11                                       ` Michal Soltys
  2 siblings, 2 replies; 50+ messages in thread
From: NeilBrown @ 2011-11-07  2:52 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Dan Williams, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Wed, 2 Nov 2011 14:32:25 +0100 Lennart Poettering <lennart@poettering.net>
wrote:

> On Wed, 02.11.11 13:03, NeilBrown (neilb@suse.de) wrote:

> > Each instance of mdmon manages a set of arrays and must remain running
> > until all of those arrays are readonly (or shut down).  This allows it to
> > record that all writes have completed and mark the array as 'clean' so a
> > resync isn't needed at next boot.
> 
> Why doesn't the kernel do that on its own?

Because the kernel doesn't know about the format of the metadata that
describes the array.

> > 
> > You couldn't just do the equivalent of
> >   fuser -k /some/filesystem
> >   umount /some/filesystem
> > 
> > iterating over filesystems with '/' last?
> >
> > Then anything that only uses the /run filesystem will survive.
> 
> What we do right now is this:
> 
> kill_all_processes();
> do {
>      umount_all_file_systems_we_can();
>      read_only_mount_all_remaining_file_systems();
> } while (we_had_some_success_with_that());
> jump_into_initrd();
> 
> As long as mdmon references a file from the root disk we cannot umount
> it, so the loop wouldn't be effective.

What exactly is "kill_all_processes()"?  Is it SIGTERM, or SIGKILL, or both
with a gap, or ???

I assume a SIGKILL.  I don't mind a SIGTERM and it could be useful to
expedite mdmon cleaning up.

However there is an important piece missing.  When you remount,ro a
filesystem, the block device doesn't get told, so it thinks it is still open
read/write.  So md cannot tell mdmon that the array is now read-only.
It would make a lot of sense for mdmon to exit after receiving a SIGTERM as
soon as the device is marked read-only.  But it just doesn't know.

We can probably fix that, but that doesn't really help for now.

I think I would like:

 - add to the above loop "stop any virtual devices that we can".
   Exactly how to do that if /proc and /sys are already unmounted
   is unclear.  Are one or both of these kept around somewhere?

 - allow processes to be marked some way so they get SIGTERM but not
   SIGKILL.  I'm happy adding a magic char to argv[0].

We should be able to make it work with those changes - if they are possible.

Thanks,

NeilBrown



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-07  2:52                                       ` NeilBrown
@ 2011-11-07  3:42                                         ` Kay Sievers
  2011-11-07  4:30                                           ` NeilBrown
  2011-11-07 12:00                                         ` Lennart Poettering
  1 sibling, 1 reply; 50+ messages in thread
From: Kay Sievers @ 2011-11-07  3:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: Lennart Poettering, linux-raid, Dan Williams, Andrey Borzenkov,
	systemd-devel

On Mon, Nov 7, 2011 at 03:52, NeilBrown <neilb@suse.de> wrote:

> However there is an important piece missing.  When you remount,ro a
> filesystem, the block device doesn't get told so it thinks it is still open
> read/write.  So md cannot tell mdmon that the array is now read-only

That ro/rw flag is visible in /proc/self/mountinfo. Shouldn't it be
possible for mdmon to poll() that file and let the kernel wake stuff
up when the ro/rw flag changes, like we do for the usual mount changes
already?
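
A small sketch of what that could look like, in Python for brevity.
This is not mdmon code and both function names are invented here. It
relies on the mountinfo line layout documented in proc(5) and on the
kernel reporting mount-table changes as POLLERR/POLLPRI on that file.
Note that it only sees the file-system-level flag, not the state of
the underlying block device.

```python
import select


def mount_is_read_only(mountinfo_line: str) -> bool:
    """Report whether one /proc/self/mountinfo line describes a ro mount."""
    fields = mountinfo_line.split()
    sep = fields.index("-")                   # ends the optional fields
    mount_opts = fields[5].split(",")         # per-mount options
    super_opts = fields[sep + 3].split(",")   # per-superblock options
    return "ro" in mount_opts or "ro" in super_opts


def wait_for_mount_change(timeout_ms: int) -> bool:
    """Block until the mount table changes or the timeout expires."""
    with open("/proc/self/mountinfo") as f:
        poller = select.poll()
        # Mount-table changes are signalled as an exceptional condition.
        poller.register(f, select.POLLERR | select.POLLPRI)
        return bool(poller.poll(timeout_ms))
```

A caller would loop on wait_for_mount_change(), re-read the file after
each wakeup, and apply mount_is_read_only() to the lines it cares about.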

Kay

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-07  3:42                                         ` Kay Sievers
@ 2011-11-07  4:30                                           ` NeilBrown
  0 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2011-11-07  4:30 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, linux-raid, Dan Williams, Andrey Borzenkov,
	systemd-devel

On Mon, 7 Nov 2011 04:42:54 +0100 Kay Sievers <kay.sievers@vrfy.org> wrote:

> On Mon, Nov 7, 2011 at 03:52, NeilBrown <neilb@suse.de> wrote:
> 
> > However there is an important piece missing.  When you remount,ro a
> > filesystem, the block device doesn't get told so it thinks it is still open
> > read/write.  So md cannot tell mdmon that the array is now read-only
> 
> That ro/rw flag is visible in /proc/self/mountinfo, shouldn't it be
> possible for mdmon to poll() that file and let the kernel wake stuff
> up when the ro/rw flag changes, like we do for the usual mount changes
> already?
> 
> Kay

The ro/rw flag for file systems is in /proc/self/mountinfo.

However I want the ro/rw flag for the block device.
A block device can be partitioned, so it might have multiple filesystems on
it, and it might have swap too, or a dm table, or another md device, or an
open file descriptor, or ....

Yes, I could maybe parse various different files and try to work out what is
going on.  But the kernel can easily *know* what is going on.

Making this work "perfectly" would require md dropping its write-access to
member devices when the last write-access to the top level device goes.  And
the same for dm and loop and .....

But just filesystems would go a long way to catching the common cases
correctly.

Thanks,
NeilBrown



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-07  2:52                                       ` NeilBrown
  2011-11-07  3:42                                         ` Kay Sievers
@ 2011-11-07 12:00                                         ` Lennart Poettering
  2011-11-07 19:09                                           ` Williams, Dan J
  1 sibling, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-11-07 12:00 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Williams, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Mon, 07.11.11 13:52, NeilBrown (neilb@suse.de) wrote:

> > Why doesn't the kernel do that on its own?
> 
> Because the kernel doesn't know about the format of the metadata that
> describes the array.

Yupp, my suggestion would be to change that. 

> > What we do right now is this:
> > 
> > kill_all_processes();
> > do {
> >      umount_all_file_systems_we_can();
> >      read_only_mount_all_remaining_file_systems();
> > } while (we_had_some_success_with_that());
> > jump_into_initrd();
> > 
> > As long as mdmon references a file from the root disk we cannot umount
> > it, so the loop wouldn't be effective.
> 
> What exactly is "kill_all_processes()"?   is it SIGTERM or SIGKILL or both
> with a gap or ???

SIGTERM followed by SIGKILL after 5s if the programs do not react to
that in time. But note that this logic only applies to processes which
for some reason managed to escape systemd's usual cgroup-based killing
logic. Normal services are hence already killed at that time, and only
processes which moved themselves out of any cgroup or for which the
service files disabled killing might survive to this point.
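
For illustration, the "SIGTERM, then SIGKILL after a grace period"
pattern can be sketched as below. This is a simplified stand-in rather
than systemd's implementation: it handles a single child process held
as a subprocess.Popen object, and the function name is made up.

```python
import signal
import subprocess


def terminate_then_kill(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the exit status as reported by Popen (negative signal number
    if the process died from a signal).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process did not react to SIGTERM in time: use the hammer.
        proc.kill()
        return proc.wait()
```

A process that handles SIGTERM and exits cleanly within the grace
period never sees the SIGKILL.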

> I assume a SIGKILL.  I don't mind a SIGTERM and it could be useful to
> expedite mdmon cleaning up.
> 
> However there is an important piece missing.  When you remount,ro a
> filesystem, the block device doesn't get told so it thinks it is still open
> read/write.  So md cannot tell mdmon that the array is now read-only
> It would make a lot of sense for mdmon to exit after receiving a SIGTERM as
> soon as the device is marked read-only.  But it just doesn't know.

As mentioned by Kay, you can get notifications for this by poll()ing on
/proc/self/mountinfo. Note again however, that we kill first, and only
then try to unmount/remount.

> We can probably fix that, but that doesn't really help for now.
> 
> I think I would like:
> 
>  - add to the above loop "stop any virtual devices that we can".
>    Exactly how to do that if /proc and /sys are already unmounted
>    is unclear.  Is one or both of these kept around somewhere?

/proc and /sys are not unmounted in this loop. Since they are virtual
API file systems, we exclude them from this logic and leave them around
until the initrd unmounts them if it wants to.

Actually, in the loop above there are three more steps: in each
iteration we also try to detach all swap devices, all loopback devices
and all DM devices. We probably could add a similar operation for MD
devices here too. But note that this loop is more of a last-resort
thing, and normally shouldn't do much.

>  - allow processes to be marked some way so they get SIGTERM but not
>    SIGKILL.  I'm happy adding magic char to argv[0].

Note that you can configure how you are killed relatively flexibly in
the service files and that the loop pointed out above is only this last
resort thing which is applied to all processes/mount points/... which
stick around after this normal shutdown.

Here's another attempt at explaining how this works:

<snip>
terminate_all_mount_and_service_units();
kill_all_remaining_processes();
do {
     umount_all_remaining_file_systems_we_can();
     read_only_mount_all_remaining_file_systems();
     detach_all_remaining_loop_devices();
     detach_all_remaining_swap_devices();
     detach_all_remaining_dm_devices();
} while (we_had_some_success_with_that());
jump_into_initrd();
</snip>

You have relatively flexible control of the first step in this code. The
second step is then the hammer that tries to fix up what the first step
didn't accomplish. My suggestion to check argv[0][0] was to avoid the
hammer.
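
The "repeat while we had some success" part of the pseudo-code can be
sketched generically as follows. This is an illustration only: the
function name is invented, and each step is assumed to be a callable
that returns True if it managed to unmount or detach something.

```python
from typing import Callable, Iterable


def run_until_no_progress(steps: Iterable[Callable[[], bool]]) -> int:
    """Run all steps repeatedly until a full round makes no progress.

    Like the do/while loop above, at least one round is always executed.
    Returns the number of rounds that were run.
    """
    steps = list(steps)
    rounds = 0
    while True:
        rounds += 1
        progress = False
        for step in steps:
            # Call every step each round; do not short-circuit.
            if step():
                progress = True
        if not progress:
            return rounds
```

The point of looping is that one step can unblock another: a loop
device can only be detached once the file system on top of it has been
unmounted, which may happen later in the same round.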

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-07 12:00                                         ` Lennart Poettering
@ 2011-11-07 19:09                                           ` Williams, Dan J
  2011-11-08 14:43                                             ` Lennart Poettering
  0 siblings, 1 reply; 50+ messages in thread
From: Williams, Dan J @ 2011-11-07 19:09 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Mon, Nov 7, 2011 at 4:00 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Mon, 07.11.11 13:52, NeilBrown (neilb@suse.de) wrote:
>
>> > Why doesn't the kernel do that on its own?
>>
>> Because the kernel doesn't know about the format of the metadata that
>> describes the array.
>
> Yupp, my suggestion would be to change that.

It's quite a bit of idiosyncratic code that needs to be duplicated in
kernel space and userspace (since userspace always needs to know how
to parse the metadata for array assembly).  All to record a dirty bit
that flips at most every 5 seconds, or a disk failure event which is
even less frequent.  Throw in policy constraints like restricting
which block devices can become part of the raid set.  Rinse and repeat
for every possible metadata format.

[..]
>> What exactly is "kill_all_processes()"?   is it SIGTERM or SIGKILL or both
>> with a gap or ???
>
> SIGTERM followed by SIGKILL after 5s if the programs do not react to
> that in time. But note that this logic only applies to processes which
> for some reason managed to escape systemd's usual cgroup-based killing
> logic. Normal services are hence already killed at that time, and only
> processes which moved themselves out of any cgroup or for which the
> service files disabled killing might survive to this point.

So I think mdmon should always try to escape cgroup-based killing.
It follows the lifespan of the array, and if the array is
not stopped by the cgroup exit (or the array lifespan is not
controlled in a service file), then mdmon must keep running.

[..]
>
> Here's another attempt in explaining how this works:
>
> <snip>
> terminate_all_mount_and_service_units();
> kill_all_remaining_processes();
> do {
>     umount_all_remaining_file_systems_we_can();
>     read_only_mount_all_remaining_file_systems();
>     detach_all_remaining_loop_devices();
>     detach_all_remaining_swap_devices();
>     detach_all_remaining_dm_devices();

So I've started putting together a md_detach_all() routine that will
attempt to stop all md devices (via sysfs), for the case where all
mdmon instances have escaped the initial killall thanks to the argv
'@' flagging.

Like the dm implementation it will address all but the root md device.
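
As a rough idea of what stopping arrays via sysfs could look like (a
sketch, not the routine being written for systemd): only the name
md_detach_all comes from this mail. The `skip` and `sysfs_block`
parameters are invented for the example, and it assumes that writing
"clear" to /sys/block/mdX/md/array_state stops an array, as described
in the kernel's md documentation.

```python
import glob
import os
from typing import Iterable, List


def md_detach_all(skip: Iterable[str] = (), sysfs_block: str = "/sys/block") -> List[str]:
    """Try to stop every md array except those in `skip`.

    Returns the names of the arrays for which the write succeeded.
    """
    skipped = set(skip)
    stopped = []
    pattern = os.path.join(sysfs_block, "md*", "md", "array_state")
    for state_path in sorted(glob.glob(pattern)):
        # .../<name>/md/array_state -> <name>
        name = os.path.basename(os.path.dirname(os.path.dirname(state_path)))
        if name in skipped:
            continue  # e.g. the array holding the rootfs
        try:
            with open(state_path, "w") as f:
                f.write("clear")
        except OSError:
            continue  # still in use (or not permitted): leave it alone
        stopped.append(name)
    return stopped
```

The root array would be passed in `skip` and left for the initramfs to
handle.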

> } while (we_had_some_success_with_that());
> jump_into_initrd();

The final act of the initramfs is then "mdadm --wait-clean --scan" to
communicate with the rootfs-blockdev-mdmon to be sure the metadata has
been marked clean.  All other mdmon instances should have exited
naturally when their md devices stopped, but the "--wait-clean --scan"
will have ensured shutdown can progress safely.

> You have relatively flexible control of the first step in this code. The
> second step is then the hammer that tries to fix up what this step
> didn't accomplish. My suggestion to check argv[0][0] was to avoid the
> hammer.

I notice that if the rootfs is on a dm or md device systemd/shutdown
will always fall through to ultimate_send_signal() which will not
discriminate against processes flagged with '@'.  Since we aren't
stopping the root md device I wonder if ultimate_send_signal() should
also ignore flagged processes, or whether the failure to stop the root
device is to be expected and let shutdown skip ultimate_send_signal()
if the only remaining work is shutting down the rootfs-blockdev.  I'm
leaning towards the latter.

--
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-02 13:32                                     ` Lennart Poettering
  2011-11-02 14:33                                       ` Kay Sievers
  2011-11-07  2:52                                       ` NeilBrown
@ 2011-11-08  0:11                                       ` Michal Soltys
  2011-11-08 16:46                                         ` Michal Soltys
  2 siblings, 1 reply; 50+ messages in thread
From: Michal Soltys @ 2011-11-08  0:11 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, Dan Williams, Andrey Borzenkov, Tomasz Torcz,
	systemd-devel, linux-raid

On 11-11-02 14:32, Lennart Poettering wrote:
> What we do right now is this:
>
> kill_all_processes();
> do {
>       umount_all_file_systems_we_can();
>       read_only_mount_all_remaining_file_systems();
> } while (we_had_some_success_with_that());
> jump_into_initrd();
>
> As long as mdmon references a file from the root disk we cannot umount
> it, so the loop wouldn't be effective.
>

I've peeked into systemd, and from what I can see, it /only/ jumps back
to the initramfs (prepare_new_root() and pivot_to_new_root()) if the
shutdown "binary" is present on the initramfs. And whether mdmon is
still running or not is not in any way a determinant of whether the
pivot_root(2) call succeeds (or ... ?).

If /run/initramfs/shutdown is not present, then systemd just does
things the old way as far as I can see - it doesn't even attempt to
pivot. And if it doesn't, then it can't umount the root (being itself
tied to it) ?

So essentially, if systemd execs /shutdown (after pivoting to
/run/initramfs) - then it's dracut's modules.d/99shutdown, which itself
sources hooks from other modules to do the rest of the cleaning job. And
that should take care of all the remaining stuff (including terminating
mdmon in a graceful way, and then umounting /oldroot). Either way - pretty
simple to add the necessary functionality to dracut.

So wouldn't simply a systemd cgroup named, say, "immortals", with mdmon
(by default) in it, suffice? Pivot back as usual, leave mdmon alive, let
dracut (or anything else used for the initramfs) do the rest of the job
(properly).


p.s.
Sorry if I missed something obvious, it was a quick and late peek over 
systemd's shutdown.c.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-07 19:09                                           ` Williams, Dan J
@ 2011-11-08 14:43                                             ` Lennart Poettering
  2011-11-08 23:27                                               ` Williams, Dan J
  0 siblings, 1 reply; 50+ messages in thread
From: Lennart Poettering @ 2011-11-08 14:43 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: NeilBrown, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Mon, 07.11.11 11:09, Williams, Dan J (dan.j.williams@intel.com) wrote:

> >> What exactly is "kill_all_processes()"?   is it SIGTERM or SIGKILL or both
> >> with a gap or ???
> >
> > SIGTERM followed by SIGKILL after 5s if the programs do not react to
> > that in time. But note that this logic only applies to processes which
> > for some reason managed to escape systemd's usual cgroup-based killing
> > logic. Normal services are hence already killed at that time, and only
> > processes which moved themselves out of any cgroup or for which the
> > service files disabled killing might survive to this point.
> 
> So I think mdmon should always try to escape itself from cgroup based
> killing.  It follows the lifespan of the array, and if the array is
> not stopped by the cgroup exit (or the array lifespan is not
> controlled in a service file), then mdmon must keep running.

Well, I think when it gets killed by the cgroup-based killer then it
should try to tear down its MD device.

In the mdmon service file use SendSIGKILL=no to disable sending of
SIGKILL after the initial SIGTERM. With KillSignal= you can choose the
signal you first want to be killed with, if you don't want it to be
SIGTERM. With KillMode= you can choose whether only the main process of
the service, all processes of the service, or no processes of the
service shall be killed. With TimeoutSec= you can set the timeout
between the SIGTERM and the SIGKILL. See systemd.service(5) for more
information.
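
Put together, such a unit might look roughly like the fragment below.
This is only an illustration of the four settings: the unit name, the
Type= and ExecStart= lines and the concrete values are assumptions,
not something any distribution is known to ship.

```ini
# mdmon.service (hypothetical unit name)
[Service]
# Assumed start-up details, shown only so the fragment is complete.
Type=forking
ExecStart=/sbin/mdmon md127
# Kill only the main process of the service.
KillMode=process
# First signal to send on stop.
KillSignal=SIGTERM
# Never follow up with SIGKILL.
SendSIGKILL=no
# Start/stop timeout in seconds.
TimeoutSec=30
```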

> > You have relatively flexible control of the first step in this code. The
> > second step is then the hammer that tries to fix up what this step
> > didn't accomplish. My suggestion to check argv[0][0] was to avoid the
> > hammer.
> 
> I notice that if the rootfs is on a dm or md device systemd/shutdown
> will always fall through to ultimate_send_signal() which will not
> discriminate against processes flagged with '@'.  Since we aren't
> stopping the root md device I wonder if ultimate_send_signal() should
> also ignore flagged processes, or whether the failure to stop the root
> device is to be expected and let shutdown skip ultimate_send_signal()
> if the only remaining work is shutting down the rootfs-blockdev.  I'm
> leaning towards the latter.

The idea was to skip processes flagged with '@' in both the
ultimate_send_signal() and send_signal() calls.
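
As a rough sketch of what such a check could look like, the C fragment below tests whether the first byte of a process's argv[0] is '@' by reading /proc/<pid>/cmdline. The helper names cmdline_is_flagged() and pid_is_flagged() are hypothetical, and reading /proc is an assumption about how a killing loop would obtain argv[0]; this is not taken from the systemd sources.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: true if the first byte of argv[0] is '@'. */
static bool cmdline_is_flagged(const char *cmdline, size_t len)
{
    return len > 0 && cmdline[0] == '@';
}

/*
 * Hypothetical helper: look at the first byte of /proc/<pid>/cmdline,
 * which is the first byte of argv[0]. Kernel threads have an empty
 * cmdline and are therefore never reported as flagged.
 */
static bool pid_is_flagged(pid_t pid)
{
    char path[64];
    char first = 0;
    size_t n;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);
    f = fopen(path, "r");
    if (!f)
        return false;           /* process gone or /proc unavailable */
    n = fread(&first, 1, 1, f);
    fclose(f);
    return cmdline_is_flagged(&first, n);
}
```

A loop that walks all PIDs and sends signals would then simply skip any pid for which pid_is_flagged(pid) returns true.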

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-08  0:11                                       ` Michal Soltys
@ 2011-11-08 16:46                                         ` Michal Soltys
  2011-11-08 20:32                                           ` Michal Soltys
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Soltys @ 2011-11-08 16:46 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, Dan Williams, Andrey Borzenkov, Tomasz Torcz,
	systemd-devel, linux-raid

On 11-11-08 01:11, Michal Soltys wrote:
>
> I've peeked into systemd, and from what I can see, it /only/ jumps back
> to initramfs (prepare_new_root() and pivot_to_new_root()) if the shutdown
> "binary" is present on initramfs. And whether mdmon is still running or
> not is not in any way a determinant for the pivot_root(2) call to succeed
> (or ... ?).
>
> If /run/initramfs/shutdown is not present, then systemd just does
> things the old way as far as I can see - it doesn't even attempt to
> pivot. And if it doesn't, then it can't umount the root (being itself
> tied to it) ?
>
> So essentially, if systemd execs /shutdown (after pivoting to
> /run/initramfs) - then it's dracut's modules.d/99shutdown, which itself
> sources hooks from other modules to do the rest of cleaning job. And
> that should take care of all the remaining stuff (including terminating
> mdmon in graceful way, and then umounting /oldroot). Either way - pretty
> simple to add the necessary functionality to dracut.
>
> So wouldn't simply a systemd's cgroup named say - immortals - with mdmon
> (by default) in it suffice ? Pivot back as usual, leave mdmon alive, let
> the dracut (or anything else used for initramfs) do the rest of the job
> (properly).

I did some testing today, and it's all working nicely as expected.
Actually I modified the "classic" rc scripts I'm using under sysinit to
perform a full umount/detach (using methods similar to systemd's), with
mdmon happily living through everything. The only things needed after
pivot_root were:

mdmon --takeover --all
telinit U

(so obviously my dracut image had mdmon, telinit and init, and a slightly
adjusted shutdown script).

Then everything from oldroot could be nicely and cleanly umounted.

It would be even more elegant if mdmon had an option such as:

--reroot <newroot>

to chroot() and reopen its files under <newroot>, and then systemd would 
call

mdmon --reroot /run/initramfs --all --takeover

after prepare_new_root() and before pivot_to_new_root().

Then even an existing initramfs image could (probably) be mdmon-agnostic.


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-08 16:46                                         ` Michal Soltys
@ 2011-11-08 20:32                                           ` Michal Soltys
  2011-11-08 22:29                                             ` Williams, Dan J
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Soltys @ 2011-11-08 20:32 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, Dan Williams, Andrey Borzenkov, Tomasz Torcz,
	systemd-devel, linux-raid

On 11-11-08 17:46, Michal Soltys wrote:
> Then even an existing initramfs image could (probably) be mdmon-agnostic.

Actually:

chroot /run/initramfs mdmon --takeover --all

worked just fine (after preparing new root - so after all mount --binds, 
and before pivot_root(8)).

So in the context of systemd rather than sysv scripts, a fork / chroot /
exec mdmon / wait sequence, instead of killing it, would do the trick,
followed by pivot_to_new_root().
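
A minimal C sketch of that fork / chroot / exec / wait sequence follows. The function name run_in_root() is hypothetical and this is not code from systemd; it only shows the order of the calls, assuming the caller has the privileges needed for chroot() and that the program to run exists inside the new root.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Forks, chroots the child into new_root, execs argv[0] there and waits
 * for it. Returns the child's exit status, or -1 on error. The child
 * exits with status 127 if chroot(), chdir() or exec fails.
 */
static int run_in_root(const char *new_root, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: switch root, then look the program up inside it. */
        if (chroot(new_root) < 0 || chdir("/") < 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: wait for the command to finish. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the command from above this would be invoked as run_in_root("/run/initramfs", argv) with argv set to { "mdmon", "--takeover", "--all", NULL }, after the new root has been prepared and before the pivot.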

Actually, anything that could benefit from "immortality" in one way or
another (perhaps udevd, so that e.g. lvm doesn't need --noudevsync when
taken over inside dracut's shutdown or anything similar after going back
to initramfs ?) and that can be pre-chrooted into /run/initramfs and
exec'ed should work just fine. For the record, udevd survived the pivot
properly.



* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-08 20:32                                           ` Michal Soltys
@ 2011-11-08 22:29                                             ` Williams, Dan J
  0 siblings, 0 replies; 50+ messages in thread
From: Williams, Dan J @ 2011-11-08 22:29 UTC (permalink / raw)
  To: Michal Soltys
  Cc: Lennart Poettering, NeilBrown, Andrey Borzenkov, Tomasz Torcz,
	systemd-devel, linux-raid

On Tue, Nov 8, 2011 at 12:32 PM, Michal Soltys <soltys@ziu.info> wrote:
> On 11-11-08 17:46, Michal Soltys wrote:
>>
>> Then even an existing initramfs image could (probably) be mdmon-agnostic.
>
> Actually:
>
> chroot /run/initramfs mdmon --takeover --all

One of the suggestions earlier in the thread is to not mess with takeover
at all.  The rootfs md device is always monitored by an mdmon instance
launched from /run/initramfs.  The only way to update mdmon is to
recreate the initramfs and reboot (which is similar to the experience
of trying to update raidfoo.ko for the rootfs blockdev).


* Re: [systemd-devel] systemd kills mdmon if it was started manually by user
  2011-11-08 14:43                                             ` Lennart Poettering
@ 2011-11-08 23:27                                               ` Williams, Dan J
  0 siblings, 0 replies; 50+ messages in thread
From: Williams, Dan J @ 2011-11-08 23:27 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: NeilBrown, Andrey Borzenkov, Tomasz Torcz, systemd-devel, linux-raid

On Tue, Nov 8, 2011 at 6:43 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Mon, 07.11.11 11:09, Williams, Dan J (dan.j.williams@intel.com) wrote:
>> So I think mdmon should always try to escape itself from cgroup based
>> killing.  It follows the lifespan of the array, and if the array is
>> not stopped by the cgroup exit (or the array lifespan is not
>> controlled in a service file), then mdmon must keep running.
>
> Well, I think when it gets killed by the cgroup-based killer then it
> should try to tear down its MD device.

We can easily fall off the complexity cliff trying to tear down the MD
device, because it can be pinned by a mounted filesystem or be
claimed anywhere in an arbitrary stack of DM or MD devices.  I did not
think cgroup exit would umount() filesystems?

[..]
>> I notice that if the rootfs is on a dm or md device systemd/shutdown
>> will always fall through to ultimate_send_signal() which will not
>> discriminate against processes flagged with '@'.  Since we aren't
>> stopping the root md device I wonder if ultimate_send_signal() should
>> also ignore flagged processes, or whether the failure to stop the root
>> device is to be expected and let shutdown skip ultimate_send_signal()
>> if the only remaining work is shutting down the rootfs-blockdev.  I'm
>> leaning towards the latter.
>
> The idea was to skip processes flagged with '@' in both the
> ultimate_send_signal() and send_signal() calls.

Ok, that makes it easier.

--
Dan


end of thread, other threads:[~2011-11-08 23:27 UTC | newest]

Thread overview: 50+ messages
2010-12-04  8:41 systemd kills mdmon if it was started manually by user Andrey Borzenkov
2010-12-04  9:12 ` Christian Parpart
2010-12-04 12:08   ` Andrey Borzenkov
2010-12-12 13:20     ` [systemd-devel] " Luca Berra
2011-01-07  0:40     ` Lennart Poettering
     [not found]     ` <20101204121413.GC11336@mother.pipebreaker.pl>
     [not found]       ` <AANLkTi=nTSdHc55f08G9sdEK6u8eXp276VOTHHr0jmXT@mail.gmail.com>
     [not found]         ` <20110125034434.GC7046@tango.0pointer.de>
     [not found]           ` <AANLkTik189VTXYpzLFqP=MNBg=Nx-Yq6BOKURtiby++B@mail.gmail.com>
     [not found]             ` <20110125042814.GA9727@tango.0pointer.de>
2011-02-04 19:55               ` Andrey Borzenkov
2011-02-08  9:48                 ` [systemd-devel] " Lennart Poettering
2011-02-08 10:52                   ` Andrey Borzenkov
2011-02-08 11:07                     ` Lennart Poettering
2011-02-08 13:54                       ` Andrey Borzenkov
2011-02-08 17:28                         ` [systemd-devel] " Lennart Poettering
2011-10-23  8:00                           ` Dan Williams
2011-10-24  8:04                             ` Thomas Jarosch
2011-10-25  1:40                             ` NeilBrown
2011-10-31 11:06                             ` Lennart Poettering
2011-10-31 11:15                               ` [systemd-devel] " Lennart Poettering
2011-11-02  0:44                               ` NeilBrown
2011-11-02  1:16                                 ` Lennart Poettering
2011-11-02  2:03                                   ` NeilBrown
2011-11-02 13:32                                     ` Lennart Poettering
2011-11-02 14:33                                       ` Kay Sievers
2011-11-02 15:17                                         ` Lennart Poettering
2011-11-02 15:21                                           ` Kay Sievers
2011-11-02 15:29                                             ` [systemd-devel] " Lennart Poettering
2011-11-02 22:18                                               ` Williams, Dan J
2011-11-02 23:39                                                 ` Lennart Poettering
2011-11-03  0:28                                                   ` Williams, Dan J
2011-11-02 17:21                                           ` Williams, Dan J
2011-11-02 23:35                                             ` Lennart Poettering
2011-11-02 18:16                                         ` Williams, Dan J
2011-11-02 18:49                                           ` Kay Sievers
2011-11-02 19:31                                             ` Williams, Dan J
2011-11-02 19:51                                               ` Kay Sievers
2011-11-07  2:52                                       ` NeilBrown
2011-11-07  3:42                                         ` Kay Sievers
2011-11-07  4:30                                           ` NeilBrown
2011-11-07 12:00                                         ` Lennart Poettering
2011-11-07 19:09                                           ` Williams, Dan J
2011-11-08 14:43                                             ` Lennart Poettering
2011-11-08 23:27                                               ` Williams, Dan J
2011-11-08  0:11                                       ` Michal Soltys
2011-11-08 16:46                                         ` Michal Soltys
2011-11-08 20:32                                           ` Michal Soltys
2011-11-08 22:29                                             ` Williams, Dan J
2011-02-09 14:01                       ` Lennart Poettering
2011-01-07  0:38 ` Lennart Poettering
2011-01-07  1:09   ` [systemd-devel] " Michael Biebl
2011-01-07  1:17     ` Roman Mamedov
2011-01-07  1:16   ` NeilBrown
2011-01-07  1:42     ` Lennart Poettering
