* [RFC PATCH 0/3] mdraid rootfs support
@ 2009-02-05 22:49 Dan Williams
       [not found] ` <20090205224808.18610.14957.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2009-02-05 22:49 UTC (permalink / raw)
  To: initramfs-u79uwXL29TY76Z2rM5mHXA
  Cc: neilb-l3A5Bk7waGM, jacek.danecki-ral2JQCrhuEAvxtiuMwx3w

This series is a first take at dracut support for an mdraid rootfs.  It
includes considerations for the new external metadata formats supported in the
latest development branch of mdadm:

	git://neil.brown.name/mdadm devel-3.0

This is an RFC because it is not clear to me that a single call to "udevadm
settle" is enough to guarantee discovery of all storage devices ahead of raid
assembly.  A cursory test with four AHCI-attached drives was successful.

Regards,
Dan

---

Dan Williams (3):
      add more disk id helpers to udevexe
      raid: external and internal metadata support
      gen-mod-lists: create lists of modules that may talk to a root device


 dracut        |   13 ++++++++++---
 gen-mod-lists |   34 ++++++++++++++++++++++++++++++++++
 init          |   10 ++++++++++
 3 files changed, 54 insertions(+), 3 deletions(-)
 create mode 100755 gen-mod-lists
--
To unsubscribe from this list: send the line "unsubscribe initramfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* [RFC PATCH 1/3] gen-mod-lists: create lists of modules that may talk to a root device
       [not found] ` <20090205224808.18610.14957.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
@ 2009-02-05 22:49   ` Dan Williams
  2009-02-05 22:49   ` [RFC PATCH 2/3] raid: external and internal metadata support Dan Williams
  2009-02-05 22:49   ` [RFC PATCH 3/3] add more disk id helpers to udevexe Dan Williams
  2 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2009-02-05 22:49 UTC (permalink / raw)
  To: initramfs-u79uwXL29TY76Z2rM5mHXA
  Cc: neilb-l3A5Bk7waGM, jacek.danecki-ral2JQCrhuEAvxtiuMwx3w

notting-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org:
	The idea is that we don't want to include every single module,
	but we want to include every module that might define a block
	device to boot from, or a network device to network boot from.
	Having it in the upstream kernel would be helpful, although how
	it's generated now is obviously a hack.

	Doing it at runtime in dracut would work, but would be obviously
	slow.

This is a temporary hack to duplicate this functionality from the
Fedora kernel srpm in dracut.

Also added "raid" modules.

Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 gen-mod-lists |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+), 0 deletions(-)
 create mode 100755 gen-mod-lists

diff --git a/gen-mod-lists b/gen-mod-lists
new file mode 100755
index 0000000..13999d7
--- /dev/null
+++ b/gen-mod-lists
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+# Copied from kernel.spec (kernel-2.6.27.12-78.2.8.fc9.src.rpm)
+# Creates /lib/modules/$KernelVer/modules.{block,networking,raid}
+
+KernelVer=$1
+[ -z "$KernelVer" ] && KernelVer=$(uname -r)
+
+if [ ! -d /lib/modules/$KernelVer ]; then
+	echo "error: could not find /lib/modules/$KernelVer"
+	exit 1
+fi
+
+find /lib/modules/$KernelVer -name "*.ko" -type f >modnames
+
+# Generate a list of modules for block and networking.
+
+fgrep /drivers/ modnames | xargs --no-run-if-empty nm -upA |
+sed -n 's,^.*/\([^/]*\.ko\):  *U \(.*\)$,\1 \2,p' > drivers.undef
+
+collect_modules_list()
+{
+  sed -r -n -e "s/^([^ ]+) \\.?($2)\$/\\1/p" drivers.undef |
+  LC_ALL=C sort -u > /lib/modules/$KernelVer/modules.$1
+}
+
+collect_modules_list networking \
+                     'register_netdev|ieee80211_register_hw|usbnet_probe'
+collect_modules_list block \
+                     'ata_scsi_ioctl|scsi_add_host|blk_init_queue|register_mtd_blktrans|scsi_esp_register'
+
+# mdraid modules, could be made part of 'block'
+collect_modules_list raid \
+                     'register_md_personality'
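
The matching step in collect_modules_list can be tried without touching
/lib/modules.  The following is a minimal sketch, assuming a hand-written
drivers.undef: the module/symbol pairs are made-up sample data and
match_modules is a hypothetical stand-in that prints to stdout instead of
writing modules.$1.  The sed expression itself is the one from the patch.

```shell
#!/bin/bash
# Hypothetical sample input in the "module.ko symbol" format that the
# nm | sed pipeline writes to drivers.undef.
workdir=$(mktemp -d)
cat > "$workdir/drivers.undef" <<'EOF'
ahci.ko scsi_add_host
ahci.ko printk
e1000.ko register_netdev
raid1.ko register_md_personality
raid456.ko register_md_personality
raid456.ko printk
EOF

# Same sed expression as collect_modules_list, but the symbol pattern is $1
# and the result goes to stdout (illustration only).
match_modules()
{
  sed -r -n -e "s/^([^ ]+) \\.?($1)\$/\\1/p" "$workdir/drivers.undef" |
  LC_ALL=C sort -u
}

raid_list=$(match_modules 'register_md_personality')
block_list=$(match_modules 'ata_scsi_ioctl|scsi_add_host|blk_init_queue')
echo "raid:  $raid_list"
echo "block: $block_list"
rm -rf "$workdir"
```

A module is listed once for each class whose registration symbol it leaves
undefined, so modules that only import unrelated symbols such as printk
never appear.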

* [RFC PATCH 2/3] raid: external and internal metadata support
       [not found] ` <20090205224808.18610.14957.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
  2009-02-05 22:49   ` [RFC PATCH 1/3] gen-mod-lists: create lists of modules that may talk to a root device Dan Williams
@ 2009-02-05 22:49   ` Dan Williams
       [not found]     ` <20090205224920.18610.63979.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
  2009-02-05 22:49   ` [RFC PATCH 3/3] add more disk id helpers to udevexe Dan Williams
  2 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2009-02-05 22:49 UTC (permalink / raw)
  To: initramfs-u79uwXL29TY76Z2rM5mHXA
  Cc: neilb-l3A5Bk7waGM, jacek.danecki-ral2JQCrhuEAvxtiuMwx3w

External metadata support implies that metadata events are handled by a
userspace daemon.  This daemon, mdmon, needs to be started ahead of the
rootfs being mounted to handle the raid volume dirty bit.  Even if the
rootfs is mounted read-only the rootdev may still be written by
filesystem journal-playback operations.

After the rootfs is mounted to /sysroot, mdmon is restarted in the new
namespace.  The command "mdmon /proc/mdstat /sysroot" tells mdmon to
terminate any instances in the current namespace and then launch one new
instance, chroot(2)ed to /sysroot, per container device found in
/proc/mdstat.

Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 dracut |   11 +++++++++--
 init   |   10 ++++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/dracut b/dracut
index ef2ca42..513b6c6 100755
--- a/dracut
+++ b/dracut
@@ -69,13 +69,14 @@ initdir=$(mktemp -d -t initramfs.XXXXXX)
 exe="/bin/bash /bin/mount /bin/mknod /bin/mkdir /sbin/modprobe /sbin/udevd /sbin/udevadm /sbin/nash /bin/kill /sbin/pidof /bin/sleep /bin/echo /usr/sbin/chroot"
 lvmexe="/sbin/lvm"
 cryptexe="/sbin/cryptsetup"
+raidexe="/sbin/mdadm /sbin/mdmon"
 # and some things that are nice for debugging
 debugexe="/bin/ls /bin/cat /bin/ln /bin/ps /bin/grep /bin/more"
 # udev things we care about
 udevexe="/lib/udev/vol_id /lib/udev/console_init"
 
 # install base files
-for binary in $exe $debugexe $udevexe $lvmexe $cryptexe ; do
+for binary in $exe $debugexe $udevexe $lvmexe $cryptexe $raidexe ; do
   inst $binary $initdir
 done
 
@@ -152,7 +153,7 @@ cp $switchroot $initdir/sbin/switch_root
 mkdir -p $initdir/etc $initdir/proc $initdir/sys $initdir/sysroot $initdir/dev/pts
 
 # FIXME: hard-coded module list of doom.
-[ -z "$modules" ] && modules="=ata =block =drm dm-crypt aes sha256 cbc"
+[ -z "$modules" ] && modules="=ata =block =drm =raid dm-crypt aes sha256 cbc"
 
 mkdir -p $initdir/lib/modules/$kernel
 # expand out module deps, etc
@@ -171,6 +172,12 @@ if [ -x /usr/libexec/plymouth/plymouth-populate-initrd ]; then
     /usr/libexec/plymouth/plymouth-populate-initrd -t "$initdir" || :
 fi
 
+# raid
+# mdadm.conf allows mdadm to disambiguate foreign arrays for some metadata types
+# check /etc and /etc/mdadm (/etc wins if both are present)
+[ -f /etc/mdadm/mdadm.conf ] && inst /etc/mdadm/mdadm.conf "$initdir" /etc/mdadm.conf
+[ -f /etc/mdadm.conf ] && inst /etc/mdadm.conf "$initdir"
+
 pushd $initdir >/dev/null
 find . |cpio -H newc -o |gzip -9 > $outfile
 popd >/dev/null
diff --git a/init b/init
index 706127f..0294502 100755
--- a/init
+++ b/init
@@ -46,6 +46,13 @@ mknod /dev/tty1 c 4 1
 /sbin/udevd --daemon
 /sbin/udevadm trigger
 
+# start any defined raid arrays
+# we settle before assembling to hopefully prevent prematurely degrading arrays
+if [ -f /etc/mdadm.conf ]; then
+  /sbin/udevadm settle
+  /sbin/mdadm -Asc /etc/mdadm.conf
+fi
+
 # mount the rootfs
 NEWROOT="/sysroot"
 
@@ -110,6 +117,9 @@ kill `pidof udevd`
 
 [ -x /bin/plymouth ] && /bin/plymouth --newroot=$NEWROOT
 
+# switch any mdmon instances to newroot
+[ -f /etc/mdadm.conf ] && /sbin/mdmon /proc/mdstat $NEWROOT
+
 # FIXME: nash die die die
 exec /sbin/switch_root
 # davej doesn't like initrd bugs
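
The comment in the dracut hunk says /etc wins if both config files are
present.  That precedence can also be written out explicitly, as in this
minimal sketch.  pick_mdadm_conf is a hypothetical helper, not part of
dracut, and the demo runs against a throwaway directory rather than the
real /etc.

```shell
#!/bin/bash
# Hypothetical helper: print the mdadm.conf to install from under root $1,
# preferring <root>/etc/mdadm.conf over <root>/etc/mdadm/mdadm.conf.
pick_mdadm_conf()
{
  local root=$1 candidate
  for candidate in "$root/etc/mdadm.conf" "$root/etc/mdadm/mdadm.conf"; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Demo tree: first only the /etc/mdadm copy exists, then both.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/etc/mdadm"
touch "$demo_root/etc/mdadm/mdadm.conf"
only_subdir=$(pick_mdadm_conf "$demo_root")
touch "$demo_root/etc/mdadm.conf"
both_present=$(pick_mdadm_conf "$demo_root")
echo "subdir only:  ${only_subdir#"$demo_root"}"
echo "both present: ${both_present#"$demo_root"}"
rm -rf "$demo_root"
```

Selecting the source first means the result does not depend on whether a
second install call overwrites an existing file.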

* [RFC PATCH 3/3] add more disk id helpers to udevexe
       [not found] ` <20090205224808.18610.14957.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
  2009-02-05 22:49   ` [RFC PATCH 1/3] gen-mod-lists: create lists of modules that may talk to a root device Dan Williams
  2009-02-05 22:49   ` [RFC PATCH 2/3] raid: external and internal metadata support Dan Williams
@ 2009-02-05 22:49   ` Dan Williams
  2 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2009-02-05 22:49 UTC (permalink / raw)
  To: initramfs-u79uwXL29TY76Z2rM5mHXA
  Cc: neilb-l3A5Bk7waGM, jacek.danecki-ral2JQCrhuEAvxtiuMwx3w

Allow udev to create /dev/disk/by-id links

Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 dracut |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/dracut b/dracut
index 513b6c6..da86a23 100755
--- a/dracut
+++ b/dracut
@@ -73,7 +73,7 @@ raidexe="/sbin/mdadm /sbin/mdmon"
 # and some things that are nice for debugging
 debugexe="/bin/ls /bin/cat /bin/ln /bin/ps /bin/grep /bin/more"
 # udev things we care about
-udevexe="/lib/udev/vol_id /lib/udev/console_init"
+udevexe="/lib/udev/*_id /lib/udev/console_init"
 
 # install base files
 for binary in $exe $debugexe $udevexe $lvmexe $cryptexe $raidexe ; do
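
The one-line change relies on the install loop expanding $udevexe
unquoted, so that the *_id pattern is globbed at loop time.  A minimal
sketch of that behaviour, using a temporary directory with made-up file
names in place of /lib/udev:

```shell
#!/bin/bash
# Stand-in for /lib/udev; the file names are sample data for illustration.
demo_dir=$(mktemp -d)
touch "$demo_dir/vol_id" "$demo_dir/ata_id" "$demo_dir/scsi_id" \
      "$demo_dir/console_init" "$demo_dir/udev_helper"

# Inside double quotes the pattern is stored literally; nothing expands yet.
udevexe="$demo_dir/*_id $demo_dir/console_init"

# Unquoted $udevexe is word-split and then pathname-expanded, so the loop
# sees every helper whose name ends in _id, followed by console_init.
expanded=""
for binary in $udevexe; do
  expanded="$expanded ${binary##*/}"
done
echo "would install:$expanded"
rm -rf "$demo_dir"
```

With bash defaults, a pattern that matches nothing is passed through
literally, so a system without any *_id helpers would hand the raw
pattern string to inst.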

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]     ` <20090205224920.18610.63979.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
@ 2009-02-06 16:40       ` Jeremy Katz
       [not found]         ` <20090206164019.GD552-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Katz @ 2009-02-06 16:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: initramfs-u79uwXL29TY76Z2rM5mHXA, neilb-l3A5Bk7waGM,
	jacek.danecki-ral2JQCrhuEAvxtiuMwx3w

On Thursday, February 05 2009, Dan Williams said:
> index 706127f..0294502 100755
> --- a/init
> +++ b/init
> @@ -46,6 +46,13 @@ mknod /dev/tty1 c 4 1
>  /sbin/udevd --daemon
>  /sbin/udevadm trigger
>  
> +# start any defined raid arrays
> +# we settle before assembling to hopefully prevent prematurely degrading arrays
> +if [ -f /etc/mdadm.conf ]; then
> +  /sbin/udevadm settle
> +  /sbin/mdadm -Asc /etc/mdadm.conf
> +fi
> +

RAID arrays should be getting started by udev rules, not by explicit
calls to mdadm in /init.  Yes, this means having proper integration with
udev for your kernel pieces.  But this ends up helping everything as it
will also let us lose the multiple redundant calls to things like mdadm
(and lvm, etc) throughout the boot process which should just be
occurring as devices show up.

> +# switch any mdmon instances to newroot
> +[ -f /etc/mdadm.conf ] && /sbin/mdmon /proc/mdstat $NEWROOT
> +

Is there a real need for mdmon to start prior to being in the real
rootfs?

Jeremy

* RE: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]         ` <20090206164019.GD552-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-02-06 16:50           ` Danecki, Jacek
       [not found]             ` <A9DE54D0CD747C4CB06DCE5B6FA2246F4B496AFA-IGOiFh9zz4yvNW/NfzhIbrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2009-02-06 18:02           ` Dan Williams
  1 sibling, 1 reply; 24+ messages in thread
From: Danecki, Jacek @ 2009-02-06 16:50 UTC (permalink / raw)
  To: Jeremy Katz, Williams, Dan J
  Cc: initramfs-u79uwXL29TY76Z2rM5mHXA, neilb-l3A5Bk7waGM


> Is there a real need for mdmon to start prior to being in the real
> rootfs?

mdmon is needed to change raid array to RW mode, so as long as rootfs is mounted RO, mdmon can be started in real rootfs.

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]             ` <A9DE54D0CD747C4CB06DCE5B6FA2246F4B496AFA-IGOiFh9zz4yvNW/NfzhIbrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2009-02-06 16:55               ` Dan Williams
  2009-02-06 16:56               ` Bill Nottingham
  1 sibling, 0 replies; 24+ messages in thread
From: Dan Williams @ 2009-02-06 16:55 UTC (permalink / raw)
  To: Danecki, Jacek
  Cc: Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA, neilb-l3A5Bk7waGM

On Fri, Feb 6, 2009 at 9:50 AM, Danecki, Jacek <jacek.danecki-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
>> Is there a real need for mdmon to start prior to being in the real
>> rootfs?
>
> mdmon is needed to change raid array to RW mode, so as long as rootfs is mounted RO, mdmon can be started in real rootfs.

No, that is what I originally thought until I tried to mount an xfs
filesystem that had been uncleanly shutdown.  Even if the rootfs is
mounted read-only the backing device needs to be read-write to recover
the journal.

--
Dan

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]             ` <A9DE54D0CD747C4CB06DCE5B6FA2246F4B496AFA-IGOiFh9zz4yvNW/NfzhIbrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2009-02-06 16:55               ` Dan Williams
@ 2009-02-06 16:56               ` Bill Nottingham
       [not found]                 ` <20090206165601.GF11144-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
  1 sibling, 1 reply; 24+ messages in thread
From: Bill Nottingham @ 2009-02-06 16:56 UTC (permalink / raw)
  To: Danecki, Jacek
  Cc: Jeremy Katz, Williams, Dan J, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

Danecki, Jacek (jacek.danecki-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said: 
> > Is there a real need for mdmon to start prior to being in the real
> > rootfs?
> 
> mdmon is needed to change raid array to RW mode, so as long as rootfs is mounted RO, mdmon can be started in real rootfs.

So, for one particular specific type of block device, you need
a daemon to switch it writable. Every other type of block device
can handle this without separate tooling. I'm not seeing how this
is an improvement.

Bill

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                 ` <20090206165601.GF11144-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
@ 2009-02-06 17:27                   ` Dan Williams
       [not found]                     ` <e9c3a7c20902060927j2b900940kd851573469110135-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2009-02-06 17:27 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

On Fri, Feb 6, 2009 at 9:56 AM, Bill Nottingham <notting-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> So, for one particular specific type of block device, you need
> a daemon to switch it writable. Every other type of block device
> can handle this without separate tooling. I'm not seeing how this
> is an improvement.
>

It is not just setting writable, mdmon is also there to clear the bit
when writes have quiesced.  Raid devices have always been special in
that they need to manage a dirty bit in their metadata to determine if
a resync needs to be performed after a dirty shutdown.  With hardware
raid or pure kernel (MD metadata) raid this mechanism is hidden.

External metadata raid is akin to fuse filesystems.  The kernel
provides the generic infrastructure and a userspace daemon handles the
implementation details.

The improvement is that with one kernel implementation we can support
any number of metadata formats.

--
Dan

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                     ` <e9c3a7c20902060927j2b900940kd851573469110135-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-02-06 17:38                       ` Bill Nottingham
       [not found]                         ` <20090206173814.GA3541-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Nottingham @ 2009-02-06 17:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said: 
> It is not just setting writable, mdmon is also there to clear the bit
> when writes have quiesced.

Let me just see if I understand this infrastructure correctly.

- device is set writable
- kernel tells userspace
- userspace frobs bit in superblock to say 'I want to be dirty!'
- userspace tells kernel
- kernel writes bit to disk
... stuff happens ...
- userspace tells kernel to unmount, or remount R/O
- kernel tells userspace "hey, i unmounted this"
(userspace freaks out because the filesystem the daemon is running on
 just went away)
- userspace frobs bit in superblock to say 'This array is CLEAN!'
- userspace tells kernel
- kernel writes bit to disk

Is that really how it's supposed to work?

So, why isn't the ext* journal or filesystem unclean flag
handled via a userspace file monitoring daemon, then?

Bill

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                         ` <20090206173814.GA3541-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
@ 2009-02-06 18:00                           ` Jacek Danecki
       [not found]                             ` <498C7AD8.6080105-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2009-02-06 18:12                           ` Dan Williams
  1 sibling, 1 reply; 24+ messages in thread
From: Jacek Danecki @ 2009-02-06 18:00 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Williams, Dan J, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

Bill Nottingham wrote:
> 
> So, why isn't the ext* journal or filesystem unclean flag
> handled via a userspace file monitoring daemon, then?

Dan, Neil

Are there any plans to rewrite mdmon in kernel space?

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]         ` <20090206164019.GD552-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-02-06 16:50           ` Danecki, Jacek
@ 2009-02-06 18:02           ` Dan Williams
  1 sibling, 0 replies; 24+ messages in thread
From: Dan Williams @ 2009-02-06 18:02 UTC (permalink / raw)
  To: Jeremy Katz
  Cc: initramfs-u79uwXL29TY76Z2rM5mHXA, neilb-l3A5Bk7waGM,
	jacek.danecki-ral2JQCrhuEAvxtiuMwx3w, Kay Sievers

On Fri, Feb 6, 2009 at 9:40 AM, Jeremy Katz <katzj-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> RAID arrays should be getting started by udev rules, not by explicit
> calls to mdadm in /init.  Yes, this means having proper integration with
> udev for your kernel pieces.  But this ends up helping everything as it
> will also let us lose the multiple redundant calls to things like mdadm
> (and lvm, etc) throughout the boot process which should just be
> occurring as devices show up.

The trick is determining when a device has not shown up yet versus it
will never show up... to prevent the array being marked degraded
prematurely.  Is there some mechanism for udev to broadcast "if you
were waiting for more devices to show up don't hold your breath"?
I.e. at the point where a call to "udevadm settle" would reasonably be
expected to not find any pending events?

I am thinking something along the lines of:
<udev: add disk>
mdadm --incremental --no-degraded $dev
<udev: add disk>
mdadm --incremental --no-degraded $dev
<udev: probably no more devices>
mdadm --incremental $last_dev
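
The flow above can be sketched as a dry run.  Assumptions: the function
names and the 30 second timeout are made up for illustration; plain
"mdadm --incremental <dev>" is used for the per-device step because, as
far as I read the mdadm manual, incremental mode without --run already
holds off starting an array until all expected members are present; and
"mdadm --incremental --run --scan" is used as the final "stop waiting"
step.  Commands are printed rather than executed unless DRY_RUN=0.

```shell
#!/bin/bash
# Dry-run wrapper: print the command by default, execute it when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run()
{
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Hypothetical hook: called once per block device "add" event.
on_disk_added()
{
  run /sbin/mdadm --incremental "$1"
}

# Hypothetical hook: called once when no more devices are expected.
# Starts whatever is still only partially assembled, possibly degraded.
on_discovery_done()
{
  run /sbin/udevadm settle --timeout=30
  run /sbin/mdadm --incremental --run --scan
}

plan=$(
  for dev in /dev/sda /dev/sdb; do
    on_disk_added "$dev"
  done
  on_discovery_done
)
echo "$plan"
```

The open question from the mail remains: something still has to decide
when on_discovery_done may fire, and a settled udev queue is only a hint
that no further disks will appear.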

--
Dan

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                         ` <20090206173814.GA3541-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
  2009-02-06 18:00                           ` Jacek Danecki
@ 2009-02-06 18:12                           ` Dan Williams
       [not found]                             ` <e9c3a7c20902061012w15a31e7br6ce2074b7b9db555-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-02-08 19:16                             ` Szabolcs Szakacsits
  1 sibling, 2 replies; 24+ messages in thread
From: Dan Williams @ 2009-02-06 18:12 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

On Fri, Feb 6, 2009 at 10:38 AM, Bill Nottingham <notting-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said:
>> It is not just setting writable, mdmon is also there to clear the bit
>> when writes have quiesced.
>
> Let me just see if I understand this infrastructure correctly.
>
> - device is set writable
> - kernel tells userspace
...tells userspace that we want to transition the array from clean to dirty, yes

> - userspace frobs bit in superblock to say 'I want to be dirty!'
yes

> - userspace tells kernel
...yup array is dirty, start writing.

> - kernel writes bit to disk
> ... stuff happens ...
> - userspace tells kernel to unmount, or remount R/O
> - kernel tells userspace "hey, i unmounted this"
> (userspace freaks out because the filesystem the daemon is running on
>  just went away)
mdmon does not know or care if the *filesystem* is read-only.  It is
reading and writing /proc, /sys, and the raw disk devices.

> - userspace frobs bit in superblock to say 'This array is CLEAN!'
...not in this scenario no.

> - userspace tells kernel
> - kernel writes bit to disk
>
> Is that really how it's supposed to work?
You lost me at userspace freaks out, but that is the general flow.

> So, why isn't the ext* journal or filesystem unclean flag
> handled via a userspace file monitoring daemon, then?
I'm not trying to be obtuse, but because it isn't.  Put another way,
consider what extra tools the initramfs would need if we wanted to
support an ntfs-3g rootfs.

--
Dan

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                             ` <e9c3a7c20902061012w15a31e7br6ce2074b7b9db555-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-02-06 18:21                               ` Bill Nottingham
       [not found]                                 ` <20090206182118.GA4413-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Nottingham @ 2009-02-06 18:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said: 
> > - kernel tells userspace "hey, i unmounted this"
> > (userspace freaks out because the filesystem the daemon is running on
> >  just went away)
> mdmon does not know or care if the *filesystem* is read-only.  It is
> reading and writing /proc, /sys, and the raw disk devices.

Your daemon has to be running from somewhere. That tends to be
reduced to the initramfs that you've already deleted and switchrooted
away from (in which case, good luck on upgrades of your userspace tools -
you're stuck with the version in the initramfs you booted from, even
if your later userspace tools end up using some later protocol).
You can't run it from the rootfs, because that's a chicken/egg
scenario (or you'll switch to it, and then be unable to mark it
clean, because you can't mark the array r/o, because the filesystem
is r/w, which you can't undo because the daemon is running on
it...)

Long-lived daemons running from the initramfs aren't really good.
We don't run udev that way.

> > - userspace frobs bit in superblock to say 'This array is CLEAN!'
> ...not in this scenario no.

Then when is the clean bit set?

> > So, why isn't the ext* journal or filesystem unclean flag
> > handled via a userspace file monitoring daemon, then?
>
> I'm not trying to be obtuse, but because it isn't.  Put another way,
> consider what extra tools the initramfs would need if we wanted to
> support an ntfs-3g rootfs.

You're asking for *the exact same thing*... just RAID specific.

Bill

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                 ` <20090206182118.GA4413-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
@ 2009-02-06 19:19                                   ` Dan Williams
       [not found]                                     ` <e9c3a7c20902061119i2120cc5fpda0a5cdc3aedc17b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2009-02-06 19:19 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

On Fri, Feb 6, 2009 at 11:21 AM, Bill Nottingham <notting-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said:
>> > - kernel tells userspace "hey, i unmounted this"
>> > (userspace freaks out because the filesystem the daemon is running on
>> >  just went away)
>> mdmon does not know or care if the *filesystem* is read-only.  It is
>> reading and writing /proc, /sys, and the raw disk devices.
>
> Your daemon has to be running from somewhere. That tends to be
> reduced to the initramfs that you've already deleted and switchrooted
> away from (in which case, good luck on upgrades of your userspace tools -
> you're stuck with the version in the initramfs you booted from, even
> if your later userspace tools end up using some later protocol).

Actually no, you're not necessarily stuck with the mdmon from boot.  In
a pinch you could "mdmon /proc/mdstat /".  Worst case you need to
re-dracut and reboot, but that is already more flexible than the
metadata handled in kernel-space approach.

> You can't run it from the rootfs, because that's a chicken/egg
> scenario (or you'll switch to it, and then be unable to mark it
> clean, because you can't mark the array r/o, because the filesystem
> is r/w, which you can't undo because the daemon is running on
> it...)

Array r/o is a separate issue from the raid metadata clean bit, see below.

> Long-lived daemons running from the initramfs aren't really good.
I agree, and I initially looked for ways to wait until the rootfs was
available before launching mdmon... then I hit the xfs journal
recovery case.

> We don't run udev that way.
At first glance it looks like plymouth is run this way, but I am
probably mistaken.  From dracut/init:
[ -x /bin/plymouth ] && /bin/plymouth --newroot=$NEWROOT

One might say "just set the dirty bit, terminate, and wait for the
mdmon in the rootfs to take over".  The problem is that a disk could
fail in this window, and this event needs to be handled before the
kernel does anything else to the array.

>
>> > - userspace frobs bit in superblock to say 'This array is CLEAN!'
>> ...not in this scenario no.
>
> Then when is the clean bit set?
The clean bit can be set as soon as the parity data is in sync with
the data on the other drives.  We typically wait for some period of
write-inactivity to avoid needlessly touching the metadata after every
write.

>> > So, why isn't the ext* journal or filesystem unclean flag
>> > handled via a userspace file monitoring daemon, then?
>>
>> I'm not trying to be obtuse, but because it isn't.  Put another way,
>> consider what extra tools the initramfs would need if we wanted to
>> support an ntfs-3g rootfs.
>
> You're asking for *the exact same thing*... just RAID specific.
>

The key difference being that there are performance reasons for
handling filesystem metadata in the kernel.  Raid metadata events are
always in the slow path.

--
Dan

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                             ` <498C7AD8.6080105-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2009-02-06 19:34                               ` NeilBrown
       [not found]                                 ` <2c0cae741a7229789cd777d93180072a.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2009-02-06 19:34 UTC (permalink / raw)
  To: Jacek Danecki
  Cc: Bill Nottingham, Williams, Dan J, Jeremy Katz,
	initramfs-u79uwXL29TY76Z2rM5mHXA

On Sat, February 7, 2009 5:00 am, Jacek Danecki wrote:
> Bill Nottingham wrote:
>>
>> So, why isn't the ext* journal or filesystem unclean flag
>> handled via a userspace file monitoring daemon, then?
>
> Dan, Neil
>
> Are there any plans to rewrite mdmon in kernel space?
>

Definitely not.

There is more to this than the 'unclean' flag.
The really important task for mdmon (which hopefully it never has
to perform...) is to record device failures (which is what RAID is
really all about).

If a device fails while trying to write to it, we cannot allow that
write to complete until the other devices have had that device failure
recorded on them.  Otherwise, following an unclean shutdown we might trust
the data that is on that drive, which is now out-of-date.

The task of mdmon is to discover when there has been a write error,
record the device failure in the metadata, then allow the write
to complete.
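
To make that ordering concrete, here is a minimal sketch of it. It assumes a simplified model in which one function plays both the kernel and mdmon; the Member class, write_stripe(), and the event strings are hypothetical names for this example and do not reflect how the kernel and mdmon actually communicate.

```python
class Member:
    """One member disk of a toy array."""

    def __init__(self, name):
        self.name = name
        self.working = True               # set False to simulate a failing disk
        self.failed_in_metadata = set()   # failures recorded on this member


def write_stripe(members, log):
    """Report the write complete only after every failure is recorded."""
    failed = [m for m in members if not m.working]
    survivors = [m for m in members if m.working]
    for bad in failed:
        log.append("write error on " + bad.name)
        # Record the failure on every surviving member first ...
        for m in survivors:
            m.failed_in_metadata.add(bad.name)
        log.append("failure of " + bad.name + " recorded")
    # ... and only then let the write complete.
    log.append("write complete")


disks = [Member("sda"), Member("sdb"), Member("sdc")]
disks[1].working = False
events = []
write_stripe(disks, events)
```

Because the failure is recorded on the survivors before "write complete" is logged, an unclean shutdown at any point after completion cannot leave the stale member looking trustworthy.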

It has a number of other tasks as well, but that is the important
one which means that it must always be running when the array is
writable.

NeilBrown


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                 ` <2c0cae741a7229789cd777d93180072a.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
@ 2009-02-06 20:03                                   ` Bill Nottingham
  2009-02-08 19:08                                     ` Szabolcs Szakacsits
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Nottingham @ 2009-02-06 20:03 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jacek Danecki, Williams, Dan J, Jeremy Katz,
	initramfs-u79uwXL29TY76Z2rM5mHXA

NeilBrown (neilb-l3A5Bk7waGM@public.gmane.org) said: 
> There is more to this than the 'unclean' flag.
> The really important task for mdmon (which hopefully it never has
> to perform...) is to record device failures (which is what RAID is
> really all about).
> 
> If a device fails while trying to write to it, we cannot allow that
> write to complete until the other devices have had that device failure
> recorded on them.  Otherwise, following an unclean shutdown we might trust
> the data that is on that drive, which is now out-of-date.

OK, so:

1) kernel sends write request. If error....
2) <some error occurs>
3) kernel sends error to userspace
4) mdmon wakes up
5) mdmon decides where to record this
6) mdmon writes to super blocks
7) go to step one, hope you don't hit step 2 this time

This now means that reliable suspend and resume is completely
impossible on RAID devices, just as it is on FUSE. You can't have
waking up userspace be part of your write and sync process -
you've just deadlocked at step 3/4.

Unless I've missed something here?

Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                     ` <e9c3a7c20902061119i2120cc5fpda0a5cdc3aedc17b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-02-06 20:08                                       ` Bill Nottingham
       [not found]                                         ` <20090206200818.GC6150-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Nottingham @ 2009-02-06 20:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said: 
> Actually no, you're not necessarily stuck with the mdmon from boot.  In
> a pinch you could "mdmon /proc/mdstat /".

Not really.

You state:

> One might say "just set the dirty bit, terminate, and wait for the
> mdmon in the rootfs to take over".  The problem is that a disk could
> fail in this window, and this event needs to be handled before the
> kernel does anything else to the array.
...
> The clean bit can be set as soon as the parity data is in sync with
> the data on the other drives.  We typically wait for some period of
> write-inactivity to avoid needlessly touching the metadata after every
> write.

You shut down the machine. After a while, you get to the point where
you're getting ready to unmount the filesystem. Since mdmon's running
on it (if you started it post boot), you have to kill it. After that
point, there are going to be writes (a final sync, if nothing else,
when you unmount the filesystem.) And you won't be able to set any
RAID metadata flags then, as the daemon won't be running. So, doing
a later run of "mdmon /proc/mdstat" doesn't fully protect you.

Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                         ` <20090206200818.GC6150-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
@ 2009-02-06 20:21                                           ` NeilBrown
       [not found]                                             ` <8c48d75b834c74adc39b6e904a44237e.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
  2009-02-06 20:26                                           ` Dan Williams
  1 sibling, 1 reply; 24+ messages in thread
From: NeilBrown @ 2009-02-06 20:21 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Dan Williams, Danecki, Jacek, Jeremy Katz,
	initramfs-u79uwXL29TY76Z2rM5mHXA

On Sat, February 7, 2009 7:08 am, Bill Nottingham wrote:
> Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said:
>> Actually no, you're not necessarily stuck with the mdmon from boot.  In
>> a pinch you could "mdmon /proc/mdstat /".
>
> Not really.
>
> You state:
>
>> One might say "just set the dirty bit, terminate, and wait for the
>> mdmon in the rootfs to take over".  The problem is that a disk could
>> fail in this window, and this event needs to be handled before the
>> kernel does anything else to the array.
> ...
>> The clean bit can be set as soon as the parity data is in sync with
>> the data on the other drives.  We typically wait for some period of
>> write-inactivity to avoid needlessly touching the metadata after every
>> write.
>
> You shut down the machine. After a while, you get to the point where
> you're getting ready to unmount the filesystem. Since mdmon's running
> on it (if you started it post boot), you have to kill it. After that
> point, there are going to be writes (a final sync, if nothing else,
> when you unmount the filesystem.) And you won't be able to set any
> RAID metadata flags then, as the daemon won't be running. So, doing
> a later run of "mdmon /proc/mdstat" doesn't fully protect you.

???
Last time I checked, Linux would not unmount the root filesystem.
It just remounts it 'read-only'.
Is that going to change?

NeilBrown


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                         ` <20090206200818.GC6150-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
  2009-02-06 20:21                                           ` NeilBrown
@ 2009-02-06 20:26                                           ` Dan Williams
       [not found]                                             ` <e9c3a7c20902061226m3f1e9e55pc2986a8527ade77-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 24+ messages in thread
From: Dan Williams @ 2009-02-06 20:26 UTC (permalink / raw)
  To: Bill Nottingham
  Cc: Danecki, Jacek, Jeremy Katz, initramfs-u79uwXL29TY76Z2rM5mHXA,
	neilb-l3A5Bk7waGM

On Fri, Feb 6, 2009 at 1:08 PM, Bill Nottingham <notting-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said:
>> Actually no, you're not necessarily stuck with the mdmon from boot.  In
>> a pinch you could "mdmon /proc/mdstat /".
>
> Not really.
>
> You state:
>
>> One might say "just set the dirty bit, terminate, and wait for the
>> mdmon in the rootfs to take over".  The problem is that a disk could
>> fail in this window, and this event needs to be handled before the
>> kernel does anything else to the array.
> ...
>> The clean bit can be set as soon as the parity data is in sync with
>> the data on the other drives.  We typically wait for some period of
>> write-inactivity to avoid needlessly touching the metadata after every
>> write.
>
> You shut down the machine. After a while, you get to the point where
> you're getting ready to unmount the filesystem. Since mdmon's running
> on it (if you started it post boot), you have to kill it. After that
> point, there are going to be writes (a final sync, if nothing else,
> when you unmount the filesystem.) And you won't be able to set any
> RAID metadata flags then, as the daemon won't be running. So, doing
> a later run of "mdmon /proc/mdstat" doesn't fully protect you.
>

mdmon needs some coordination with the shutdown scripts to be kept
alive until the rootfs is marked readonly... actually up until the
point where the rootdev can be marked readonly.

If you take a look at Debian's killall implementation it has
provisions to exclude fuse and other critical userspace processes from
killall.  A similar exclusion is needed for mdmon.
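
A minimal sketch of that kind of exclusion, assuming the shutdown script already has a list of (pid, command name) pairs. This is not Debian's implementation; the function name pids_to_kill() and the default protected list are hypothetical.

```python
def pids_to_kill(processes, protected=("mdmon",)):
    """Return the PIDs to signal, skipping any protected command names.

    processes: iterable of (pid, command name) pairs.
    """
    return [pid for pid, name in processes if name not in protected]


# Hypothetical process table late in shutdown.
procs = [(212, "mdmon"), (340, "sshd"), (355, "crond")]
victims = pids_to_kill(procs)
```

Here mdmon (PID 212) is left running, so it can keep handling metadata updates until the root device is marked read-only.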

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                             ` <8c48d75b834c74adc39b6e904a44237e.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
@ 2009-02-06 20:27                                               ` Bill Nottingham
  0 siblings, 0 replies; 24+ messages in thread
From: Bill Nottingham @ 2009-02-06 20:27 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Williams, Danecki, Jacek, Jeremy Katz,
	initramfs-u79uwXL29TY76Z2rM5mHXA

NeilBrown (neilb-l3A5Bk7waGM@public.gmane.org) said: 
> > You shut down the machine. After a while, you get to the point where
> > you're getting ready to unmount the filesystem. Since mdmon's running
> > on it (if you started it post boot), you have to kill it. After that
> > point, there are going to be writes (a final sync, if nothing else,
> > when you unmount the filesystem.) And you won't be able to set any
> > RAID metadata flags then, as the daemon won't be running. So, doing
> > a later run of "mdmon /proc/mdstat" doesn't fully protect you.
> 
> Last time I checked, Linux would not unmount the root filesystem.
> It just remounts it 'read-only'.
> Is that going to change?

Yeah, I screwed up that part. However, it still syncs, and the mdmon process
will still be dead.

Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
  2009-02-06 20:03                                   ` Bill Nottingham
@ 2009-02-08 19:08                                     ` Szabolcs Szakacsits
  0 siblings, 0 replies; 24+ messages in thread
From: Szabolcs Szakacsits @ 2009-02-08 19:08 UTC (permalink / raw)
  To: initramfs-u79uwXL29TY76Z2rM5mHXA

Bill Nottingham <notting@...> writes: 
> OK, so:
> 
> 1) kernel sends write request. If error....
> 2) <some error occurs>
> 3) kernel sends error to userspace
> 4) mdmon wakes up
> 5) mdmon decides where to record this
> 6) mdmon writes to super blocks
> 7) go to step one, hope you don't hit step 2 this time
> 
> This now means that reliable suspend and resume is completely
> impossible on RAID devices, just as it is on FUSE. 

It's not clear from the context but I suppose you mean only FUSE 
root file systems (e.g. what Ubuntu/WUBI has on NTFS via NTFS-3G).

One solution is to apply the same mechanism that swapfiles use, which
avoids user space completely. The suspend information goes to a
dynamically allocated space (userspace involved) and a statically
allocated space (no userspace involved) on the suspend device.

Regards,  Szaka

--
NTFS-3G: http://ntfs-3g.org


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
  2009-02-06 18:12                           ` Dan Williams
       [not found]                             ` <e9c3a7c20902061012w15a31e7br6ce2074b7b9db555-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-02-08 19:16                             ` Szabolcs Szakacsits
  1 sibling, 0 replies; 24+ messages in thread
From: Szabolcs Szakacsits @ 2009-02-08 19:16 UTC (permalink / raw)
  To: initramfs-u79uwXL29TY76Z2rM5mHXA

Dan Williams <dan.j.williams@...> writes:
> consider what extra tools the initramfs would need if we wanted to
> support an ntfs-3g rootfs.

The FUSE kernel module. Nothing else. Several distros do it.

Regards, Szaka

NTFS-3G: http://ntfs-3g.org 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] raid: external and internal metadata support
       [not found]                                             ` <e9c3a7c20902061226m3f1e9e55pc2986a8527ade77-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-07-09 20:19                                               ` Warren Togami
  0 siblings, 0 replies; 24+ messages in thread
From: Warren Togami @ 2009-07-09 20:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bill Nottingham, Danecki, Jacek, Jeremy Katz,
	initramfs-u79uwXL29TY76Z2rM5mHXA, neilb-l3A5Bk7waGM

On 02/06/2009 03:26 PM, Dan Williams wrote:
> On Fri, Feb 6, 2009 at 1:08 PM, Bill Nottingham<notting-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>  wrote:
>> Dan Williams (dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org) said:
>>> Actually no, you're not necessarily stuck with the mdmon from boot.  In
>>> a pinch you could "mdmon /proc/mdstat /".
>> Not really.
>>
>> You state:
>>
>>> One might say "just set the dirty bit, terminate, and wait for the
>>> mdmon in the rootfs to take over".  The problem is that a disk could
>>> fail in this window, and this event needs to be handled before the
>>> kernel does anything else to the array.
>> ...
>>> The clean bit can be set as soon as the parity data is in sync with
>>> the data on the other drives.  We typically wait for some period of
>>> write-inactivity to avoid needlessly touching the metadata after every
>>> write.
>> You shut down the machine. After a while, you get to the point where
>> you're getting ready to unmount the filesystem. Since mdmon's running
>> on it (if you started it post boot), you have to kill it. After that
>> point, there are going to be writes (a final sync, if nothing else,
>> when you unmount the filesystem.) And you won't be able to set any
>> RAID metadata flags then, as the daemon won't be running. So, doing
>> a later run of "mdmon /proc/mdstat" doesn't fully protect you.
>>
>
> mdmon needs some coordination with the shutdown scripts to be kept
> alive until the rootfs is marked readonly... actually up until the
> point where the rootdev can be marked readonly.
>
> If you take a look at Debian's killall implementation it has
> provisions to exclude fuse and other critical userspace processes from
> killall.  A similar exclusion is needed for mdmon.

It appears we have no solution for this yet in Fedora 12.

https://bugzilla.redhat.com/show_bug.cgi?id=496843
This bug has a similar request for network block devices that need a 
userspace process.

Warren Togami
wtogami-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2009-07-09 20:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-05 22:49 [RFC PATCH 0/3] mdraid rootfs support Dan Williams
     [not found] ` <20090205224808.18610.14957.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
2009-02-05 22:49   ` [RFC PATCH 1/3] gen-mod-lists: create lists of modules that may talk to a root device Dan Williams
2009-02-05 22:49   ` [RFC PATCH 2/3] raid: external and internal metadata support Dan Williams
     [not found]     ` <20090205224920.18610.63979.stgit-p8uTFz9XbKjBPTuBivz2/GFmcEqAMTzPQQ4Iyu8u01E@public.gmane.org>
2009-02-06 16:40       ` Jeremy Katz
     [not found]         ` <20090206164019.GD552-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-02-06 16:50           ` Danecki, Jacek
     [not found]             ` <A9DE54D0CD747C4CB06DCE5B6FA2246F4B496AFA-IGOiFh9zz4yvNW/NfzhIbrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2009-02-06 16:55               ` Dan Williams
2009-02-06 16:56               ` Bill Nottingham
     [not found]                 ` <20090206165601.GF11144-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
2009-02-06 17:27                   ` Dan Williams
     [not found]                     ` <e9c3a7c20902060927j2b900940kd851573469110135-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-02-06 17:38                       ` Bill Nottingham
     [not found]                         ` <20090206173814.GA3541-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
2009-02-06 18:00                           ` Jacek Danecki
     [not found]                             ` <498C7AD8.6080105-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2009-02-06 19:34                               ` NeilBrown
     [not found]                                 ` <2c0cae741a7229789cd777d93180072a.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-02-06 20:03                                   ` Bill Nottingham
2009-02-08 19:08                                     ` Szabolcs Szakacsits
2009-02-06 18:12                           ` Dan Williams
     [not found]                             ` <e9c3a7c20902061012w15a31e7br6ce2074b7b9db555-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-02-06 18:21                               ` Bill Nottingham
     [not found]                                 ` <20090206182118.GA4413-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
2009-02-06 19:19                                   ` Dan Williams
     [not found]                                     ` <e9c3a7c20902061119i2120cc5fpda0a5cdc3aedc17b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-02-06 20:08                                       ` Bill Nottingham
     [not found]                                         ` <20090206200818.GC6150-Zdt1ptygihhQcNjhGXsBABcY2uh10dtjAL8bYrjMMd8@public.gmane.org>
2009-02-06 20:21                                           ` NeilBrown
     [not found]                                             ` <8c48d75b834c74adc39b6e904a44237e.squirrel-eq65iwfR9nKIECXXMXunQA@public.gmane.org>
2009-02-06 20:27                                               ` Bill Nottingham
2009-02-06 20:26                                           ` Dan Williams
     [not found]                                             ` <e9c3a7c20902061226m3f1e9e55pc2986a8527ade77-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-07-09 20:19                                               ` Warren Togami
2009-02-08 19:16                             ` Szabolcs Szakacsits
2009-02-06 18:02           ` Dan Williams
2009-02-05 22:49   ` [RFC PATCH 3/3] add more disk id helpers to udevexe Dan Williams
