[mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised.

All of lore.kernel.org
 help / color / mirror / Atom feed

* [mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised.
  2017-04-20  2:40 [mdadm PATCH 0/4] Assorted mdadm patches NeilBrown
  2017-04-20  2:40 ` [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts NeilBrown
@ 2017-04-20  2:40 ` NeilBrown
  2017-04-20 16:56   ` Jes Sorensen
  2017-04-20  2:40 ` [mdadm PATCH 3/4] Detail: ensure --export names are acceptable as shell variables NeilBrown
  2017-04-20  2:40 ` [mdadm PATCH 4/4] Create: tell udev device is not ready when first created NeilBrown
  3 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-04-20  2:40 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

Grow_continue_command() call verify_reshape_position(), which assumes
that info->sys_name is initialised.
'info' in verify_reshape_position() is 'content' in Grow_continue_command().

In the st->ss->external != 0 branch of that function, sysfs_init() is called
to initialize content->sys_name.
In the st->ss->external == 0 branch, ->sys_name is not initialized so
verify_reshape_position() will not do the right thing.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 Grow.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/Grow.c b/Grow.c
index 15f4ed1253bf..c6967ed1c9c7 100755
--- a/Grow.c
+++ b/Grow.c
@@ -5002,6 +5002,7 @@ int Grow_continue_command(char *devname, int fd,
 			goto Grow_continue_command_exit;
 		}
 		content = &array;
+		sysfs_init(content, fd, NULL);
 		/* Need to load a superblock.
 		 * FIXME we should really get what we need from
 		 * sysfs



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [mdadm PATCH 0/4] Assorted mdadm patches
@ 2017-04-20  2:40 NeilBrown
  2017-04-20  2:40 ` [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: NeilBrown @ 2017-04-20  2:40 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

These are 4 unrelated patches that I've gathered
over the last few weeks.
The last one is probably the most interesting and one
that you should probably think carefully before apply
(but I hope you'll decide in favour).

When you --assemble -or --create an array, udev immediately has a look
at the new device and might act on that content.  e.g. tell udisks it
can mount a filesystem, or tell mdadm that there is part of a RAID
array in here.

When you --assemble an array you want that to happen.
When you --create it, you don't.
udev cannot distinguish 'assemble' from 'create'.
This can be annoying.

The way I have found to tell udev about the distinction to create a
/run/udev/rules.d/01-mdadm.rules file which marks any newly appearing
md array as SYSTEMD_READY==0.  This is created before the array, and
removed once the array exisits (and hopefully has been handled by
udev).
Most udev rules which might process a newly appearing device will
avoid doing so i SYSTEMD_READY is set to 0.

Thanks,
NeilBrown

---

NeilBrown (4):
      Grow_continue_command: ensure 'content' is properly initialised.
      systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts
      Detail: ensure --export names are acceptable as shell variables.
      Create: tell udev device is not ready when first created.

 Create.c                           |    9 +++++++++
 Detail.c                           |   12 +++++++++---
 Grow.c                             |    1 +
 lib.c                              |   22 ++++++++++++++++++++++
 mdadm.h                            |    2 ++
 systemd/mdadm-last-resort@.service |    2 +-
 6 files changed, 44 insertions(+), 4 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts
  2017-04-20  2:40 [mdadm PATCH 0/4] Assorted mdadm patches NeilBrown
@ 2017-04-20  2:40 ` NeilBrown
  2017-04-20 16:57   ` Jes Sorensen
  2017-04-20  2:40 ` [mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised NeilBrown
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-04-20  2:40 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

Commit cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")

added a 'Conflicts' directive to the mdadm-last-resort@.service file in
the hope that this would make sure the service didn't run after the device
was active, even if the timer managed to get started, which is possible in
race conditions.

This seemed to work is testing, but it isn't clear why, and it is known
to cause problems.
If systemd happens to know that the mentioned device is a dependency of a
mount point, the Conflicts can unmount that mountpoint, which is certainly
not wanted.

So remove the "Conflicts" and instead use
 ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

The "sync_action" file exists for any array which requires last-resort
handling, and only appears when the array is activated.  So it is safe
to rely on it to determine if the last-resort is really needed.

Fixes: cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")
Signed-off-by: NeilBrown <neilb@suse.com>
---
 systemd/mdadm-last-resort@.service |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/systemd/mdadm-last-resort@.service b/systemd/mdadm-last-resort@.service
index e93d72b2b45e..f9d4d12738a3 100644
--- a/systemd/mdadm-last-resort@.service
+++ b/systemd/mdadm-last-resort@.service
@@ -1,7 +1,7 @@
 [Unit]
 Description=Activate md array even though degraded
 DefaultDependencies=no
-Conflicts=sys-devices-virtual-block-%i.device
+ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

 [Service]
 Type=oneshot

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [mdadm PATCH 3/4] Detail: ensure --export names are acceptable as shell variables.
  2017-04-20  2:40 [mdadm PATCH 0/4] Assorted mdadm patches NeilBrown
  2017-04-20  2:40 ` [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts NeilBrown
  2017-04-20  2:40 ` [mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised NeilBrown
@ 2017-04-20  2:40 ` NeilBrown
  2017-04-20 16:59   ` Jes Sorensen
  2017-04-20  2:40 ` [mdadm PATCH 4/4] Create: tell udev device is not ready when first created NeilBrown
  3 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-04-20  2:40 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

If an array contains a device which has a name that
contains something other than alphnumerics and underscores,
then some values reported by "mdadm --detail --export" will
not be valid as variable assignment of the shell.
This particularly affects dm devices.
e.g.
   MD_DEVICE_dm-4_ROLE=1
   MD_DEVICE_dm-4_DEV=/dev/dm-4

As it is particularly useful to be able to work with these
in a shell script, and as the precise name is not important,
change all non-alphanumerics to '_'.

   MD_DEVICE_dm_4_ROLE=1
   MD_DEVICE_dm_4_DEV=/dev/dm-4

Signed-off-by: NeilBrown <neilb@suse.com>
---
 Detail.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/Detail.c b/Detail.c
index d4e620438f93..7c3dac8ba8ec 100644
--- a/Detail.c
+++ b/Detail.c
@@ -25,6 +25,7 @@
 #include	"mdadm.h"
 #include	"md_p.h"
 #include	"md_u.h"
+#include	<ctype.h>
 #include	<dirent.h>
 
 static int cmpstringp(const void *p1, const void *p2)
@@ -276,17 +277,22 @@ int Detail(char *dev, struct context *c)
 				char *path =
 					map_dev(mdi->disk.major,
 						mdi->disk.minor, 0);
+				char *sysdev = xstrdup(mdi->sys_name + 1);
+				char *cp;
+				for (cp = sysdev; *cp; cp++)
+					if (!isalnum(*cp))
+						*cp = '_';
 
 				if (mdi->disk.raid_disk >= 0)
 					printf("MD_DEVICE_%s_ROLE=%d\n",
-					       mdi->sys_name+4,
+					       sysdev,
 					       mdi->disk.raid_disk);
 				else
 					printf("MD_DEVICE_%s_ROLE=spare\n",
-					       mdi->sys_name+4);
+					       sysdev);
 				if (path)
 					printf("MD_DEVICE_%s_DEV=%s\n",
-					       mdi->sys_name+4, path);
+					       sysdev, path);
 			}
 		}
 		goto out;



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-20  2:40 [mdadm PATCH 0/4] Assorted mdadm patches NeilBrown
                   ` (2 preceding siblings ...)
  2017-04-20  2:40 ` [mdadm PATCH 3/4] Detail: ensure --export names are acceptable as shell variables NeilBrown
@ 2017-04-20  2:40 ` NeilBrown
  2017-04-20 17:29   ` Jes Sorensen
  3 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-04-20  2:40 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

When an array is created the content is not initializes,
so it could have remnants of an old filesystem or md array
etc on it.
udev will see this and might try to activate it, which is almost
certainly not what is wanted.

So create a temporary udev rules files to set ENV{SYSTEMD_READY}="0"
while the creation event is processed.  This is fairly uniformly
used to suppress actions based on the contents of the device.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 Create.c |    9 +++++++++
 lib.c    |   22 ++++++++++++++++++++++
 mdadm.h  |    2 ++
 3 files changed, 33 insertions(+)

diff --git a/Create.c b/Create.c
index 6ca092449880..915ba6f49fe8 100644
--- a/Create.c
+++ b/Create.c
@@ -605,9 +605,11 @@ int Create(struct supertype *st, char *mddev,
 
 	/* We need to create the device */
 	map_lock(&map);
+	udev_block();
 	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name);
 	if (mdfd < 0) {
 		map_unlock(&map);
+		udev_unblock();
 		return 1;
 	}
 	/* verify if chosen_name is not in use,
@@ -620,6 +622,7 @@ int Create(struct supertype *st, char *mddev,
 			chosen_name);
 		close(mdfd);
 		map_unlock(&map);
+		udev_unblock();
 		return 1;
 	}
 	mddev = chosen_name;
@@ -1053,9 +1056,15 @@ int Create(struct supertype *st, char *mddev,
 		pr_err("not starting array - not enough devices.\n");
 	}
 	close(mdfd);
+	/* Give udev a moment to process the Change event caused
+	 * by the close.
+	 */
+	usleep(100*1000);
+	udev_unblock();
 	return 0;
 
  abort:
+	udev_unblock();
 	map_lock(&map);
  abort_locked:
 	map_remove(&map, fd2devnm(mdfd));
diff --git a/lib.c b/lib.c
index b640634ef6f2..c38dd7cf1660 100644
--- a/lib.c
+++ b/lib.c
@@ -163,6 +163,28 @@ char *fd2devnm(int fd)
 	return NULL;
 }
 
+/* When we create a new array, we don't want the content to
+ * be immediately examined by udev - it is probably meaningless.
+ * So create /run/udev/rules.d/01-mdadm-create.rules to tell udev
+ * that the device isn't ready.
+ */
+static char udev_path[] = "/run/udev/rules.d/01-mdadm-create.rules";
+void udev_block(void)
+{
+	FILE *f;
+
+	f = fopen(udev_path, "w");
+	if (f) {
+		fputs("KERNEL==\"md*\", ENV{SYSTEMD_READY}=\"0\"\n", f);
+		fclose(f);
+	}
+}
+
+void udev_unblock(void)
+{
+	unlink(udev_path);
+}
+
 /*
  * convert a major/minor pair for a block device into a name in /dev, if possible.
  * On the first call, walk /dev collecting name.
diff --git a/mdadm.h b/mdadm.h
index f1f643c794d4..fa4d71fc91f0 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1556,6 +1556,8 @@ extern char *stat2kname(struct stat *st);
 extern char *fd2kname(int fd);
 extern char *stat2devnm(struct stat *st);
 extern char *fd2devnm(int fd);
+extern void udev_block(void);
+extern void udev_unblock(void);
 
 extern int in_initrd(void);
 



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised.
  2017-04-20  2:40 ` [mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised NeilBrown
@ 2017-04-20 16:56   ` Jes Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-04-20 16:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 04/19/2017 10:40 PM, NeilBrown wrote:
> Grow_continue_command() call verify_reshape_position(), which assumes
> that info->sys_name is initialised.
> 'info' in verify_reshape_position() is 'content' in Grow_continue_command().
>
> In the st->ss->external != 0 branch of that function, sysfs_init() is called
> to initialize content->sys_name.
> In the st->ss->external == 0 branch, ->sys_name is not initialized so
> verify_reshape_position() will not do the right thing.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Grow.c |    1 +
>  1 file changed, 1 insertion(+)

Applied!

Thanks,
Jes



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts
  2017-04-20  2:40 ` [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts NeilBrown
@ 2017-04-20 16:57   ` Jes Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-04-20 16:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 04/19/2017 10:40 PM, NeilBrown wrote:
> Commit cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")
>
> added a 'Conflicts' directive to the mdadm-last-resort@.service file in
> the hope that this would make sure the service didn't run after the device
> was active, even if the timer managed to get started, which is possible in
> race conditions.
>
> This seemed to work is testing, but it isn't clear why, and it is known
> to cause problems.
> If systemd happens to know that the mentioned device is a dependency of a
> mount point, the Conflicts can unmount that mountpoint, which is certainly
> not wanted.
>
> So remove the "Conflicts" and instead use
>  ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action
>
> The "sync_action" file exists for any array which requires last-resort
> handling, and only appears when the array is activated.  So it is safe
> to rely on it to determine if the last-resort is really needed.
>
> Fixes: cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  systemd/mdadm-last-resort@.service |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Ohhh obscure systemd conditionals, my favorite thing in the whole World.....

Applied!

Thanks,
Jes



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 3/4] Detail: ensure --export names are acceptable as shell variables.
  2017-04-20  2:40 ` [mdadm PATCH 3/4] Detail: ensure --export names are acceptable as shell variables NeilBrown
@ 2017-04-20 16:59   ` Jes Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-04-20 16:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 04/19/2017 10:40 PM, NeilBrown wrote:
> If an array contains a device which has a name that
> contains something other than alphnumerics and underscores,
> then some values reported by "mdadm --detail --export" will
> not be valid as variable assignment of the shell.
> This particularly affects dm devices.
> e.g.
>    MD_DEVICE_dm-4_ROLE=1
>    MD_DEVICE_dm-4_DEV=/dev/dm-4
>
> As it is particularly useful to be able to work with these
> in a shell script, and as the precise name is not important,
> change all non-alphanumerics to '_'.
>
>    MD_DEVICE_dm_4_ROLE=1
>    MD_DEVICE_dm_4_DEV=/dev/dm-4
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Detail.c |   12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)

Applied!

Thanks,
Jes



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-20  2:40 ` [mdadm PATCH 4/4] Create: tell udev device is not ready when first created NeilBrown
@ 2017-04-20 17:29   ` Jes Sorensen
  2017-04-20 21:35     ` NeilBrown
  0 siblings, 1 reply; 35+ messages in thread
From: Jes Sorensen @ 2017-04-20 17:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 04/19/2017 10:40 PM, NeilBrown wrote:
> When an array is created the content is not initializes,
> so it could have remnants of an old filesystem or md array
> etc on it.
> udev will see this and might try to activate it, which is almost
> certainly not what is wanted.
>
> So create a temporary udev rules files to set ENV{SYSTEMD_READY}="0"
> while the creation event is processed.  This is fairly uniformly
> used to suppress actions based on the contents of the device.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Create.c |    9 +++++++++
>  lib.c    |   22 ++++++++++++++++++++++
>  mdadm.h  |    2 ++
>  3 files changed, 33 insertions(+)

Neil,

I have heard of this problem before, but I have some concerns about the 
solution. First of all, /run/udev/rules.d/ isn't a universal directory. 
At least Fedora doesn't have it, so we need to take that into account. I 
wasn't aware it was possible to create temporary udev rules like this.

Second, isn't this going to be racey if you have multiple arrays 
running? I am wondering if we cannot find a solution that relies on a 
permanently installed udev rule that we enable/disable with systemctl 
and use the device name as an argument?

Thoughts?

Jes

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-20 17:29   ` Jes Sorensen
@ 2017-04-20 21:35     ` NeilBrown
  2017-04-26 10:19       ` Peter Rajnoha
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-04-20 21:35 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 8300 bytes --]

On Thu, Apr 20 2017, Jes Sorensen wrote:

> On 04/19/2017 10:40 PM, NeilBrown wrote:
>> When an array is created the content is not initializes,
>> so it could have remnants of an old filesystem or md array
>> etc on it.
>> udev will see this and might try to activate it, which is almost
>> certainly not what is wanted.
>>
>> So create a temporary udev rules files to set ENV{SYSTEMD_READY}="0"
>> while the creation event is processed.  This is fairly uniformly
>> used to suppress actions based on the contents of the device.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  Create.c |    9 +++++++++
>>  lib.c    |   22 ++++++++++++++++++++++
>>  mdadm.h  |    2 ++
>>  3 files changed, 33 insertions(+)
>
> Neil,
>
> I have heard of this problem before, but I have some concerns about the 
> solution. First of all, /run/udev/rules.d/ isn't a universal directory. 
> At least Fedora doesn't have it, so we need to take that into account. I 
> wasn't aware it was possible to create temporary udev rules like this.

Hmmm.. Debian doesn't have it either.  However I suspect that udev will
still try to read from the directory.  The man page explicitly lists it:

RULES FILES
       The udev rules are read from the files located in the system rules
       directory /usr/lib/udev/rules.d, the volatile runtime directory
       /run/udev/rules.d and the local administration directory
       /etc/udev/rules.d.  ....

So possibly mdadm would need to mkdir("/run/udev/rules.d", 0755) first.
The man page uses the term "volatile" rather than "temporary", but this
possibility does seem to be part of the design of udev.
 

>
> Second, isn't this going to be racey if you have multiple arrays 
> running? I am wondering if we cannot find a solution that relies on a 
> permanently installed udev rule that we enable/disable with systemctl 
> and use the device name as an argument?

There would only be a problematic race of an array as being created at
the same time that another array is being assembled.  Is that at all
likely?  They tend to happen at different parts of the usage cycle...  I
guess we shouldn't assume though.

I admit that blocking *all* arrays was a short cut.  We need to create
the udev rule before the new device is first activated, and keep it in
existence until after the array has been started and the fd on /dev/mdXX
has been closed - and then a bit more because udev doesn't respond
instantly.
In some cases we don't know the name of the md device until
create_mddev() chooses on with find_free_devnum().
So if we want to block udev from handling a device, we need to put the
block into effect (e.g. create the rules.d file) in mddev_create() just
before the call to open_dev_excl().

If we wanted an more permanent udev rule, we would need to record the
devices that should be ignored in the filesystem somewhere else.
Maybe in /run/mdadm.
e.g.

 KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"

Then we could have something like the following (untested) in mdadm.
Does that seem more suitable?

Thanks,
NeilBrown

---
 Assemble.c    |  2 +-
 Build.c       |  2 +-
 Create.c      |  9 ++++++++-
 Incremental.c |  4 ++--
 lib.c         | 29 +++++++++++++++++++++++++++++
 mdadm.h       |  4 +++-
 mdopen.c      |  4 +++-
 7 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index b8285239354b..309841432ff5 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1459,7 +1459,7 @@ try_again:
 			name = strchr(name, ':')+1;
 
 		mdfd = create_mddev(mddev, name, ident->autof, trustworthy,
-				    chosen_name);
+				    chosen_name, 0);
 	}
 	if (mdfd < 0) {
 		st->ss->free_super(st);
diff --git a/Build.c b/Build.c
index 11ba12f4ae7d..665d9067b8d6 100644
--- a/Build.c
+++ b/Build.c
@@ -109,7 +109,7 @@ int Build(char *mddev, struct mddev_dev *devlist,
 	/* We need to create the device.  It can have no name. */
 	map_lock(&map);
 	mdfd = create_mddev(mddev, NULL, c->autof, LOCAL,
-			    chosen_name);
+			    chosen_name, 0);
 	if (mdfd < 0) {
 		map_unlock(&map);
 		return 1;
diff --git a/Create.c b/Create.c
index 6ca092449880..df1bc20c635b 100644
--- a/Create.c
+++ b/Create.c
@@ -605,7 +605,7 @@ int Create(struct supertype *st, char *mddev,
 
 	/* We need to create the device */
 	map_lock(&map);
-	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name);
+	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name, 1);
 	if (mdfd < 0) {
 		map_unlock(&map);
 		return 1;
@@ -620,6 +620,7 @@ int Create(struct supertype *st, char *mddev,
 			chosen_name);
 		close(mdfd);
 		map_unlock(&map);
+		udev_unblock();
 		return 1;
 	}
 	mddev = chosen_name;
@@ -1053,9 +1054,15 @@ int Create(struct supertype *st, char *mddev,
 		pr_err("not starting array - not enough devices.\n");
 	}
 	close(mdfd);
+	/* Give udev a moment to process the Change event caused
+	 * by the close.
+	 */
+	usleep(100*1000);
+	udev_unblock();
 	return 0;
 
  abort:
+	udev_unblock();
 	map_lock(&map);
  abort_locked:
 	map_remove(&map, fd2devnm(mdfd));
diff --git a/Incremental.c b/Incremental.c
index 28f1f7734956..63ed4fa1a88d 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -321,7 +321,7 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 		/* Couldn't find an existing array, maybe make a new one */
 		mdfd = create_mddev(match ? match->devname : NULL,
-				    name_to_use, c->autof, trustworthy, chosen_name);
+				    name_to_use, c->autof, trustworthy, chosen_name, 0);
 
 		if (mdfd < 0)
 			goto out_unlock;
@@ -1605,7 +1605,7 @@ static int Incremental_container(struct supertype *st, char *devname,
 					    ra->name,
 					    c->autof,
 					    trustworthy,
-					    chosen_name);
+					    chosen_name, 0);
 		}
 		if (only && (!mp || strcmp(mp->devnm, only) != 0))
 			continue;
diff --git a/lib.c b/lib.c
index b640634ef6f2..d8692cae813d 100644
--- a/lib.c
+++ b/lib.c
@@ -163,6 +163,35 @@ char *fd2devnm(int fd)
 	return NULL;
 }
 
+/* When we create a new array, we don't want the content to
+ * be immediately examined by udev - it is probably meaningless.
+ * So create /run/udev/rules.d/01-mdadm-create.rules to tell udev
+ * that the device isn't ready.
+ */
+static char block_path[] = "/run/mdadm/creating-%s";
+static char *unblock_path = NULL;
+void udev_block(char *devnm)
+{
+	int fd;
+	char *path = NULL;
+
+	xasprintf(&path, block_path, devnm);
+	fd = open(path, O_CREAT|O_RDWR, 0600);
+	if (fd >= 0) {
+		close(fd);
+		unblock_path = path;
+	} else
+		free(path);
+}
+
+void udev_unblock(void)
+{
+	if (unblock_path)
+		unlink(unblock_path);
+	free(unblock_path);
+	unblock_path = NULL;
+}
+
 /*
  * convert a major/minor pair for a block device into a name in /dev, if possible.
  * On the first call, walk /dev collecting name.
diff --git a/mdadm.h b/mdadm.h
index f1f643c794d4..187e60050068 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1522,7 +1522,7 @@ extern char *get_md_name(char *devnm);
 extern char DefaultConfFile[];
 
 extern int create_mddev(char *dev, char *name, int autof, int trustworthy,
-			char *chosen);
+			char *chosen, int block_udev);
 /* values for 'trustworthy' */
 #define	LOCAL	1
 #define	LOCAL_ANY 10
@@ -1556,6 +1556,8 @@ extern char *stat2kname(struct stat *st);
 extern char *fd2kname(int fd);
 extern char *stat2devnm(struct stat *st);
 extern char *fd2devnm(int fd);
+extern void udev_block(char *devnm);
+extern void udev_unblock(void);
 
 extern int in_initrd(void);
 
diff --git a/mdopen.c b/mdopen.c
index 82b97fc90339..1dbcdcddb9f6 100644
--- a/mdopen.c
+++ b/mdopen.c
@@ -135,7 +135,7 @@ void make_parts(char *dev, int cnt)
  */
 
 int create_mddev(char *dev, char *name, int autof, int trustworthy,
-		 char *chosen)
+		 char *chosen, int block_udev)
 {
 	int mdfd;
 	struct stat stb;
@@ -414,6 +414,8 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
 				make_parts(chosen, parts);
 		}
 	}
+	if (block_udev)
+		udev_block(devnm);
 	mdfd = open_dev_excl(devnm);
 	if (mdfd < 0)
 		pr_err("unexpected failure opening %s\n",
-- 
2.12.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-20 21:35     ` NeilBrown
@ 2017-04-26 10:19       ` Peter Rajnoha
  2017-04-28  3:55         ` NeilBrown
  2017-04-28  5:05         ` [mdadm PATCH] Create: tell udev md " NeilBrown
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Rajnoha @ 2017-04-26 10:19 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid, dm-devel

On 04/20/2017 11:35 PM, NeilBrown wrote:
> On Thu, Apr 20 2017, Jes Sorensen wrote:
...
>> Second, isn't this going to be racey if you have multiple arrays 
>> running? I am wondering if we cannot find a solution that relies on a 
>> permanently installed udev rule that we enable/disable with systemctl 
>> and use the device name as an argument?
> 
> There would only be a problematic race of an array as being created at
> the same time that another array is being assembled.  Is that at all
> likely?  They tend to happen at different parts of the usage cycle...  I
> guess we shouldn't assume though.
> 
> I admit that blocking *all* arrays was a short cut.  We need to create
> the udev rule before the new device is first activated, and keep it in
> existence until after the array has been started and the fd on /dev/mdXX
> has been closed - and then a bit more because udev doesn't respond
> instantly.
> In some cases we don't know the name of the md device until
> create_mddev() chooses on with find_free_devnum().
> So if we want to block udev from handling a device, we need to put the
> block into effect (e.g. create the rules.d file) in mddev_create() just
> before the call to open_dev_excl().
> 
> If we wanted an more permanent udev rule, we would need to record the
> devices that should be ignored in the filesystem somewhere else.
> Maybe in /run/mdadm.
> e.g.
> 
>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
> 
> Then we could have something like the following (untested) in mdadm.
> Does that seem more suitable?
> 

Yes, please, if possible, go for a permanent udev rule surely - this
will make it much easier for other foreign tools to hook in properly if
needed and it would also be much easier to debug.

But, wouldn't it be better if we could just pass this information ("not
initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
driver in kernel could add the variable to the uevent it generates which
userspace udev rules could check for easily. This way, we don't need to
hassle with creating files in filesystem and the information would be
directly a part of the uevent the md kernel driver generates (so
directly accessible in udev rules too). Also, possibly adding more
variables for other future scenarios if needed.

We use something similar in device-mapper already where we pass flags
(besides other things) as a device-mapper ioctl parameter which then
gets into the uevent as DM_COOKIE uevent variable (which in turn the
udev rules can decode into individual flags). This method proved to be
working well for us to solve the problem of uninitialized data area so
we don't get into trouble with scans from udev (see also dm_ioctl
structure in usr/include/linux/dm-ioctl.h and the dm_kobject_uevent
function in drivers/md/dm.c for the kernel part; in udev rules the flags
are then various DM_UDEV_* variables we check for)

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-26 10:19       ` Peter Rajnoha
@ 2017-04-28  3:55         ` NeilBrown
  2017-04-28  9:08           ` Peter Rajnoha
  2017-04-28  5:05         ` [mdadm PATCH] Create: tell udev md " NeilBrown
  1 sibling, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-04-28  3:55 UTC (permalink / raw)
  To: Peter Rajnoha, Jes Sorensen; +Cc: linux-raid, dm-devel

[-- Attachment #1: Type: text/plain, Size: 4864 bytes --]

On Wed, Apr 26 2017, Peter Rajnoha wrote:

> On 04/20/2017 11:35 PM, NeilBrown wrote:
>> On Thu, Apr 20 2017, Jes Sorensen wrote:
> ...
>>> Second, isn't this going to be racey if you have multiple arrays 
>>> running? I am wondering if we cannot find a solution that relies on a 
>>> permanently installed udev rule that we enable/disable with systemctl 
>>> and use the device name as an argument?
>> 
>> There would only be a problematic race of an array as being created at
>> the same time that another array is being assembled.  Is that at all
>> likely?  They tend to happen at different parts of the usage cycle...  I
>> guess we shouldn't assume though.
>> 
>> I admit that blocking *all* arrays was a short cut.  We need to create
>> the udev rule before the new device is first activated, and keep it in
>> existence until after the array has been started and the fd on /dev/mdXX
>> has been closed - and then a bit more because udev doesn't respond
>> instantly.
>> In some cases we don't know the name of the md device until
>> create_mddev() chooses on with find_free_devnum().
>> So if we want to block udev from handling a device, we need to put the
>> block into effect (e.g. create the rules.d file) in mddev_create() just
>> before the call to open_dev_excl().
>> 
>> If we wanted an more permanent udev rule, we would need to record the
>> devices that should be ignored in the filesystem somewhere else.
>> Maybe in /run/mdadm.
>> e.g.
>> 
>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>> 
>> Then we could have something like the following (untested) in mdadm.
>> Does that seem more suitable?
>> 
>
> Yes, please, if possible, go for a permanent udev rule surely - this
> will make it much easier for other foreign tools to hook in properly if
> needed and it would also be much easier to debug.

I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
proper patch.

>
> But, wouldn't it be better if we could just pass this information ("not
> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
> driver in kernel could add the variable to the uevent it generates which
> userspace udev rules could check for easily. This way, we don't need to
> hassle with creating files in filesystem and the information would be
> directly a part of the uevent the md kernel driver generates (so
> directly accessible in udev rules too). Also, possibly adding more
> variables for other future scenarios if needed.

When would we clear the "not initialised yet" flag in the kernel, and
how?  And would that be enough.

When mdadm creates an md array, at least 3 uevents get generated.
The first is generated when the md device object is created, either by
opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
The second is generated when the array is started and the content is
visible.
The third is generated when mdadm closes the file descriptor.  It opens
/dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
Because udev uses inotify to watch for a 'close for a writable file
descriptor', this generates another event.

We need to ensure that none of these cause udev to run anything that
inspects the contents of the array.
Of the three, only the second one is directly under the control of the
md module, so only that one can add an environment variable.

It might be possible to avoid the last one (by not opening for write),
but I cannot see a way to avoid the first one.

I don't think that making a file appear in /run is really very different
from making a variable appear in a uevent.   If the variable were
describing the event itself there would be a different, but it would be
describing the state of the device.  So the only important difference is
"which is easier to work with".  I think creating an deleting a file is
easier to setting and clearing a variable.

Thanks,
NeilBrown


>
> We use something similar in device-mapper already where we pass flags
> (besides other things) as a device-mapper ioctl parameter which then
> gets into the uevent as DM_COOKIE uevent variable (which in turn the
> udev rules can decode into individual flags). This method proved to be
> working well for us to solve the problem of uninitialized data area so
> we don't get into trouble with scans from udev (see also dm_ioctl
> structure in usr/include/linux/dm-ioctl.h and the dm_kobject_uevent
> function in drivers/md/dm.c for the kernel part; in udev rules the flags
> are then various DM_UDEV_* variables we check for)
>
> -- 
> Peter
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-04-26 10:19       ` Peter Rajnoha
  2017-04-28  3:55         ` NeilBrown
@ 2017-04-28  5:05         ` NeilBrown
  2017-04-28  9:28           ` Peter Rajnoha
                             ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: NeilBrown @ 2017-04-28  5:05 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Peter Rajnoha, linux-raid, dm-devel

[-- Attachment #1: Type: text/plain, Size: 9293 bytes --]


When an array is created the content is not initialized,
so it could have remnants of an old filesystem or md array
etc on it.
udev will see this and might try to activate it, which is almost
certainly not what is wanted.

So create a mechanism for mdadm to communicate with udev to tell
it that the device isn't ready.  This mechanism is the existance
of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.

When creating an array, mdadm will create the file.
A new udev rule file, 01-md-raid-creating.rules, will detect the
precense of thst file and set ENV{SYSTEMD_READY}="0".
This is fairly uniformly used to suppress actions based on the
contents of the device.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 Assemble.c                  |  2 +-
 Build.c                     |  2 +-
 Create.c                    |  9 +++++++-
 Incremental.c               |  4 ++--
 Makefile                    |  4 ++--
 lib.c                       | 29 +++++++++++++++++++++++++
 mdadm.h                     |  4 +++-
 mdopen.c                    | 52 ++++++++++++++++++++++++++++-----------------
 udev-md-raid-creating.rules |  7 ++++++
 9 files changed, 86 insertions(+), 27 deletions(-)
 create mode 100644 udev-md-raid-creating.rules

diff --git a/Assemble.c b/Assemble.c
index d6beb23da9c5..a9442c8ce73b 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1478,7 +1478,7 @@ try_again:
 			name = strchr(name, ':')+1;
 
 		mdfd = create_mddev(mddev, name, ident->autof, trustworthy,
-				    chosen_name);
+				    chosen_name, 0);
 	}
 	if (mdfd < 0) {
 		st->ss->free_super(st);
diff --git a/Build.c b/Build.c
index 11ba12f4ae7d..665d9067b8d6 100644
--- a/Build.c
+++ b/Build.c
@@ -109,7 +109,7 @@ int Build(char *mddev, struct mddev_dev *devlist,
 	/* We need to create the device.  It can have no name. */
 	map_lock(&map);
 	mdfd = create_mddev(mddev, NULL, c->autof, LOCAL,
-			    chosen_name);
+			    chosen_name, 0);
 	if (mdfd < 0) {
 		map_unlock(&map);
 		return 1;
diff --git a/Create.c b/Create.c
index 6ca092449880..df1bc20c635b 100644
--- a/Create.c
+++ b/Create.c
@@ -605,7 +605,7 @@ int Create(struct supertype *st, char *mddev,
 
 	/* We need to create the device */
 	map_lock(&map);
-	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name);
+	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name, 1);
 	if (mdfd < 0) {
 		map_unlock(&map);
 		return 1;
@@ -620,6 +620,7 @@ int Create(struct supertype *st, char *mddev,
 			chosen_name);
 		close(mdfd);
 		map_unlock(&map);
+		udev_unblock();
 		return 1;
 	}
 	mddev = chosen_name;
@@ -1053,9 +1054,15 @@ int Create(struct supertype *st, char *mddev,
 		pr_err("not starting array - not enough devices.\n");
 	}
 	close(mdfd);
+	/* Give udev a moment to process the Change event caused
+	 * by the close.
+	 */
+	usleep(100*1000);
+	udev_unblock();
 	return 0;
 
  abort:
+	udev_unblock();
 	map_lock(&map);
  abort_locked:
 	map_remove(&map, fd2devnm(mdfd));
diff --git a/Incremental.c b/Incremental.c
index 28f1f7734956..63ed4fa1a88d 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -321,7 +321,7 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 		/* Couldn't find an existing array, maybe make a new one */
 		mdfd = create_mddev(match ? match->devname : NULL,
-				    name_to_use, c->autof, trustworthy, chosen_name);
+				    name_to_use, c->autof, trustworthy, chosen_name, 0);
 
 		if (mdfd < 0)
 			goto out_unlock;
@@ -1605,7 +1605,7 @@ static int Incremental_container(struct supertype *st, char *devname,
 					    ra->name,
 					    c->autof,
 					    trustworthy,
-					    chosen_name);
+					    chosen_name, 0);
 		}
 		if (only && (!mp || strcmp(mp->devnm, only) != 0))
 			continue;
diff --git a/Makefile b/Makefile
index 685069612617..021d3adf3ed1 100644
--- a/Makefile
+++ b/Makefile
@@ -256,8 +256,8 @@ install-man: mdadm.8 md.4 mdadm.conf.5 mdmon.8
 	$(INSTALL) -D -m 644 md.4 $(DESTDIR)$(MAN4DIR)/md.4
 	$(INSTALL) -D -m 644 mdadm.conf.5 $(DESTDIR)$(MAN5DIR)/mdadm.conf.5
 
-install-udev: udev-md-raid-arrays.rules udev-md-raid-assembly.rules
-	@for file in 63-md-raid-arrays.rules 64-md-raid-assembly.rules ; \
+install-udev: udev-md-raid-arrays.rules udev-md-raid-assembly.rules udev-md-raid-creating.rules
+	@for file in 01-md-raid-creating.rules 63-md-raid-arrays.rules 64-md-raid-assembly.rules ; \
 	do sed -e 's,BINDIR,$(BINDIR),g' udev-$${file#??-} > .install.tmp.1 && \
 	   $(ECHO) $(INSTALL) -D -m 644 udev-$${file#??-} $(DESTDIR)$(UDEVDIR)/rules.d/$$file ; \
 	   $(INSTALL) -D -m 644 .install.tmp.1 $(DESTDIR)$(UDEVDIR)/rules.d/$$file ; \
diff --git a/lib.c b/lib.c
index b640634ef6f2..7e44b1f27fcc 100644
--- a/lib.c
+++ b/lib.c
@@ -163,6 +163,35 @@ char *fd2devnm(int fd)
 	return NULL;
 }
 
+/* When we create a new array, we don't want the content to
+ * be immediately examined by udev - it is probably meaningless.
+ * So create /run/mdadm/creating-FOO and expect that a udev
+ * rule will noticed this and act accordingly.
+ */
+static char block_path[] = "/run/mdadm/creating-%s";
+static char *unblock_path = NULL;
+void udev_block(char *devnm)
+{
+	int fd;
+	char *path = NULL;
+
+	xasprintf(&path, block_path, devnm);
+	fd = open(path, O_CREAT|O_RDWR, 0600);
+	if (fd >= 0) {
+		close(fd);
+		unblock_path = path;
+	} else
+		free(path);
+}
+
+void udev_unblock(void)
+{
+	if (unblock_path)
+		unlink(unblock_path);
+	free(unblock_path);
+	unblock_path = NULL;
+}
+
 /*
  * convert a major/minor pair for a block device into a name in /dev, if possible.
  * On the first call, walk /dev collecting name.
diff --git a/mdadm.h b/mdadm.h
index 1bbacfe9e916..6a382a7c1b90 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1533,7 +1533,7 @@ extern char *get_md_name(char *devnm);
 extern char DefaultConfFile[];
 
 extern int create_mddev(char *dev, char *name, int autof, int trustworthy,
-			char *chosen);
+			char *chosen, int block_udev);
 /* values for 'trustworthy' */
 #define	LOCAL	1
 #define	LOCAL_ANY 10
@@ -1567,6 +1567,8 @@ extern char *stat2kname(struct stat *st);
 extern char *fd2kname(int fd);
 extern char *stat2devnm(struct stat *st);
 extern char *fd2devnm(int fd);
+extern void udev_block(char *devnm);
+extern void udev_unblock(void);
 
 extern int in_initrd(void);
 
diff --git a/mdopen.c b/mdopen.c
index 82b97fc90339..099efa0aa2e5 100644
--- a/mdopen.c
+++ b/mdopen.c
@@ -135,7 +135,7 @@ void make_parts(char *dev, int cnt)
  */
 
 int create_mddev(char *dev, char *name, int autof, int trustworthy,
-		 char *chosen)
+		 char *chosen, int block_udev)
 {
 	int mdfd;
 	struct stat stb;
@@ -147,6 +147,10 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
 	char devname[37];
 	char devnm[32];
 	char cbuf[400];
+
+	if (!use_udev())
+		block_udev = 0;
+
 	if (chosen == NULL)
 		chosen = cbuf;
 
@@ -305,43 +309,53 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
 		int fd;
 		int n = -1;
 		sprintf(devnm, "md_%s", cname);
+		if (block_udev)
+			udev_block(devnm);
 		fd = open("/sys/module/md_mod/parameters/new_array", O_WRONLY);
 		if (fd >= 0) {
 			n = write(fd, devnm, strlen(devnm));
 			close(fd);
 		}
-		if (n < 0)
+		if (n < 0) {
 			devnm[0] = 0;
+			udev_unblock();
+		}
 	}
 	if (num >= 0) {
 		int fd;
 		int n = -1;
 		sprintf(devnm, "md%d", num);
+		if (block_udev)
+			udev_block(devnm);
 		fd = open("/sys/module/md_mod/parameters/new_array", O_WRONLY);
 		if (fd >= 0) {
 			n = write(fd, devnm, strlen(devnm));
 			close(fd);
 		}
-		if (n < 0)
+		if (n < 0) {
 			devnm[0] = 0;
-	}
-	if (devnm[0])
-		;
-	else if (num < 0) {
-		/* need to choose a free number. */
-		char *_devnm = find_free_devnm(use_mdp);
-		if (_devnm == NULL) {
-			pr_err("No avail md devices - aborting\n");
-			return -1;
+			udev_unblock();
 		}
-		strcpy(devnm, _devnm);
-	} else {
-		sprintf(devnm, "%s%d", use_mdp?"md_d":"md", num);
-		if (mddev_busy(devnm)) {
-			pr_err("%s is already in use.\n",
-				dev);
-			return -1;
+	}
+	if (devnm[0] == 0) {
+		if (num < 0) {
+			/* need to choose a free number. */
+			char *_devnm = find_free_devnm(use_mdp);
+			if (_devnm == NULL) {
+				pr_err("No avail md devices - aborting\n");
+				return -1;
+			}
+			strcpy(devnm, _devnm);
+		} else {
+			sprintf(devnm, "%s%d", use_mdp?"md_d":"md", num);
+			if (mddev_busy(devnm)) {
+				pr_err("%s is already in use.\n",
+				       dev);
+				return -1;
+			}
 		}
+		if (block_udev)
+			udev_block(devnm);
 	}
 
 	sprintf(devname, "/dev/%s", devnm);
diff --git a/udev-md-raid-creating.rules b/udev-md-raid-creating.rules
new file mode 100644
index 000000000000..2be466bcefd2
--- /dev/null
+++ b/udev-md-raid-creating.rules
@@ -0,0 +1,7 @@
+# do not edit this file, it will be overwritten on update
+# While mdadm is creating an array, it creates a file
+# /run/mdadm/creating-mdXXX.  If that file exists, then
+# the array is not "ready" and we should make sure the
+# content is ignored.
+
+KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
-- 
2.12.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-28  3:55         ` NeilBrown
@ 2017-04-28  9:08           ` Peter Rajnoha
  2017-05-01  4:35             ` [dm-devel] " NeilBrown
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Rajnoha @ 2017-04-28  9:08 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid, dm-devel

On 04/28/2017 05:55 AM, NeilBrown wrote:
> On Wed, Apr 26 2017, Peter Rajnoha wrote:
> 
>> On 04/20/2017 11:35 PM, NeilBrown wrote:
>>> If we wanted an more permanent udev rule, we would need to record the
>>> devices that should be ignored in the filesystem somewhere else.
>>> Maybe in /run/mdadm.
>>> e.g.
>>>
>>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>>>
>>> Then we could have something like the following (untested) in mdadm.
>>> Does that seem more suitable?
>>>
>>
>> Yes, please, if possible, go for a permanent udev rule surely - this
>> will make it much easier for other foreign tools to hook in properly if
>> needed and it would also be much easier to debug.
> 
> I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
> proper patch.
> 
>>
>> But, wouldn't it be better if we could just pass this information ("not
>> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
>> driver in kernel could add the variable to the uevent it generates which
>> userspace udev rules could check for easily. This way, we don't need to
>> hassle with creating files in filesystem and the information would be
>> directly a part of the uevent the md kernel driver generates (so
>> directly accessible in udev rules too). Also, possibly adding more
>> variables for other future scenarios if needed.
> 
> When would we clear the "not initialised yet" flag in the kernel, and
> how?  And would that be enough.

The flag wouldn't be stored in kernel, md kernel driver would just pass
the flag with the uevent as it received in with ioctl/sysfs request to
create a new dev. The udev in userspace would handle the state
transition then from "flagged as not-initialized" state to "initilized"
by checking the sequence of events as they come.

We should have this sequence I assume:

  1) "add" (creating dev, not usable yet)
  2) "change" (activated dev, but not initialized yet)
  3) "synthetic change" (after wiping the dev and closing it, the WATCH
rule fires)

> 
> When mdadm creates an md array, at least 3 uevents get generated.
> The first is generated when the md device object is created, either by
> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
> The second is generated when the array is started and the content is
> visible.
> The third is generated when mdadm closes the file descriptor.  It opens
> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
> Because udev uses inotify to watch for a 'close for a writable file
> descriptor', this generates another event.
> 
> We need to ensure that none of these cause udev to run anything that
> inspects the contents of the array.
> Of the three, only the second one is directly under the control of the
> md module, so only that one can add an environment variable.
> 
> It might be possible to avoid the last one (by not opening for write),
> but I cannot see a way to avoid the first one.

So the first event is the "add" event ("md device object created") - in
this case, the device is not yet usable anyway I suppose, so we can skip
udev scans for this event right away (unless it's the synthetic "add"
event which is generated by writing "add" to /sys/block/.../uevent file
or alternatively using udevadm trigger - but we should be able to
recognize this event "synthetic add event" because it always comes after
the activating "change" event, otherwise we can skip scans).

The second event, which is the "change" event, would be marked with the
new "not initialized" flag. And so we skip the scans in udev too.

Then mdadm opens the devive, clears any old content/signatures the data
area may contain, then closes it - this generates the third event -
which is the "synthetic change" event (as a result of the inotify WATCH
rule). And this one would drop the "not initialized" flag in udev db and
the scans in udev are now enabled.

So we should be able to handle all three kinds of events I think.

Now, as for even better synthetic event recognition, I've proposed a
patch recently where udev as well as any other tool generating these
synthetic events can add variables for more detailed event
identification, so this one should help us in the future even more in
these situations: https://lkml.org/lkml/2017/3/15/461. With this, we can
even disable the WATCH rule till the device is properly initialized and
the tool can generate the event itself by writing the
/sys/block/.../uevent file with a variable that the tool can then wait
for even (so the tool doesn't exit till the device is not properly
initialized). Once this initialization is all done, the WATCH rule can
be enabled for the dev. Also, with this, we don't need to be afraid that
some other tool fired the WATCH rule by chance if it opened the dev for
RW and closed it by mistake before we had a chance to initialize it
(which would fire the synthetic change event before the
wiping/initialization).

> 
> I don't think that making a file appear in /run is really very different
> from making a variable appear in a uevent.   If the variable were
> describing the event itself there would be a different, but it would be
> describing the state of the device.  So the only important difference is
> "which is easier to work with".  I think creating an deleting a file is
> easier to setting and clearing a variable.
> 

Yes, I agree that it's an alternative solution - it definitely and
surely improves current situation, either if we choose to write the file
or pass the flag in the uevent directly. It's just that with the
information written in filesystem, we have something external to check
for in addition to processing the uevent variables while we don't need
to, I think. As I described the sequence of events above, I think we
should be able to recognize the events properly and we should be able to
drop the flag automatically.

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-04-28  5:05         ` [mdadm PATCH] Create: tell udev md " NeilBrown
@ 2017-04-28  9:28           ` Peter Rajnoha
  2017-05-02 13:32             ` Jes Sorensen
  2017-05-02 13:42           ` Jes Sorensen
  2017-05-04 10:58           ` Peter Rajnoha
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Rajnoha @ 2017-04-28  9:28 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid, dm-devel

On 04/28/2017 07:05 AM, NeilBrown wrote:
> 
> When an array is created the content is not initialized,
> so it could have remnants of an old filesystem or md array
> etc on it.
> udev will see this and might try to activate it, which is almost
> certainly not what is wanted.
> 
> So create a mechanism for mdadm to communicate with udev to tell
> it that the device isn't ready.  This mechanism is the existance
> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
> 
> When creating an array, mdadm will create the file.
> A new udev rule file, 01-md-raid-creating.rules, will detect the
> precense of thst file and set ENV{SYSTEMD_READY}="0".
> This is fairly uniformly used to suppress actions based on the
> contents of the device.
> 

The scans in udev are primarily directed by blkid call which detects the
signatures and based on this information various other udev rules fire.

The blkid as well as wipefs uses common libblkid library to detect these
signatures - is mdadm going to use libblkid to wipe the signatures on MD
device initialization or is it relying on external tools to do this? How
is mdadm actually initializing/wiping the newly created MD device?

-- 
Peter

> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Assemble.c                  |  2 +-
>  Build.c                     |  2 +-
>  Create.c                    |  9 +++++++-
>  Incremental.c               |  4 ++--
>  Makefile                    |  4 ++--
>  lib.c                       | 29 +++++++++++++++++++++++++
>  mdadm.h                     |  4 +++-
>  mdopen.c                    | 52 ++++++++++++++++++++++++++++-----------------
>  udev-md-raid-creating.rules |  7 ++++++
>  9 files changed, 86 insertions(+), 27 deletions(-)
>  create mode 100644 udev-md-raid-creating.rules
> 
> diff --git a/Assemble.c b/Assemble.c
> index d6beb23da9c5..a9442c8ce73b 100644
> --- a/Assemble.c
> +++ b/Assemble.c
> @@ -1478,7 +1478,7 @@ try_again:
>  			name = strchr(name, ':')+1;
>  
>  		mdfd = create_mddev(mddev, name, ident->autof, trustworthy,
> -				    chosen_name);
> +				    chosen_name, 0);
>  	}
>  	if (mdfd < 0) {
>  		st->ss->free_super(st);
> diff --git a/Build.c b/Build.c
> index 11ba12f4ae7d..665d9067b8d6 100644
> --- a/Build.c
> +++ b/Build.c
> @@ -109,7 +109,7 @@ int Build(char *mddev, struct mddev_dev *devlist,
>  	/* We need to create the device.  It can have no name. */
>  	map_lock(&map);
>  	mdfd = create_mddev(mddev, NULL, c->autof, LOCAL,
> -			    chosen_name);
> +			    chosen_name, 0);
>  	if (mdfd < 0) {
>  		map_unlock(&map);
>  		return 1;
> diff --git a/Create.c b/Create.c
> index 6ca092449880..df1bc20c635b 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -605,7 +605,7 @@ int Create(struct supertype *st, char *mddev,
>  
>  	/* We need to create the device */
>  	map_lock(&map);
> -	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name);
> +	mdfd = create_mddev(mddev, name, c->autof, LOCAL, chosen_name, 1);
>  	if (mdfd < 0) {
>  		map_unlock(&map);
>  		return 1;
> @@ -620,6 +620,7 @@ int Create(struct supertype *st, char *mddev,
>  			chosen_name);
>  		close(mdfd);
>  		map_unlock(&map);
> +		udev_unblock();
>  		return 1;
>  	}
>  	mddev = chosen_name;
> @@ -1053,9 +1054,15 @@ int Create(struct supertype *st, char *mddev,
>  		pr_err("not starting array - not enough devices.\n");
>  	}
>  	close(mdfd);
> +	/* Give udev a moment to process the Change event caused
> +	 * by the close.
> +	 */
> +	usleep(100*1000);
> +	udev_unblock();
>  	return 0;
>  
>   abort:
> +	udev_unblock();
>  	map_lock(&map);
>   abort_locked:
>  	map_remove(&map, fd2devnm(mdfd));
> diff --git a/Incremental.c b/Incremental.c
> index 28f1f7734956..63ed4fa1a88d 100644
> --- a/Incremental.c
> +++ b/Incremental.c
> @@ -321,7 +321,7 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
>  
>  		/* Couldn't find an existing array, maybe make a new one */
>  		mdfd = create_mddev(match ? match->devname : NULL,
> -				    name_to_use, c->autof, trustworthy, chosen_name);
> +				    name_to_use, c->autof, trustworthy, chosen_name, 0);
>  
>  		if (mdfd < 0)
>  			goto out_unlock;
> @@ -1605,7 +1605,7 @@ static int Incremental_container(struct supertype *st, char *devname,
>  					    ra->name,
>  					    c->autof,
>  					    trustworthy,
> -					    chosen_name);
> +					    chosen_name, 0);
>  		}
>  		if (only && (!mp || strcmp(mp->devnm, only) != 0))
>  			continue;
> diff --git a/Makefile b/Makefile
> index 685069612617..021d3adf3ed1 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -256,8 +256,8 @@ install-man: mdadm.8 md.4 mdadm.conf.5 mdmon.8
>  	$(INSTALL) -D -m 644 md.4 $(DESTDIR)$(MAN4DIR)/md.4
>  	$(INSTALL) -D -m 644 mdadm.conf.5 $(DESTDIR)$(MAN5DIR)/mdadm.conf.5
>  
> -install-udev: udev-md-raid-arrays.rules udev-md-raid-assembly.rules
> -	@for file in 63-md-raid-arrays.rules 64-md-raid-assembly.rules ; \
> +install-udev: udev-md-raid-arrays.rules udev-md-raid-assembly.rules udev-md-raid-creating.rules
> +	@for file in 01-md-raid-creating.rules 63-md-raid-arrays.rules 64-md-raid-assembly.rules ; \
>  	do sed -e 's,BINDIR,$(BINDIR),g' udev-$${file#??-} > .install.tmp.1 && \
>  	   $(ECHO) $(INSTALL) -D -m 644 udev-$${file#??-} $(DESTDIR)$(UDEVDIR)/rules.d/$$file ; \
>  	   $(INSTALL) -D -m 644 .install.tmp.1 $(DESTDIR)$(UDEVDIR)/rules.d/$$file ; \
> diff --git a/lib.c b/lib.c
> index b640634ef6f2..7e44b1f27fcc 100644
> --- a/lib.c
> +++ b/lib.c
> @@ -163,6 +163,35 @@ char *fd2devnm(int fd)
>  	return NULL;
>  }
>  
> +/* When we create a new array, we don't want the content to
> + * be immediately examined by udev - it is probably meaningless.
> + * So create /run/mdadm/creating-FOO and expect that a udev
> + * rule will noticed this and act accordingly.
> + */
> +static char block_path[] = "/run/mdadm/creating-%s";
> +static char *unblock_path = NULL;
> +void udev_block(char *devnm)
> +{
> +	int fd;
> +	char *path = NULL;
> +
> +	xasprintf(&path, block_path, devnm);
> +	fd = open(path, O_CREAT|O_RDWR, 0600);
> +	if (fd >= 0) {
> +		close(fd);
> +		unblock_path = path;
> +	} else
> +		free(path);
> +}
> +
> +void udev_unblock(void)
> +{
> +	if (unblock_path)
> +		unlink(unblock_path);
> +	free(unblock_path);
> +	unblock_path = NULL;
> +}
> +
>  /*
>   * convert a major/minor pair for a block device into a name in /dev, if possible.
>   * On the first call, walk /dev collecting name.
> diff --git a/mdadm.h b/mdadm.h
> index 1bbacfe9e916..6a382a7c1b90 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -1533,7 +1533,7 @@ extern char *get_md_name(char *devnm);
>  extern char DefaultConfFile[];
>  
>  extern int create_mddev(char *dev, char *name, int autof, int trustworthy,
> -			char *chosen);
> +			char *chosen, int block_udev);
>  /* values for 'trustworthy' */
>  #define	LOCAL	1
>  #define	LOCAL_ANY 10
> @@ -1567,6 +1567,8 @@ extern char *stat2kname(struct stat *st);
>  extern char *fd2kname(int fd);
>  extern char *stat2devnm(struct stat *st);
>  extern char *fd2devnm(int fd);
> +extern void udev_block(char *devnm);
> +extern void udev_unblock(void);
>  
>  extern int in_initrd(void);
>  
> diff --git a/mdopen.c b/mdopen.c
> index 82b97fc90339..099efa0aa2e5 100644
> --- a/mdopen.c
> +++ b/mdopen.c
> @@ -135,7 +135,7 @@ void make_parts(char *dev, int cnt)
>   */
>  
>  int create_mddev(char *dev, char *name, int autof, int trustworthy,
> -		 char *chosen)
> +		 char *chosen, int block_udev)
>  {
>  	int mdfd;
>  	struct stat stb;
> @@ -147,6 +147,10 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
>  	char devname[37];
>  	char devnm[32];
>  	char cbuf[400];
> +
> +	if (!use_udev())
> +		block_udev = 0;
> +
>  	if (chosen == NULL)
>  		chosen = cbuf;
>  
> @@ -305,43 +309,53 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
>  		int fd;
>  		int n = -1;
>  		sprintf(devnm, "md_%s", cname);
> +		if (block_udev)
> +			udev_block(devnm);
>  		fd = open("/sys/module/md_mod/parameters/new_array", O_WRONLY);
>  		if (fd >= 0) {
>  			n = write(fd, devnm, strlen(devnm));
>  			close(fd);
>  		}
> -		if (n < 0)
> +		if (n < 0) {
>  			devnm[0] = 0;
> +			udev_unblock();
> +		}
>  	}
>  	if (num >= 0) {
>  		int fd;
>  		int n = -1;
>  		sprintf(devnm, "md%d", num);
> +		if (block_udev)
> +			udev_block(devnm);
>  		fd = open("/sys/module/md_mod/parameters/new_array", O_WRONLY);
>  		if (fd >= 0) {
>  			n = write(fd, devnm, strlen(devnm));
>  			close(fd);
>  		}
> -		if (n < 0)
> +		if (n < 0) {
>  			devnm[0] = 0;
> -	}
> -	if (devnm[0])
> -		;
> -	else if (num < 0) {
> -		/* need to choose a free number. */
> -		char *_devnm = find_free_devnm(use_mdp);
> -		if (_devnm == NULL) {
> -			pr_err("No avail md devices - aborting\n");
> -			return -1;
> +			udev_unblock();
>  		}
> -		strcpy(devnm, _devnm);
> -	} else {
> -		sprintf(devnm, "%s%d", use_mdp?"md_d":"md", num);
> -		if (mddev_busy(devnm)) {
> -			pr_err("%s is already in use.\n",
> -				dev);
> -			return -1;
> +	}
> +	if (devnm[0] == 0) {
> +		if (num < 0) {
> +			/* need to choose a free number. */
> +			char *_devnm = find_free_devnm(use_mdp);
> +			if (_devnm == NULL) {
> +				pr_err("No avail md devices - aborting\n");
> +				return -1;
> +			}
> +			strcpy(devnm, _devnm);
> +		} else {
> +			sprintf(devnm, "%s%d", use_mdp?"md_d":"md", num);
> +			if (mddev_busy(devnm)) {
> +				pr_err("%s is already in use.\n",
> +				       dev);
> +				return -1;
> +			}
>  		}
> +		if (block_udev)
> +			udev_block(devnm);
>  	}
>  
>  	sprintf(devname, "/dev/%s", devnm);
> diff --git a/udev-md-raid-creating.rules b/udev-md-raid-creating.rules
> new file mode 100644
> index 000000000000..2be466bcefd2
> --- /dev/null
> +++ b/udev-md-raid-creating.rules
> @@ -0,0 +1,7 @@
> +# do not edit this file, it will be overwritten on update
> +# While mdadm is creating an array, it creates a file
> +# /run/mdadm/creating-mdXXX.  If that file exists, then
> +# the array is not "ready" and we should make sure the
> +# content is ignored.
> +
> +KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-04-28  9:08           ` Peter Rajnoha
@ 2017-05-01  4:35             ` NeilBrown
  2017-05-02 11:40               ` Peter Rajnoha
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-05-01  4:35 UTC (permalink / raw)
  To: Peter Rajnoha, Jes Sorensen; +Cc: linux-raid, dm-devel

[-- Attachment #1: Type: text/plain, Size: 7671 bytes --]

On Fri, Apr 28 2017, Peter Rajnoha wrote:

> On 04/28/2017 05:55 AM, NeilBrown wrote:
>> On Wed, Apr 26 2017, Peter Rajnoha wrote:
>> 
>>> On 04/20/2017 11:35 PM, NeilBrown wrote:
>>>> If we wanted an more permanent udev rule, we would need to record the
>>>> devices that should be ignored in the filesystem somewhere else.
>>>> Maybe in /run/mdadm.
>>>> e.g.
>>>>
>>>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>>>>
>>>> Then we could have something like the following (untested) in mdadm.
>>>> Does that seem more suitable?
>>>>
>>>
>>> Yes, please, if possible, go for a permanent udev rule surely - this
>>> will make it much easier for other foreign tools to hook in properly if
>>> needed and it would also be much easier to debug.
>> 
>> I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
>> proper patch.
>> 
>>>
>>> But, wouldn't it be better if we could just pass this information ("not
>>> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
>>> driver in kernel could add the variable to the uevent it generates which
>>> userspace udev rules could check for easily. This way, we don't need to
>>> hassle with creating files in filesystem and the information would be
>>> directly a part of the uevent the md kernel driver generates (so
>>> directly accessible in udev rules too). Also, possibly adding more
>>> variables for other future scenarios if needed.
>> 
>> When would we clear the "not initialised yet" flag in the kernel, and
>> how?  And would that be enough.
>
> The flag wouldn't be stored in kernel, md kernel driver would just pass
> the flag with the uevent as it received in with ioctl/sysfs request to
> create a new dev. The udev in userspace would handle the state
> transition then from "flagged as not-initialized" state to "initilized"
> by checking the sequence of events as they come.
>
> We should have this sequence I assume:
>
>   1) "add" (creating dev, not usable yet)
>   2) "change" (activated dev, but not initialized yet)
>   3) "synthetic change" (after wiping the dev and closing it, the WATCH
> rule fires)
>

"Should" is arguable, but there are no guarantees of this sequence.

A particular cause of irregular sequencing is when an array is assembled
by the initrd, then after the real root is mounted, something runs
  udevadm trigger --action=add
(e.g. systemd-udev-triggger.service).

In that case, 'add' is the first and only message that the
full-root-available udev sees, so it is not safe to ignore the array as
"not usable yet".


>> 
>> When mdadm creates an md array, at least 3 uevents get generated.
>> The first is generated when the md device object is created, either by
>> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
>> The second is generated when the array is started and the content is
>> visible.
>> The third is generated when mdadm closes the file descriptor.  It opens
>> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
>> Because udev uses inotify to watch for a 'close for a writable file
>> descriptor', this generates another event.
>> 
>> We need to ensure that none of these cause udev to run anything that
>> inspects the contents of the array.
>> Of the three, only the second one is directly under the control of the
>> md module, so only that one can add an environment variable.
>> 
>> It might be possible to avoid the last one (by not opening for write),
>> but I cannot see a way to avoid the first one.
>
> So the first event is the "add" event ("md device object created") - in
> this case, the device is not yet usable anyway I suppose, so we can skip
> udev scans for this event right away (unless it's the synthetic "add"
> event which is generated by writing "add" to /sys/block/.../uevent file
> or alternatively using udevadm trigger - but we should be able to
> recognize this event "synthetic add event" because it always comes after
> the activating "change" event, otherwise we can skip scans).

"not yet usable anyway" is not reliable.  It is very easy for mdadm to
finish making the array usable before udev gets around to processing the
initial "add" event.

You seem to be suggesting that udev rules should try to reverse-engineer
the sequence of events and deduce what is meant from that sequence.  I
doubt that would be very robust.


>
> The second event, which is the "change" event, would be marked with the
> new "not initialized" flag. And so we skip the scans in udev too.

(one obvious problem with this approach is that it cannot work with old
kernels, while my approach only requires an update to mdadm)

>
> Then mdadm opens the devive, clears any old content/signatures the data
> area may contain, then closes it - this generates the third event -
> which is the "synthetic change" event (as a result of the inotify WATCH
> rule). And this one would drop the "not initialized" flag in udev db and
> the scans in udev are now enabled.

Nope, it would be incorrect for mdadm to clear any old content.
Sometimes people want to ignore old content.  Sometimes they very
definitely want to use it.  It would be wrong for any code to try to
guess what is wanted.


>
> So we should be able to handle all three kinds of events I think.
>
> Now, as for even better synthetic event recognition, I've proposed a
> patch recently where udev as well as any other tool generating these
> synthetic events can add variables for more detailed event
> identification, so this one should help us in the future even more in
> these situations: https://lkml.org/lkml/2017/3/15/461. With this, we can
> even disable the WATCH rule till the device is properly initialized and
> the tool can generate the event itself by writing the
> /sys/block/.../uevent file with a variable that the tool can then wait
> for even (so the tool doesn't exit till the device is not properly
> initialized). Once this initialization is all done, the WATCH rule can
> be enabled for the dev. Also, with this, we don't need to be afraid that
> some other tool fired the WATCH rule by chance if it opened the dev for
> RW and closed it by mistake before we had a chance to initialize it
> (which would fire the synthetic change event before the
> wiping/initialization).

It sounds to me like you are adding a lot of complexity to an
already-fragile system.  I'm not filled with confidence - sorry.

Thanks,
NeilBrown


>
>> 
>> I don't think that making a file appear in /run is really very different
>> from making a variable appear in a uevent.   If the variable were
>> describing the event itself there would be a different, but it would be
>> describing the state of the device.  So the only important difference is
>> "which is easier to work with".  I think creating an deleting a file is
>> easier to setting and clearing a variable.
>> 
>
> Yes, I agree that it's an alternative solution - it definitely and
> surely improves current situation, either if we choose to write the file
> or pass the flag in the uevent directly. It's just that with the
> information written in filesystem, we have something external to check
> for in addition to processing the uevent variables while we don't need
> to, I think. As I described the sequence of events above, I think we
> should be able to recognize the events properly and we should be able to
> drop the flag automatically.
>
> -- 
> Peter
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-05-01  4:35             ` [dm-devel] " NeilBrown
@ 2017-05-02 11:40               ` Peter Rajnoha
  2017-05-02 13:40                 ` Jes Sorensen
  2017-05-02 21:42                 ` NeilBrown
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-02 11:40 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid, dm-devel

On 05/01/2017 06:35 AM, NeilBrown wrote:
> On Fri, Apr 28 2017, Peter Rajnoha wrote:
> 
>> On 04/28/2017 05:55 AM, NeilBrown wrote:
>>> On Wed, Apr 26 2017, Peter Rajnoha wrote:
>>>
>>>> On 04/20/2017 11:35 PM, NeilBrown wrote:
>>>>> If we wanted an more permanent udev rule, we would need to record the
>>>>> devices that should be ignored in the filesystem somewhere else.
>>>>> Maybe in /run/mdadm.
>>>>> e.g.
>>>>>
>>>>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>>>>>
>>>>> Then we could have something like the following (untested) in mdadm.
>>>>> Does that seem more suitable?
>>>>>
>>>>
>>>> Yes, please, if possible, go for a permanent udev rule surely - this
>>>> will make it much easier for other foreign tools to hook in properly if
>>>> needed and it would also be much easier to debug.
>>>
>>> I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
>>> proper patch.
>>>
>>>>
>>>> But, wouldn't it be better if we could just pass this information ("not
>>>> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
>>>> driver in kernel could add the variable to the uevent it generates which
>>>> userspace udev rules could check for easily. This way, we don't need to
>>>> hassle with creating files in filesystem and the information would be
>>>> directly a part of the uevent the md kernel driver generates (so
>>>> directly accessible in udev rules too). Also, possibly adding more
>>>> variables for other future scenarios if needed.
>>>
>>> When would we clear the "not initialised yet" flag in the kernel, and
>>> how?  And would that be enough.
>>
>> The flag wouldn't be stored in kernel, md kernel driver would just pass
>> the flag with the uevent as it received in with ioctl/sysfs request to
>> create a new dev. The udev in userspace would handle the state
>> transition then from "flagged as not-initialized" state to "initilized"
>> by checking the sequence of events as they come.
>>
>> We should have this sequence I assume:
>>
>>   1) "add" (creating dev, not usable yet)
>>   2) "change" (activated dev, but not initialized yet)
>>   3) "synthetic change" (after wiping the dev and closing it, the WATCH
>> rule fires)
>>
> 
> "Should" is arguable, but there are no guarantees of this sequence.
> 
> A particular cause of irregular sequencing is when an array is assembled
> by the initrd, then after the real root is mounted, something runs
>   udevadm trigger --action=add
> (e.g. systemd-udev-triggger.service).
> 
> In that case, 'add' is the first and only message that the
> full-root-available udev sees, so it is not safe to ignore the array as
> "not usable yet".
> 
> 

As for initrd case, there's OPTIONS+="db_persist" rule which is already
used by dracut to keep udev db records when switching from initrd to
root fs and replacing the udev daemon instance (other devices which do
not have this option set will have udev db records cleared when switched
to root fs). Not sure why it's not documented in udev yet (I already
asked for that years ago), but the support is definitely there. So
there's already a way to keep these records and to not lose track of the
current state when switching from initrd. This makes it still possible
to detect whether the synthetic "add" (or "change") event is sent before
or after proper device initialization and to make it possible to react
on the trigger (with synthetic uevents generated) still while ignoring
the "add" event if that comes from the kernel before the device is
initialized. And even if we're switching from initrd to root fs and
restarting udev daemon.

>>>
>>> When mdadm creates an md array, at least 3 uevents get generated.
>>> The first is generated when the md device object is created, either by
>>> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
>>> The second is generated when the array is started and the content is
>>> visible.
>>> The third is generated when mdadm closes the file descriptor.  It opens
>>> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
>>> Because udev uses inotify to watch for a 'close for a writable file
>>> descriptor', this generates another event.
>>>
>>> We need to ensure that none of these cause udev to run anything that
>>> inspects the contents of the array.
>>> Of the three, only the second one is directly under the control of the
>>> md module, so only that one can add an environment variable.
>>>
>>> It might be possible to avoid the last one (by not opening for write),
>>> but I cannot see a way to avoid the first one.
>>
>> So the first event is the "add" event ("md device object created") - in
>> this case, the device is not yet usable anyway I suppose, so we can skip
>> udev scans for this event right away (unless it's the synthetic "add"
>> event which is generated by writing "add" to /sys/block/.../uevent file
>> or alternatively using udevadm trigger - but we should be able to
>> recognize this event "synthetic add event" because it always comes after
>> the activating "change" event, otherwise we can skip scans).
> 
> "not yet usable anyway" is not reliable.  It is very easy for mdadm to
> finish making the array usable before udev gets around to processing the
> initial "add" event.

But in that case, there's certainly going to be a "change" event right
after. We shouldn't skip the "add" completely and pretend it's the
"change" now - there are rules for "add" events and rules for "change"
events. The fact that things can change underneath while the event is
being processed is something we need to count with in udev environment.
The rules for "add" and "change" can be different. Also, various udev
listeners/event subscribers can rely on this - otherwise we can't come
up with proper state machine to check the sequence of events.

> 
> You seem to be suggesting that udev rules should try to reverse-engineer
> the sequence of events and deduce what is meant from that sequence.  I
> doubt that would be very robust.
> 
> 

I wouldn't say reverse-engineering. The rules expect certain sequence
during device setup and initialization. And if the events are coming out
of order, we can at least detect something is wrong. If we have "add"
for device addition and then "change" to complete the device setup
(which is happening for dm and md devices), this rule is static and
stable. As for the out-of-band synthetic uevents (the udevadm trigger or
echo add/change > /sys/.../uevent, the WATCH rule...), this is different
- the event has its source in userspace which is causing it, not the
kernel driver directly. Synthetic uevents are separate - we can
recognize them, but that kernel patch I mentioned will help us here even
more to make the rules easier to detect this.

>>
>> The second event, which is the "change" event, would be marked with the
>> new "not initialized" flag. And so we skip the scans in udev too.
> 
> (one obvious problem with this approach is that it cannot work with old
> kernels, while my approach only requires an update to mdadm)
> 

The fact we need to update the kernel is true, but does it really
matter? Such kernel patch is going to be small and easy to backport if
needed for distributions with older kernels.

This is all about providing the straightforward way of detecting the
uevents - if we have only ADD and CHANGE while certain subsystem needs
more to differentiate uevents further (like md and dm which can't simply
map its events to those 2 events only), adding a variable to the event
to make this difference clear is the most straightforward and clean way
to do that I think. We don't need to rely on anything outside udev
environment itself - it's directly a part of it as a variable in uevent
which is already happening with certain events - e.g. change with with
DISK_RO=1 set and there are lots of other existing examples...

(Another way would be to introduce completely new uevents besides
add/change/remove. That's also possible, but harder-to-deploy solution.)

>>
>> Then mdadm opens the devive, clears any old content/signatures the data
>> area may contain, then closes it - this generates the third event -
>> which is the "synthetic change" event (as a result of the inotify WATCH
>> rule). And this one would drop the "not initialized" flag in udev db and
>> the scans in udev are now enabled.
> 
> Nope, it would be incorrect for mdadm to clear any old content.
> Sometimes people want to ignore old content.  Sometimes they very
> definitely want to use it.  It would be wrong for any code to try to
> guess what is wanted.
> 
> 

The mdadm is not going to guess - it can have an option to
enable/disable the wiping on demand directly on command line (which is
also what is actually done in LVM).

Otherwise, if mdadm is not going to wipe/initialize the device itself,
then it needs to wait for the external tool to do it (based on external
configuration or on some manual wipefs-like call). So the sequence would be:

 1) mdadm creating the device
 2) mdadm setting up the device, marking it as "not initialized yet"
 4a) mdadm waiting for the external decision to be made about wiping part
 4b) external tool doing the wiping (or not) based on configuration or
user's will
 5) mdadm continuing and finishing when the wiping part is complete

I expect mdadm to return only if the device is properly initialized
otherwise it's harder for other tools called after mdadm to start
working with the device - they need to poll for the state laboriously
and check for readiness.

>>
>> So we should be able to handle all three kinds of events I think.
>>
>> Now, as for even better synthetic event recognition, I've proposed a
>> patch recently where udev as well as any other tool generating these
>> synthetic events can add variables for more detailed event
>> identification, so this one should help us in the future even more in
>> these situations: https://lkml.org/lkml/2017/3/15/461. With this, we can
>> even disable the WATCH rule till the device is properly initialized and
>> the tool can generate the event itself by writing the
>> /sys/block/.../uevent file with a variable that the tool can then wait
>> for even (so the tool doesn't exit till the device is not properly
>> initialized). Once this initialization is all done, the WATCH rule can
>> be enabled for the dev. Also, with this, we don't need to be afraid that
>> some other tool fired the WATCH rule by chance if it opened the dev for
>> RW and closed it by mistake before we had a chance to initialize it
>> (which would fire the synthetic change event before the
>> wiping/initialization).
> 
> It sounds to me like you are adding a lot of complexity to an
> already-fragile system.  I'm not filled with confidence - sorry.
> 

It would be great if we agree on some standard here. A variable directly
in the uevent is one of such standards (of course, then we need to agree
on variable naming and their meaning - but that's what I'm heading to).

Once the kernel patch is in, I'll propose a patch for the udev daemon to
add variable to detect the events coming from WATCH rule, the events
coming from the trigger etc. So we can make a proper difference and
rules can react to the events the most suitable way. But important here
is that it's going to be marked in standard way. This, I believe,
removes the fragility of the system where each subsystem is doing it's
own magic to handle these events in non-standard ways... or saying it's
impossible to detect and make a difference among events at all and hence
giving up.

I'm not saying that using out-of-band information channels like custom
file somewhere in filesystem doesn't do its job, but it's surely
creating an exception. Creating rules with such exceptions can cause
situations which are harder to debug and harder to follow correctly if
we're tracking whole stacks with different types of devices.

In addition to that, with the possibility to add extra variables in
synthesized uevents we're also able to wait for the event (via udev
monitor) and that way, we can wait for the udev rules to be applied
before continuing further with the tool that caused the synthesized
uevent. Then, we can avoid code where we wait for some random time, for
example (from the suggested patch):

+	/* Give udev a moment to process the Change event caused
+	 * by the close.
+	 */
+	usleep(100*1000);
+	udev_unblock();

In this case, for example, we can generate an event by writing to the
/sys/.../uevent file instead of relying on the WATCH rule and then wait
for that properly since we can set a variable which we recognize when it
gets back to us via uevent monitor.

So it's not adding that much complexity. Compare that to the time we
spent on debugging these things when the timing doesn't happen to be ideal.

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-04-28  9:28           ` Peter Rajnoha
@ 2017-05-02 13:32             ` Jes Sorensen
  2017-05-03 14:13               ` Peter Rajnoha
  0 siblings, 1 reply; 35+ messages in thread
From: Jes Sorensen @ 2017-05-02 13:32 UTC (permalink / raw)
  To: Peter Rajnoha, NeilBrown; +Cc: linux-raid, dm-devel

On 04/28/2017 05:28 AM, Peter Rajnoha wrote:
> On 04/28/2017 07:05 AM, NeilBrown wrote:
>>
>> When an array is created the content is not initialized,
>> so it could have remnants of an old filesystem or md array
>> etc on it.
>> udev will see this and might try to activate it, which is almost
>> certainly not what is wanted.
>>
>> So create a mechanism for mdadm to communicate with udev to tell
>> it that the device isn't ready.  This mechanism is the existance
>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>
>> When creating an array, mdadm will create the file.
>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>> precense of thst file and set ENV{SYSTEMD_READY}="0".
>> This is fairly uniformly used to suppress actions based on the
>> contents of the device.
>
> The scans in udev are primarily directed by blkid call which detects the
> signatures and based on this information various other udev rules fire.
>
> The blkid as well as wipefs uses common libblkid library to detect these
> signatures - is mdadm going to use libblkid to wipe the signatures on MD
> device initialization or is it relying on external tools to do this? How
> is mdadm actually initializing/wiping the newly created MD device?

mdadm doesn't wipe data and it isn't supposed to. Being able to create 
an array from drives with existing data is a feature.

It is the responsibility of the system administrator to handle drives 
with existing data, in the same way the administrator is expected to 
handle insertion of USB drives or external drives being powered on.

Jes


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-05-02 11:40               ` Peter Rajnoha
@ 2017-05-02 13:40                 ` Jes Sorensen
  2017-05-03 14:27                   ` Peter Rajnoha
  2017-05-02 21:42                 ` NeilBrown
  1 sibling, 1 reply; 35+ messages in thread
From: Jes Sorensen @ 2017-05-02 13:40 UTC (permalink / raw)
  To: Peter Rajnoha, NeilBrown; +Cc: linux-raid, dm-devel

On 05/02/2017 07:40 AM, Peter Rajnoha wrote:
> On 05/01/2017 06:35 AM, NeilBrown wrote:
>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>>> Then mdadm opens the devive, clears any old content/signatures the data
>>> area may contain, then closes it - this generates the third event -
>>> which is the "synthetic change" event (as a result of the inotify WATCH
>>> rule). And this one would drop the "not initialized" flag in udev db and
>>> the scans in udev are now enabled.
>>
>> Nope, it would be incorrect for mdadm to clear any old content.
>> Sometimes people want to ignore old content.  Sometimes they very
>> definitely want to use it.  It would be wrong for any code to try to
>> guess what is wanted.
>
> The mdadm is not going to guess - it can have an option to
> enable/disable the wiping on demand directly on command line (which is
> also what is actually done in LVM).

I know the anaconda team keeps pushing for the nonsense of being able to 
wipe drives on creation. It is wrong, it is broken, and it is not going 
to happen.

> Otherwise, if mdadm is not going to wipe/initialize the device itself,
> then it needs to wait for the external tool to do it (based on external
> configuration or on some manual wipefs-like call). So the sequence would be:
>
>  1) mdadm creating the device
>  2) mdadm setting up the device, marking it as "not initialized yet"
>  4a) mdadm waiting for the external decision to be made about wiping part
>  4b) external tool doing the wiping (or not) based on configuration or
> user's will
>  5) mdadm continuing and finishing when the wiping part is complete
>
> I expect mdadm to return only if the device is properly initialized
> otherwise it's harder for other tools called after mdadm to start
> working with the device - they need to poll for the state laboriously
> and check for readiness.

What defines readiness? Some believe a raid1 array must be fully 
assembled with all members, other setups are happy to have one running 
drive in place.....

4a/4b in your list here once again has no place in mdadm. Please kindly 
tell the anaconda team to go back and handle this properly instead.

Jes


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-04-28  5:05         ` [mdadm PATCH] Create: tell udev md " NeilBrown
  2017-04-28  9:28           ` Peter Rajnoha
@ 2017-05-02 13:42           ` Jes Sorensen
  2017-05-03 14:32             ` Peter Rajnoha
  2017-05-04 10:58           ` Peter Rajnoha
  2 siblings, 1 reply; 35+ messages in thread
From: Jes Sorensen @ 2017-05-02 13:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: Peter Rajnoha, linux-raid, dm-devel

On 04/28/2017 01:05 AM, NeilBrown wrote:
>
> When an array is created the content is not initialized,
> so it could have remnants of an old filesystem or md array
> etc on it.
> udev will see this and might try to activate it, which is almost
> certainly not what is wanted.
>
> So create a mechanism for mdadm to communicate with udev to tell
> it that the device isn't ready.  This mechanism is the existance
> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>
> When creating an array, mdadm will create the file.
> A new udev rule file, 01-md-raid-creating.rules, will detect the
> precense of thst file and set ENV{SYSTEMD_READY}="0".
> This is fairly uniformly used to suppress actions based on the
> contents of the device.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  Assemble.c                  |  2 +-
>  Build.c                     |  2 +-
>  Create.c                    |  9 +++++++-
>  Incremental.c               |  4 ++--
>  Makefile                    |  4 ++--
>  lib.c                       | 29 +++++++++++++++++++++++++
>  mdadm.h                     |  4 +++-
>  mdopen.c                    | 52 ++++++++++++++++++++++++++++-----------------
>  udev-md-raid-creating.rules |  7 ++++++
>  9 files changed, 86 insertions(+), 27 deletions(-)
>  create mode 100644 udev-md-raid-creating.rules

Applied!

I like this solution much better, even though I am not in love with the 
giant usleep() call. Would be nice to find a better way around that.

Sorry it took so long to get back to you on this, last week was a mess.

Thanks,
Jes



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-05-02 11:40               ` Peter Rajnoha
  2017-05-02 13:40                 ` Jes Sorensen
@ 2017-05-02 21:42                 ` NeilBrown
  1 sibling, 0 replies; 35+ messages in thread
From: NeilBrown @ 2017-05-02 21:42 UTC (permalink / raw)
  To: Peter Rajnoha, Jes Sorensen; +Cc: linux-raid, dm-devel

[-- Attachment #1: Type: text/plain, Size: 13902 bytes --]


I'm sorry, but I didn't read all your words.

You seemed to be telling me about extra complexity in udev, and extra
complexity that you think belongs in mdadm, which together might achieve
your vision for how things should work.

But to me, complexity is the enemy.  Give me "simple" any day.

NeilBrown


On Tue, May 02 2017, Peter Rajnoha wrote:

> On 05/01/2017 06:35 AM, NeilBrown wrote:
>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>> 
>>> On 04/28/2017 05:55 AM, NeilBrown wrote:
>>>> On Wed, Apr 26 2017, Peter Rajnoha wrote:
>>>>
>>>>> On 04/20/2017 11:35 PM, NeilBrown wrote:
>>>>>> If we wanted an more permanent udev rule, we would need to record the
>>>>>> devices that should be ignored in the filesystem somewhere else.
>>>>>> Maybe in /run/mdadm.
>>>>>> e.g.
>>>>>>
>>>>>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>>>>>>
>>>>>> Then we could have something like the following (untested) in mdadm.
>>>>>> Does that seem more suitable?
>>>>>>
>>>>>
>>>>> Yes, please, if possible, go for a permanent udev rule surely - this
>>>>> will make it much easier for other foreign tools to hook in properly if
>>>>> needed and it would also be much easier to debug.
>>>>
>>>> I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
>>>> proper patch.
>>>>
>>>>>
>>>>> But, wouldn't it be better if we could just pass this information ("not
>>>>> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
>>>>> driver in kernel could add the variable to the uevent it generates which
>>>>> userspace udev rules could check for easily. This way, we don't need to
>>>>> hassle with creating files in filesystem and the information would be
>>>>> directly a part of the uevent the md kernel driver generates (so
>>>>> directly accessible in udev rules too). Also, possibly adding more
>>>>> variables for other future scenarios if needed.
>>>>
>>>> When would we clear the "not initialised yet" flag in the kernel, and
>>>> how?  And would that be enough.
>>>
>>> The flag wouldn't be stored in kernel, md kernel driver would just pass
>>> the flag with the uevent as it received in with ioctl/sysfs request to
>>> create a new dev. The udev in userspace would handle the state
>>> transition then from "flagged as not-initialized" state to "initilized"
>>> by checking the sequence of events as they come.
>>>
>>> We should have this sequence I assume:
>>>
>>>   1) "add" (creating dev, not usable yet)
>>>   2) "change" (activated dev, but not initialized yet)
>>>   3) "synthetic change" (after wiping the dev and closing it, the WATCH
>>> rule fires)
>>>
>> 
>> "Should" is arguable, but there are no guarantees of this sequence.
>> 
>> A particular cause of irregular sequencing is when an array is assembled
>> by the initrd, then after the real root is mounted, something runs
>>   udevadm trigger --action=add
>> (e.g. systemd-udev-triggger.service).
>> 
>> In that case, 'add' is the first and only message that the
>> full-root-available udev sees, so it is not safe to ignore the array as
>> "not usable yet".
>> 
>> 
>
> As for initrd case, there's OPTIONS+="db_persist" rule which is already
> used by dracut to keep udev db records when switching from initrd to
> root fs and replacing the udev daemon instance (other devices which do
> not have this option set will have udev db records cleared when switched
> to root fs). Not sure why it's not documented in udev yet (I already
> asked for that years ago), but the support is definitely there. So
> there's already a way to keep these records and to not lose track of the
> current state when switching from initrd. This makes it still possible
> to detect whether the synthetic "add" (or "change") event is sent before
> or after proper device initialization and to make it possible to react
> on the trigger (with synthetic uevents generated) still while ignoring
> the "add" event if that comes from the kernel before the device is
> initialized. And even if we're switching from initrd to root fs and
> restarting udev daemon.
>
>>>>
>>>> When mdadm creates an md array, at least 3 uevents get generated.
>>>> The first is generated when the md device object is created, either by
>>>> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
>>>> The second is generated when the array is started and the content is
>>>> visible.
>>>> The third is generated when mdadm closes the file descriptor.  It opens
>>>> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
>>>> Because udev uses inotify to watch for a 'close for a writable file
>>>> descriptor', this generates another event.
>>>>
>>>> We need to ensure that none of these cause udev to run anything that
>>>> inspects the contents of the array.
>>>> Of the three, only the second one is directly under the control of the
>>>> md module, so only that one can add an environment variable.
>>>>
>>>> It might be possible to avoid the last one (by not opening for write),
>>>> but I cannot see a way to avoid the first one.
>>>
>>> So the first event is the "add" event ("md device object created") - in
>>> this case, the device is not yet usable anyway I suppose, so we can skip
>>> udev scans for this event right away (unless it's the synthetic "add"
>>> event which is generated by writing "add" to /sys/block/.../uevent file
>>> or alternatively using udevadm trigger - but we should be able to
>>> recognize this event "synthetic add event" because it always comes after
>>> the activating "change" event, otherwise we can skip scans).
>> 
>> "not yet usable anyway" is not reliable.  It is very easy for mdadm to
>> finish making the array usable before udev gets around to processing the
>> initial "add" event.
>
> But in that case, there's certainly going to be a "change" event right
> after. We shouldn't skip the "add" completely and pretend it's the
> "change" now - there are rules for "add" events and rules for "change"
> events. The fact that things can change underneath while the event is
> being processed is something we need to count with in udev environment.
> The rules for "add" and "change" can be different. Also, various udev
> listeners/event subscribers can rely on this - otherwise we can't come
> up with proper state machine to check the sequence of events.
>
>> 
>> You seem to be suggesting that udev rules should try to reverse-engineer
>> the sequence of events and deduce what is meant from that sequence.  I
>> doubt that would be very robust.
>> 
>> 
>
> I wouldn't say reverse-engineering. The rules expect certain sequence
> during device setup and initialization. And if the events are coming out
> of order, we can at least detect something is wrong. If we have "add"
> for device addition and then "change" to complete the device setup
> (which is happening for dm and md devices), this rule is static and
> stable. As for the out-of-band synthetic uevents (the udevadm trigger or
> echo add/change > /sys/.../uevent, the WATCH rule...), this is different
> - the event has its source in userspace which is causing it, not the
> kernel driver directly. Synthetic uevents are separate - we can
> recognize them, but that kernel patch I mentioned will help us here even
> more to make the rules easier to detect this.
>
>>>
>>> The second event, which is the "change" event, would be marked with the
>>> new "not initialized" flag. And so we skip the scans in udev too.
>> 
>> (one obvious problem with this approach is that it cannot work with old
>> kernels, while my approach only requires an update to mdadm)
>> 
>
> The fact we need to update the kernel is true, but does it really
> matter? Such kernel patch is going to be small and easy to backport if
> needed for distributions with older kernels.
>
> This is all about providing the straightforward way of detecting the
> uevents - if we have only ADD and CHANGE while certain subsystem needs
> more to differentiate uevents further (like md and dm which can't simply
> map its events to those 2 events only), adding a variable to the event
> to make this difference clear is the most straightforward and clean way
> to do that I think. We don't need to rely on anything outside udev
> environment itself - it's directly a part of it as a variable in uevent
> which is already happening with certain events - e.g. change with with
> DISK_RO=1 set and there are lots of other existing examples...
>
> (Another way would be to introduce completely new uevents besides
> add/change/remove. That's also possible, but harder-to-deploy solution.)
>
>>>
>>> Then mdadm opens the devive, clears any old content/signatures the data
>>> area may contain, then closes it - this generates the third event -
>>> which is the "synthetic change" event (as a result of the inotify WATCH
>>> rule). And this one would drop the "not initialized" flag in udev db and
>>> the scans in udev are now enabled.
>> 
>> Nope, it would be incorrect for mdadm to clear any old content.
>> Sometimes people want to ignore old content.  Sometimes they very
>> definitely want to use it.  It would be wrong for any code to try to
>> guess what is wanted.
>> 
>> 
>
> The mdadm is not going to guess - it can have an option to
> enable/disable the wiping on demand directly on command line (which is
> also what is actually done in LVM).
>
> Otherwise, if mdadm is not going to wipe/initialize the device itself,
> then it needs to wait for the external tool to do it (based on external
> configuration or on some manual wipefs-like call). So the sequence would be:
>
>  1) mdadm creating the device
>  2) mdadm setting up the device, marking it as "not initialized yet"
>  4a) mdadm waiting for the external decision to be made about wiping part
>  4b) external tool doing the wiping (or not) based on configuration or
> user's will
>  5) mdadm continuing and finishing when the wiping part is complete
>
> I expect mdadm to return only if the device is properly initialized
> otherwise it's harder for other tools called after mdadm to start
> working with the device - they need to poll for the state laboriously
> and check for readiness.
>
>>>
>>> So we should be able to handle all three kinds of events I think.
>>>
>>> Now, as for even better synthetic event recognition, I've proposed a
>>> patch recently where udev as well as any other tool generating these
>>> synthetic events can add variables for more detailed event
>>> identification, so this one should help us in the future even more in
>>> these situations: https://lkml.org/lkml/2017/3/15/461. With this, we can
>>> even disable the WATCH rule till the device is properly initialized and
>>> the tool can generate the event itself by writing the
>>> /sys/block/.../uevent file with a variable that the tool can then wait
>>> for even (so the tool doesn't exit till the device is not properly
>>> initialized). Once this initialization is all done, the WATCH rule can
>>> be enabled for the dev. Also, with this, we don't need to be afraid that
>>> some other tool fired the WATCH rule by chance if it opened the dev for
>>> RW and closed it by mistake before we had a chance to initialize it
>>> (which would fire the synthetic change event before the
>>> wiping/initialization).
>> 
>> It sounds to me like you are adding a lot of complexity to an
>> already-fragile system.  I'm not filled with confidence - sorry.
>> 
>
> It would be great if we agree on some standard here. A variable directly
> in the uevent is one of such standards (of course, then we need to agree
> on variable naming and their meaning - but that's what I'm heading to).
>
> Once the kernel patch is in, I'll propose a patch for the udev daemon to
> add variable to detect the events coming from WATCH rule, the events
> coming from the trigger etc. So we can make a proper difference and
> rules can react to the events the most suitable way. But important here
> is that it's going to be marked in standard way. This, I believe,
> removes the fragility of the system where each subsystem is doing it's
> own magic to handle these events in non-standard ways... or saying it's
> impossible to detect and make a difference among events at all and hence
> giving up.
>
> I'm not saying that using out-of-band information channels like custom
> file somewhere in filesystem doesn't do its job, but it's surely
> creating an exception. Creating rules with such exceptions can cause
> situations which are harder to debug and harder to follow correctly if
> we're tracking whole stacks with different types of devices.
>
> In addition to that, with the possibility to add extra variables in
> synthesized uevents we're also able to wait for the event (via udev
> monitor) and that way, we can wait for the udev rules to be applied
> before continuing further with the tool that caused the synthesized
> uevent. Then, we can avoid code where we wait for some random time, for
> example (from the suggested patch):
>
> +	/* Give udev a moment to process the Change event caused
> +	 * by the close.
> +	 */
> +	usleep(100*1000);
> +	udev_unblock();
>
> In this case, for example, we can generate an event by writing to the
> /sys/.../uevent file instead of relying on the WATCH rule and then wait
> for that properly since we can set a variable which we recognize when it
> gets back to us via uevent monitor.
>
> So it's not adding that much complexity. Compare that to the time we
> spent on debugging these things when the timing doesn't happen to be ideal.
>
> -- 
> Peter
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-02 13:32             ` Jes Sorensen
@ 2017-05-03 14:13               ` Peter Rajnoha
  2017-05-03 14:44                 ` Jes Sorensen
  2017-05-06 16:25                 ` Wols Lists
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-03 14:13 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 05/02/2017 03:32 PM, Jes Sorensen wrote:
> On 04/28/2017 05:28 AM, Peter Rajnoha wrote:
>> On 04/28/2017 07:05 AM, NeilBrown wrote:
>>>
>>> When an array is created the content is not initialized,
>>> so it could have remnants of an old filesystem or md array
>>> etc on it.
>>> udev will see this and might try to activate it, which is almost
>>> certainly not what is wanted.
>>>
>>> So create a mechanism for mdadm to communicate with udev to tell
>>> it that the device isn't ready.  This mechanism is the existance
>>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>>
>>> When creating an array, mdadm will create the file.
>>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>>> precense of thst file and set ENV{SYSTEMD_READY}="0".
>>> This is fairly uniformly used to suppress actions based on the
>>> contents of the device.
>>
>> The scans in udev are primarily directed by blkid call which detects the
>> signatures and based on this information various other udev rules fire.
>>
>> The blkid as well as wipefs uses common libblkid library to detect these
>> signatures - is mdadm going to use libblkid to wipe the signatures on MD
>> device initialization or is it relying on external tools to do this? How
>> is mdadm actually initializing/wiping the newly created MD device?
> 
> mdadm doesn't wipe data and it isn't supposed to. Being able to create
> an array from drives with existing data is a feature.
> 
> It is the responsibility of the system administrator to handle drives
> with existing data, in the same way the administrator is expected to
> handle insertion of USB drives or external drives being powered on.

There's a difference though - when you're *creating* a completely new
device that is an abstraction over existing devices, you (most of the
time) expect that new device to be initialized. For those corner cases
where people do need to keep the old data, there can be an option to do
that. When you're inserting existing drives, you're not creating them -
when those device come from factory (they're "created"), they never
contain garbage and old data when you buy them.

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-05-02 13:40                 ` Jes Sorensen
@ 2017-05-03 14:27                   ` Peter Rajnoha
  2017-05-03 14:41                     ` Jes Sorensen
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-03 14:27 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 05/02/2017 03:40 PM, Jes Sorensen wrote:
> On 05/02/2017 07:40 AM, Peter Rajnoha wrote:
>> On 05/01/2017 06:35 AM, NeilBrown wrote:
>>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>>>> Then mdadm opens the devive, clears any old content/signatures the data
>>>> area may contain, then closes it - this generates the third event -
>>>> which is the "synthetic change" event (as a result of the inotify WATCH
>>>> rule). And this one would drop the "not initialized" flag in udev db
>>>> and
>>>> the scans in udev are now enabled.
>>>
>>> Nope, it would be incorrect for mdadm to clear any old content.
>>> Sometimes people want to ignore old content.  Sometimes they very
>>> definitely want to use it.  It would be wrong for any code to try to
>>> guess what is wanted.
>>
>> The mdadm is not going to guess - it can have an option to
>> enable/disable the wiping on demand directly on command line (which is
>> also what is actually done in LVM).
> 
> I know the anaconda team keeps pushing for the nonsense of being able to
> wipe drives on creation. It is wrong, it is broken, and it is not going
> to happen.
> 

I'm not thinking about anaconda at the moment at all. It's just one of
the many users of mdadm. I'm thinking about a fix in general for all the
users which expect the device to be initialized properly when mdadm
--create returns.

>> Otherwise, if mdadm is not going to wipe/initialize the device itself,
>> then it needs to wait for the external tool to do it (based on external
>> configuration or on some manual wipefs-like call). So the sequence
>> would be:
>>
>>  1) mdadm creating the device
>>  2) mdadm setting up the device, marking it as "not initialized yet"
>>  4a) mdadm waiting for the external decision to be made about wiping part
>>  4b) external tool doing the wiping (or not) based on configuration or
>> user's will
>>  5) mdadm continuing and finishing when the wiping part is complete
>>
>> I expect mdadm to return only if the device is properly initialized
>> otherwise it's harder for other tools called after mdadm to start
>> working with the device - they need to poll for the state laboriously
>> and check for readiness.
> 
> What defines readiness? Some believe a raid1 array must be fully
> assembled with all members, other setups are happy to have one running
> drive in place.....
> 

With "ready" I mean the time when it's safe to do a scan without seeing
old data (garbage) that may confuse udev hooks and udev event listeners.
That scan is done at some time - at the time the activating "change"
uevent comes and this rule does not pass
ATTR{md/array_state}=="|clear|inactive" (if it passes, the device is not
scanned yet).

> 4a/4b in your list here once again has no place in mdadm. Please kindly
> tell the anaconda team to go back and handle this properly instead.

The mdadm is creating the dev and so it should be responsible primarily
for providing a device which is cleared and ready for use without
causing any confusion on event-based system where various scans are
executed based on incoming udev events.

Alternatively, if mdadm is not going to be the place where the wiping
happens, I'd expect at least the sequence above (which is more complex,
yes, that's why I think having the wiping directly in mdadm is much
easier solution).

If you don't wipe the data and you don't give time for others to hook in
to do that, you make it harder for others (they need to deactivate all
the stack/garbage that is found in the data area after previous use).

Also, we can't reliably call wiping on the underlying components first,
because once they become MD components, the data are for the MD device
has an offset and new data range is revealed from the underlying devices
which may expose old signatures which were not visible on those
underlying devices before the MD device got created.

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-02 13:42           ` Jes Sorensen
@ 2017-05-03 14:32             ` Peter Rajnoha
  2017-05-03 14:45               ` [dm-devel] " Jes Sorensen
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-03 14:32 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 05/02/2017 03:42 PM, Jes Sorensen wrote:
> On 04/28/2017 01:05 AM, NeilBrown wrote:
>>
>> When an array is created the content is not initialized,
>> so it could have remnants of an old filesystem or md array
>> etc on it.
>> udev will see this and might try to activate it, which is almost
>> certainly not what is wanted.
>>
>> So create a mechanism for mdadm to communicate with udev to tell
>> it that the device isn't ready.  This mechanism is the existance
>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>
>> When creating an array, mdadm will create the file.
>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>> precense of thst file and set ENV{SYSTEMD_READY}="0".
>> This is fairly uniformly used to suppress actions based on the
>> contents of the device.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  Assemble.c                  |  2 +-
>>  Build.c                     |  2 +-
>>  Create.c                    |  9 +++++++-
>>  Incremental.c               |  4 ++--
>>  Makefile                    |  4 ++--
>>  lib.c                       | 29 +++++++++++++++++++++++++
>>  mdadm.h                     |  4 +++-
>>  mdopen.c                    | 52
>> ++++++++++++++++++++++++++++-----------------
>>  udev-md-raid-creating.rules |  7 ++++++
>>  9 files changed, 86 insertions(+), 27 deletions(-)
>>  create mode 100644 udev-md-raid-creating.rules
> 
> Applied!
> 

How is this solving the problem then? The patch is flagging a device as
not ready, then clearing the flag after some time. Where does the wiping
happen actually?

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.
  2017-05-03 14:27                   ` Peter Rajnoha
@ 2017-05-03 14:41                     ` Jes Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-05-03 14:41 UTC (permalink / raw)
  To: Peter Rajnoha, NeilBrown; +Cc: linux-raid, dm-devel

On 05/03/2017 10:27 AM, Peter Rajnoha wrote:
> On 05/02/2017 03:40 PM, Jes Sorensen wrote:
>> On 05/02/2017 07:40 AM, Peter Rajnoha wrote:
>>> On 05/01/2017 06:35 AM, NeilBrown wrote:
>>>> On Fri, Apr 28 2017, Peter Rajnoha wrote:
>>>>> Then mdadm opens the devive, clears any old content/signatures the data
>>>>> area may contain, then closes it - this generates the third event -
>>>>> which is the "synthetic change" event (as a result of the inotify WATCH
>>>>> rule). And this one would drop the "not initialized" flag in udev db
>>>>> and
>>>>> the scans in udev are now enabled.
>>>>
>>>> Nope, it would be incorrect for mdadm to clear any old content.
>>>> Sometimes people want to ignore old content.  Sometimes they very
>>>> definitely want to use it.  It would be wrong for any code to try to
>>>> guess what is wanted.
>>>
>>> The mdadm is not going to guess - it can have an option to
>>> enable/disable the wiping on demand directly on command line (which is
>>> also what is actually done in LVM).
>>
>> I know the anaconda team keeps pushing for the nonsense of being able to
>> wipe drives on creation. It is wrong, it is broken, and it is not going
>> to happen.
>
> I'm not thinking about anaconda at the moment at all. It's just one of
> the many users of mdadm. I'm thinking about a fix in general for all the
> users which expect the device to be initialized properly when mdadm
> --create returns.

Well mdadm --create already initializes the array properly, so problem 
solved.

I saw this request repeatedly when I worked at Red Hat, and I got it 
again from my replacement who had been asked to do the same dirty work 
that I refused to do. Now we see the same request from you, so it is 
rather easy to assume it came from there again.

>>> Otherwise, if mdadm is not going to wipe/initialize the device itself,
>>> then it needs to wait for the external tool to do it (based on external
>>> configuration or on some manual wipefs-like call). So the sequence
>>> would be:
>>>
>>>  1) mdadm creating the device
>>>  2) mdadm setting up the device, marking it as "not initialized yet"
>>>  4a) mdadm waiting for the external decision to be made about wiping part
>>>  4b) external tool doing the wiping (or not) based on configuration or
>>> user's will
>>>  5) mdadm continuing and finishing when the wiping part is complete
>>>
>>> I expect mdadm to return only if the device is properly initialized
>>> otherwise it's harder for other tools called after mdadm to start
>>> working with the device - they need to poll for the state laboriously
>>> and check for readiness.
>>
>> What defines readiness? Some believe a raid1 array must be fully
>> assembled with all members, other setups are happy to have one running
>> drive in place.....
>
> With "ready" I mean the time when it's safe to do a scan without seeing
> old data (garbage) that may confuse udev hooks and udev event listeners.
> That scan is done at some time - at the time the activating "change"
> uevent comes and this rule does not pass
> ATTR{md/array_state}=="|clear|inactive" (if it passes, the device is not
> scanned yet).

I am not sure what you mean by old data here?

>> 4a/4b in your list here once again has no place in mdadm. Please kindly
>> tell the anaconda team to go back and handle this properly instead.
>
> The mdadm is creating the dev and so it should be responsible primarily
> for providing a device which is cleared and ready for use without
> causing any confusion on event-based system where various scans are
> executed based on incoming udev events.

This is exactly what it does. There is no difference to how mdadm does 
it compared to how it works if you plugin a new USB drive which may have 
data on it or if you assemble an iSCSI device that has metadata on it. 
Are you also going to suggest the iSCSI tools have a wipe metadata mode 
added to them?

> Alternatively, if mdadm is not going to be the place where the wiping
> happens, I'd expect at least the sequence above (which is more complex,
> yes, that's why I think having the wiping directly in mdadm is much
> easier solution).
>
> If you don't wipe the data and you don't give time for others to hook in
> to do that, you make it harder for others (they need to deactivate all
> the stack/garbage that is found in the data area after previous use).
>
> Also, we can't reliably call wiping on the underlying components first,
> because once they become MD components, the data are for the MD device
> has an offset and new data range is revealed from the underlying devices
> which may expose old signatures which were not visible on those
> underlying devices before the MD device got created.

Yes there is a risk that creating an array that exactly matches old 
metadata despite those signatures haven gotten wiped. This is a feature, 
people use it to recover their data, and it is not going to be removed.

If you don't want this, disable automatic processing of udev events that 
may do something on top of it. This is what anaconda should be doing.

Jes


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-03 14:13               ` Peter Rajnoha
@ 2017-05-03 14:44                 ` Jes Sorensen
  2017-05-06 16:25                 ` Wols Lists
  1 sibling, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-05-03 14:44 UTC (permalink / raw)
  To: Peter Rajnoha, NeilBrown; +Cc: linux-raid, dm-devel

On 05/03/2017 10:13 AM, Peter Rajnoha wrote:
> On 05/02/2017 03:32 PM, Jes Sorensen wrote:
>> On 04/28/2017 05:28 AM, Peter Rajnoha wrote:
>>> On 04/28/2017 07:05 AM, NeilBrown wrote:
>>>>
>>>> When an array is created the content is not initialized,
>>>> so it could have remnants of an old filesystem or md array
>>>> etc on it.
>>>> udev will see this and might try to activate it, which is almost
>>>> certainly not what is wanted.
>>>>
>>>> So create a mechanism for mdadm to communicate with udev to tell
>>>> it that the device isn't ready.  This mechanism is the existance
>>>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>>>
>>>> When creating an array, mdadm will create the file.
>>>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>>>> precense of thst file and set ENV{SYSTEMD_READY}="0".
>>>> This is fairly uniformly used to suppress actions based on the
>>>> contents of the device.
>>>
>>> The scans in udev are primarily directed by blkid call which detects the
>>> signatures and based on this information various other udev rules fire.
>>>
>>> The blkid as well as wipefs uses common libblkid library to detect these
>>> signatures - is mdadm going to use libblkid to wipe the signatures on MD
>>> device initialization or is it relying on external tools to do this? How
>>> is mdadm actually initializing/wiping the newly created MD device?
>>
>> mdadm doesn't wipe data and it isn't supposed to. Being able to create
>> an array from drives with existing data is a feature.
>>
>> It is the responsibility of the system administrator to handle drives
>> with existing data, in the same way the administrator is expected to
>> handle insertion of USB drives or external drives being powered on.
>
> There's a difference though - when you're *creating* a completely new
> device that is an abstraction over existing devices, you (most of the
> time) expect that new device to be initialized. For those corner cases
> where people do need to keep the old data, there can be an option to do
> that. When you're inserting existing drives, you're not creating them -
> when those device come from factory (they're "created"), they never
> contain garbage and old data when you buy them.

Sorry, you are once again assuming *your* corner case to be the default 
that everyone else expects. mdadm users (except for you and anaconda) 
expect the current behavior.

Devices that come from the factory pretty much always contain old data, 
plugin a USB drive and it will be filled with windows backup crap.

Jes


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [dm-devel] [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-03 14:32             ` Peter Rajnoha
@ 2017-05-03 14:45               ` Jes Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-05-03 14:45 UTC (permalink / raw)
  To: Peter Rajnoha, NeilBrown; +Cc: linux-raid, dm-devel

On 05/03/2017 10:32 AM, Peter Rajnoha wrote:
> On 05/02/2017 03:42 PM, Jes Sorensen wrote:
>> On 04/28/2017 01:05 AM, NeilBrown wrote:
>>>
>>> When an array is created the content is not initialized,
>>> so it could have remnants of an old filesystem or md array
>>> etc on it.
>>> udev will see this and might try to activate it, which is almost
>>> certainly not what is wanted.
>>>
>>> So create a mechanism for mdadm to communicate with udev to tell
>>> it that the device isn't ready.  This mechanism is the existance
>>> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
>>>
>>> When creating an array, mdadm will create the file.
>>> A new udev rule file, 01-md-raid-creating.rules, will detect the
>>> precense of thst file and set ENV{SYSTEMD_READY}="0".
>>> This is fairly uniformly used to suppress actions based on the
>>> contents of the device.
>>>
>>> Signed-off-by: NeilBrown <neilb@suse.com>
>>> ---
>>>  Assemble.c                  |  2 +-
>>>  Build.c                     |  2 +-
>>>  Create.c                    |  9 +++++++-
>>>  Incremental.c               |  4 ++--
>>>  Makefile                    |  4 ++--
>>>  lib.c                       | 29 +++++++++++++++++++++++++
>>>  mdadm.h                     |  4 +++-
>>>  mdopen.c                    | 52
>>> ++++++++++++++++++++++++++++-----------------
>>>  udev-md-raid-creating.rules |  7 ++++++
>>>  9 files changed, 86 insertions(+), 27 deletions(-)
>>>  create mode 100644 udev-md-raid-creating.rules
>>
>> Applied!
>>
>
> How is this solving the problem then? The patch is flagging a device as
> not ready, then clearing the flag after some time. Where does the wiping
> happen actually?

There is *NO* wiping happening, there is *NO* wiping going to happen.

Can we stop this nonsense now!

Thank you!
Jes



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-04-28  5:05         ` [mdadm PATCH] Create: tell udev md " NeilBrown
  2017-04-28  9:28           ` Peter Rajnoha
  2017-05-02 13:42           ` Jes Sorensen
@ 2017-05-04 10:58           ` Peter Rajnoha
  2017-05-05  5:16             ` [mdadm PATCH] Fix typo in new udev rule NeilBrown
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-04 10:58 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen; +Cc: linux-raid, dm-devel

On 04/28/2017 07:05 AM, NeilBrown wrote:
> 
> When an array is created the content is not initialized,
> so it could have remnants of an old filesystem or md array
> etc on it.
> udev will see this and might try to activate it, which is almost
> certainly not what is wanted.
> 
> So create a mechanism for mdadm to communicate with udev to tell
> it that the device isn't ready.  This mechanism is the existance
> of a file /run/mdadm/created-mdXXX where mdXXX is the md device name.
> 
> When creating an array, mdadm will create the file.
> A new udev rule file, 01-md-raid-creating.rules, will detect the
> precense of thst file and set ENV{SYSTEMD_READY}="0".
> This is fairly uniformly used to suppress actions based on the
> contents of the device.
> 

...

> diff --git a/udev-md-raid-creating.rules b/udev-md-raid-creating.rules
> new file mode 100644
> index 000000000000..2be466bcefd2
> --- /dev/null
> +++ b/udev-md-raid-creating.rules
> @@ -0,0 +1,7 @@
> +# do not edit this file, it will be overwritten on update
> +# While mdadm is creating an array, it creates a file
> +# /run/mdadm/creating-mdXXX.  If that file exists, then
> +# the array is not "ready" and we should make sure the
> +# content is ignored.
> +
> +KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
> 

This should be TEST=="/run/mdadm/creating-$kernel" (double equals sign,
otherwise it's ignored by udev as incorrect rule).

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [mdadm PATCH] Fix typo in new udev rule.
  2017-05-04 10:58           ` Peter Rajnoha
@ 2017-05-05  5:16             ` NeilBrown
  2017-05-05 15:07               ` Jes Sorensen
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2017-05-05  5:16 UTC (permalink / raw)
  To: Peter Rajnoha, Jes Sorensen; +Cc: linux-raid, dm-devel

[-- Attachment #1: Type: text/plain, Size: 1492 bytes --]


As pointed out by Peter Rajnoha, the correct usage in udev is
TEST=="file", not TEST="file".

Also improve a related comment which was a bit informal.

Reported-by: Peter Rajnoha <prajnoha@redhat.com>
Fixes: cd6cbb08c458 ("Create: tell udev md device is not ready when first created.")
Signed-off-by: NeilBrown <neilb@suse.com>
---

Thanks Peter!  Properly tested as well :-)

NeilBrown


 lib.c                       | 2 +-
 udev-md-raid-creating.rules | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib.c b/lib.c
index 7e44b1f27fcc..be093e8c30cb 100644
--- a/lib.c
+++ b/lib.c
@@ -165,7 +165,7 @@ char *fd2devnm(int fd)
 
 /* When we create a new array, we don't want the content to
  * be immediately examined by udev - it is probably meaningless.
- * So create /run/mdadm/creating-FOO and expect that a udev
+ * So create /run/mdadm/creating-mdXXX and expect that a udev
  * rule will noticed this and act accordingly.
  */
 static char block_path[] = "/run/mdadm/creating-%s";
diff --git a/udev-md-raid-creating.rules b/udev-md-raid-creating.rules
index 2be466bcefd2..9bef8d1103f5 100644
--- a/udev-md-raid-creating.rules
+++ b/udev-md-raid-creating.rules
@@ -4,4 +4,4 @@
 # the array is not "ready" and we should make sure the
 # content is ignored.
 
-KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
+KERNEL=="md*", TEST=="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
-- 
2.12.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Fix typo in new udev rule.
  2017-05-05  5:16             ` [mdadm PATCH] Fix typo in new udev rule NeilBrown
@ 2017-05-05 15:07               ` Jes Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Jes Sorensen @ 2017-05-05 15:07 UTC (permalink / raw)
  To: NeilBrown, Peter Rajnoha; +Cc: linux-raid, dm-devel

On 05/05/2017 01:16 AM, NeilBrown wrote:
>
> As pointed out by Peter Rajnoha, the correct usage in udev is
> TEST=="file", not TEST="file".
>
> Also improve a related comment which was a bit informal.
>
> Reported-by: Peter Rajnoha <prajnoha@redhat.com>
> Fixes: cd6cbb08c458 ("Create: tell udev md device is not ready when first created.")
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>
> Thanks Peter!  Properly tested as well :-)
>
> NeilBrown

Applied!

Thanks both of you!

Jes



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-03 14:13               ` Peter Rajnoha
  2017-05-03 14:44                 ` Jes Sorensen
@ 2017-05-06 16:25                 ` Wols Lists
  2017-05-06 19:50                   ` Phil Turmel
                                     ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Wols Lists @ 2017-05-06 16:25 UTC (permalink / raw)
  To: Peter Rajnoha, Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 03/05/17 15:13, Peter Rajnoha wrote:
> There's a difference though - when you're *creating* a completely new
> device that is an abstraction over existing devices, you (most of the
> time) expect that new device to be initialized. For those corner cases
> where people do need to keep the old data, there can be an option to do
> that.

That's not a corner case. If there's old data that's the NORM. I get
what you're after, I'm inclined to agree with you, but the default
should be to DO NOTHING.

If you want mdadm to mess about with the content of the drives you
should either (a) explicitly tell it to (yes I would like that option
:-), or (b) do it yourself beforehand - dd if=/dev/zero etc etc.

It does seem weird to me that mdadm spends a lot of effort initialising
an array and calculating parity blah-di-blah, and you can't tell it to
just "set everything to zero". But there's no way it should mess about
with what was there before, without explicitly being told to.

 When you're inserting existing drives, you're not creating them -
> when those device come from factory (they're "created"), they never
> contain garbage and old data when you buy them.

As Jes says, USB devices rarely come with nothing on them. MS eventually
learnt their lesson, "doing something" BY DEFAULT with unknown/untrusted
data was a really stupid idea - it was far too easy to get your system
"pwned". Here it would be far too easy to trash an array you're trying
to recover.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-06 16:25                 ` Wols Lists
@ 2017-05-06 19:50                   ` Phil Turmel
  2017-05-11 13:23                     ` Martin Wilck
  2017-05-09 11:57                   ` Peter Rajnoha
  2017-05-09 12:14                   ` Peter Rajnoha
  2 siblings, 1 reply; 35+ messages in thread
From: Phil Turmel @ 2017-05-06 19:50 UTC (permalink / raw)
  To: Wols Lists, Peter Rajnoha, Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 05/06/2017 12:25 PM, Wols Lists wrote:
> On 03/05/17 15:13, Peter Rajnoha wrote:
>> There's a difference though - when you're *creating* a completely new
>> device that is an abstraction over existing devices, you (most of the
>> time) expect that new device to be initialized. For those corner cases
>> where people do need to keep the old data, there can be an option to do
>> that.
> 
> That's not a corner case. If there's old data that's the NORM. I get
> what you're after, I'm inclined to agree with you, but the default
> should be to DO NOTHING.

There's some miscommunication here, it seems.  mdadm does NOT wipe the
primary device in a mirror during create, nor data blocks in a
parity-based array's stripes, but it DOES wipe the 2nd and following
copy of all mirrored blocks, and DOES wipe all parity and Q blocks.
This behaviour is really important for later consistency when degraded,
and for parallelizing reads of mirrors.

I think a case can be made for wiping it all during create, for
consistency with other kernel behaviour.  If you allocate memory for
userspace, or or create a new file on a filesystem with a specific size,
linux goes to some length to deliver zeros in both cases.  Arguably, it
should do so in MD, too.

However, changing the default behaviour would fit Linus' definition of a
regression, I would think.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-06 16:25                 ` Wols Lists
  2017-05-06 19:50                   ` Phil Turmel
@ 2017-05-09 11:57                   ` Peter Rajnoha
  2017-05-09 12:14                   ` Peter Rajnoha
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-09 11:57 UTC (permalink / raw)
  To: Wols Lists, Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 05/06/2017 06:25 PM, Wols Lists wrote:
> On 03/05/17 15:13, Peter Rajnoha wrote:
>> There's a difference though - when you're *creating* a completely new
>> device that is an abstraction over existing devices, you (most of the
>> time) expect that new device to be initialized. For those corner cases
>> where people do need to keep the old data, there can be an option to do
>> that.
> 
> That's not a corner case. If there's old data that's the NORM. I get
> what you're after, I'm inclined to agree with you, but the default
> should be to DO NOTHING.
> 
Well, if keeping old data is norm or not (IOW, if wiping is going to be
default or not) - I'm not going to hassle about this. But at least we
should let users to decide on command line directly. If mdadm decides to
do this part somewhere outside by some other external tool, it should at
least coordinate with that tool then - that's an alternative.

I just think it's not that common to take raw disks with some old data
(presumably starting at the beginning of the disk or some offset) and
then using these disks to create a new MD device on top of them and then
for us to expect we see exactly same data content - maybe yes with a
luck, but usually, the MD signature reserves some space at the beginning
(and/or end) of the device to write its signatures (which overwrites
some of the old content anyway).

The use case where keeping the old data would probably make more sense
is when you had those disks already used by an MD device, then you wrote
some data/content to the MD device, then you removed the MD device and
then you recreate it again with exactly the same raid level and with
exactly the same MD signature lengths as before so you end up with the
same data area revealed by the fresh MD device. Now, the question is,
how many users do this? (yes, for recovery, but probably not for fresh
new array)

Anyway, I'm still OK to have the wiping disabled by default, but what I
want to see is that option to do enable the wiping on demand at least
for mdadm.

The patch as it was proposed and written automatically marks device as
not usable (SYSTEMD_READY=0) on new MD device creation directly in udev
database. So it leaves the decision on something external to mdadm.

OK then, let's wipe it ourselves by calling, for example, "wipefs -a" on
that MD device - the wiping is done and the SYSTEMD_READY=0 flag is
dropped due to the WATCH udev rule that fires (and synthesizes CHANGE
uevent that causes the rules and udev db values to be reevaluated). This
works.

And now the other use case - to not wipe anything - user is not going to
call any wipefs-like tool, but just keeps things as they are. In this
case, for the udev database content to be correct we either need to wait
for next CHANGE uevent to happen (whatever it's cause is) OR we need to
synthesize that CHANGE uevent directly on command line by writing "echo
change > /sys/block/md.../uevent". Now, how should we educate users to
do this (quite low-level) extra call? If the extra uevent is not there,
the udev db is out-of-date simply.

If that patch stays, I think the next logical step for distributions is
to take the upstream version of mdadm and then to create a wrapper over
upstream mdadm to provide this option in addition.

So wrapped mdadm would look something like:

  call mdadm --create
  if (user_passed_option_to_do_wiping)
    call wipefs -a /dev/md_name
  else
    write "change" to /sys/block/<md_name>/uevent

I'm afraid it's going to end this way, uselessly - it could have been
directly a part of mdadm with very little cost. That cost is dependency
on libbblkid (to get the signature offsets/lenghts we need to wipe for
event-based systems to not hook on them by mistake).

Now, what is the environment where libblkid is not present and mdadm is?
libblkid is quite basic library (part of util-linux) so present on
systems widely, even in initrds. Even that systemd uses it - just to
mention that the mdadm patch uses SYSTEMD_READY variable which is on
systemd-based systems only, but anyway... I'm not going to be that picky :)

> If you want mdadm to mess about with the content of the drives you
> should either (a) explicitly tell it to (yes I would like that option
> :-), or (b) do it yourself beforehand - dd if=/dev/zero etc etc.
> 
> It does seem weird to me that mdadm spends a lot of effort initialising
> an array and calculating parity blah-di-blah, and you can't tell it to
> just "set everything to zero". But there's no way it should mess about
> with what was there before, without explicitly being told to.
> 
>  When you're inserting existing drives, you're not creating them -
>> when those device come from factory (they're "created"), they never
>> contain garbage and old data when you buy them.
> 
> As Jes says, USB devices rarely come with nothing on them. MS eventually
> learnt their lesson, "doing something" BY DEFAULT with unknown/untrusted
> data was a really stupid idea - it was far too easy to get your system
> "pwned". Here it would be far too easy to trash an array you're trying
> to recover.

Recovery, yes, I agree.

But I suppose that SUCH recovery is less frequent than creating a new md
array. Anyway, let's keep the default to not wipe, if we like it more,
but I'm begging for the mdadm to have an option to have the possibility
to do the wiping on demand so I can choose that when I'm creating a new
fresh MD device and I know for sure I don't care about old stuff that
was written before. And if I choose to not do the wiping, I want mdadm
to generate the extra uevent that's needed to update the udev database
accordingly (e.g. see the wrapper I mentioned above).

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-06 16:25                 ` Wols Lists
  2017-05-06 19:50                   ` Phil Turmel
  2017-05-09 11:57                   ` Peter Rajnoha
@ 2017-05-09 12:14                   ` Peter Rajnoha
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Rajnoha @ 2017-05-09 12:14 UTC (permalink / raw)
  To: Wols Lists, Jes Sorensen, NeilBrown; +Cc: linux-raid, dm-devel

On 05/06/2017 06:25 PM, Wols Lists wrote:
> On 03/05/17 15:13, Peter Rajnoha wrote:
>> There's a difference though - when you're *creating* a completely new
>> device that is an abstraction over existing devices, you (most of the
>> time) expect that new device to be initialized. For those corner cases
>> where people do need to keep the old data, there can be an option to do
>> that.
> 
> That's not a corner case. If there's old data that's the NORM. I get
> what you're after, I'm inclined to agree with you, but the default
> should be to DO NOTHING.
> 
> If you want mdadm to mess about with the content of the drives you
> should either (a) explicitly tell it to (yes I would like that option
> :-), or (b) do it yourself beforehand - dd if=/dev/zero etc etc.

As for (b), the "dd" with zeroing all the future mdadm components I'm
going to use for an MD device is quite time-consuming operation, mainly
for large devices.

And we don't need to do that full zeroing at all, what I'm after is to
have the wipefs-like functionality directly integrated into mdadm (that
functionality is already a part of libbblkid and it's very easy to use
it as a matter of fact). With this, we can detect and wipe all
signatures that blkid detects. The blkid is used as a primary source of
information when processing uevents to decide about further processing.
Once we wipe all that blkid can see, we make the device clear for udev
processing too and we don't end up automatically activating some old
garbage that we don't intend to actually based on these uevent hooks.

-- 
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [mdadm PATCH] Create: tell udev md device is not ready when first created.
  2017-05-06 19:50                   ` Phil Turmel
@ 2017-05-11 13:23                     ` Martin Wilck
  0 siblings, 0 replies; 35+ messages in thread
From: Martin Wilck @ 2017-05-11 13:23 UTC (permalink / raw)
  To: dm-devel

On Sat, 2017-05-06 at 15:50 -0400, Phil Turmel wrote:
> 
> I think a case can be made for wiping it all during create, for
> consistency with other kernel behaviour.  If you allocate memory for
> userspace, or or create a new file on a filesystem with a specific
> size,
> linux goes to some length to deliver zeros in both cases.  Arguably,
> it
> should do so in MD, too.
> 
> However, changing the default behaviour would fit Linus' definition
> of a
> regression, I would think.

+1 for the last paragraph.

Waiting for the day when a big corporate user creates an MD mirror from
a disk with invaluable data and expects the data to persist, but mdadm
just wipes it all ...

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2017-05-11 13:23 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-20  2:40 [mdadm PATCH 0/4] Assorted mdadm patches NeilBrown
2017-04-20  2:40 ` [mdadm PATCH 2/4] systemd/mdadm-last-resort: use ConditionPathExists instead of Conflicts NeilBrown
2017-04-20 16:57   ` Jes Sorensen
2017-04-20  2:40 ` [mdadm PATCH 1/4] Grow_continue_command: ensure 'content' is properly initialised NeilBrown
2017-04-20 16:56   ` Jes Sorensen
2017-04-20  2:40 ` [mdadm PATCH 3/4] Detail: ensure --export names are acceptable as shell variables NeilBrown
2017-04-20 16:59   ` Jes Sorensen
2017-04-20  2:40 ` [mdadm PATCH 4/4] Create: tell udev device is not ready when first created NeilBrown
2017-04-20 17:29   ` Jes Sorensen
2017-04-20 21:35     ` NeilBrown
2017-04-26 10:19       ` Peter Rajnoha
2017-04-28  3:55         ` NeilBrown
2017-04-28  9:08           ` Peter Rajnoha
2017-05-01  4:35             ` [dm-devel] " NeilBrown
2017-05-02 11:40               ` Peter Rajnoha
2017-05-02 13:40                 ` Jes Sorensen
2017-05-03 14:27                   ` Peter Rajnoha
2017-05-03 14:41                     ` Jes Sorensen
2017-05-02 21:42                 ` NeilBrown
2017-04-28  5:05         ` [mdadm PATCH] Create: tell udev md " NeilBrown
2017-04-28  9:28           ` Peter Rajnoha
2017-05-02 13:32             ` Jes Sorensen
2017-05-03 14:13               ` Peter Rajnoha
2017-05-03 14:44                 ` Jes Sorensen
2017-05-06 16:25                 ` Wols Lists
2017-05-06 19:50                   ` Phil Turmel
2017-05-11 13:23                     ` Martin Wilck
2017-05-09 11:57                   ` Peter Rajnoha
2017-05-09 12:14                   ` Peter Rajnoha
2017-05-02 13:42           ` Jes Sorensen
2017-05-03 14:32             ` Peter Rajnoha
2017-05-03 14:45               ` [dm-devel] " Jes Sorensen
2017-05-04 10:58           ` Peter Rajnoha
2017-05-05  5:16             ` [mdadm PATCH] Fix typo in new udev rule NeilBrown
2017-05-05 15:07               ` Jes Sorensen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.