linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback
@ 2023-05-30 20:31 Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 01/16] device-mapper: Check that target specs are sufficiently aligned Demi Marie Obenour
                   ` (16 more replies)
  0 siblings, 17 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

This work aims to allow userspace to create and destroy block devices
in a race-free way, and to allow them to be exposed to other Xen VMs via
blkback without races.

Changes since v1:

- Several device-mapper fixes added.
- The diskseq is now a separate Xenstore node, rather than being part of
  physical-device.
- Potentially backwards-incompatible changes to device-mapper now
  require userspace opt-in.
- The code has been tested: I have a block script written in C that uses
  these changes to successfully boot a Xen VM.
- The core block layer is almost completely untouched.  Instead of
  exposing a block device inode directly to userspace, device-mapper
  ioctls that create a block device now return that device's diskseq.
  Userspace can then use that diskseq to safely open the device.
  Furthermore, ioctls that operate on an existing device-mapper device
  now accept a diskseq parameter, which can be used to prevent races.

There are a few changes that make device-mapper's table validation
stricter.  Two of them are clear-cut fixes for memory safety bugs: one
prevents a misaligned pointer dereference in the kernel, and the other
prevents pointer arithmetic overflow that could cause the kernel to
dereference a userspace pointer, especially on 32-bit systems.  One
fixes a double-fetch bug that happens to be harmless right now, but
would make distribution backports of subsequent patches unsafe.  The
remaining fixes prevent totally nonsensical tables from being accepted.
This includes parameter strings that overlap the subsequent target spec,
and target specs that overlap the 'struct dm_ioctl' or each other.  I
doubt there is any userspace extant that generates such tables.

Finally, one patch forbids device-mapper devices to be named ".", "..",
or "control".  Since device-mapper devices are often accessed via
/dev/mapper/NAME, such names would likely greatly confuse userspace.  I
consider this to be an extension of the existing check that prohibits
device mapper names or UUIDs from containing '/'.

Demi Marie Obenour (16):
  device-mapper: Check that target specs are sufficiently aligned
  device-mapper: Avoid pointer arithmetic overflow
  device-mapper: do not allow targets to overlap 'struct dm_ioctl'
  device-mapper: Better error message for too-short target spec
  device-mapper: Target parameters must not overlap next target spec
  device-mapper: Avoid double-fetch of version
  device-mapper: Allow userspace to opt-in to strict parameter checks
  device-mapper: Allow userspace to provide expected diskseq
  device-mapper: Allow userspace to suppress uevent generation
  device-mapper: Refuse to create device named "control"
  device-mapper: "." and ".." are not valid symlink names
  device-mapper: inform caller about already-existing device
  xen-blkback: Implement diskseq checks
  block, loop: Increment diskseq when releasing a loop device
  xen-blkback: Minor cleanups
  xen-blkback: Inform userspace that device has been opened

 block/genhd.c                       |   1 +
 drivers/block/loop.c                |   6 +
 drivers/block/xen-blkback/blkback.c |   8 +-
 drivers/block/xen-blkback/xenbus.c  | 147 ++++++++--
 drivers/md/dm-core.h                |   2 +
 drivers/md/dm-ioctl.c               | 432 ++++++++++++++++++++++------
 drivers/md/dm.c                     |   5 +-
 include/linux/device-mapper.h       |   2 +-
 include/uapi/linux/dm-ioctl.h       |  67 ++++-
 9 files changed, 551 insertions(+), 119 deletions(-)

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 01/16] device-mapper: Check that target specs are sufficiently aligned
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 02/16] device-mapper: Avoid pointer arithmetic overflow Demi Marie Obenour
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel, stable

Otherwise subsequent code will dereference a misaligned
`struct dm_target_spec *`, which is undefined behavior.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
---
 drivers/md/dm-ioctl.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index cc77cf3d410921432eb0c62cdede7d55b9aa674a..34fa74c6a70db8aa67aaba3f6a2fc4f38ef736bc 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1394,6 +1394,13 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
 static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 		       struct dm_target_spec **spec, char **target_params)
 {
+	static_assert(_Alignof(struct dm_target_spec) <= 8,
+		      "struct dm_target_spec has excessive alignment requirements");
+	if (next % 8) {
+		DMERR("Next target spec (offset %u) is not 8-byte aligned", next);
+		return -EINVAL;
+	}
+
 	*spec = (struct dm_target_spec *) ((unsigned char *) last + next);
 	*target_params = (char *) (*spec + 1);
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 02/16] device-mapper: Avoid pointer arithmetic overflow
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 01/16] device-mapper: Check that target specs are sufficiently aligned Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 03/16] device-mapper: do not allow targets to overlap 'struct dm_ioctl' Demi Marie Obenour
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel, stable

Especially on 32-bit systems, it is possible for the pointer arithmetic
to overflow and cause a userspace pointer to be dereferenced in the
kernel.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
---
 drivers/md/dm-ioctl.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 34fa74c6a70db8aa67aaba3f6a2fc4f38ef736bc..64e8f16d344c47057de5e2d29e3d63202197dca0 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1396,6 +1396,25 @@ static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 {
 	static_assert(_Alignof(struct dm_target_spec) <= 8,
 		      "struct dm_target_spec has excessive alignment requirements");
+	static_assert(offsetof(struct dm_ioctl, data) >= sizeof(struct dm_target_spec),
+		      "struct dm_target_spec too big");
+
+	/*
+	 * Number of bytes remaining, starting with last. This is always
+	 * sizeof(struct dm_target_spec) or more, as otherwise *last was
+	 * out of bounds already.
+	 */
+	size_t remaining = (char *)end - (char *)last;
+
+	/*
+	 * There must be room for both the next target spec and the
+	 * NUL-terminator of the target itself.
+	 */
+	if (remaining - sizeof(struct dm_target_spec) <= next) {
+		DMERR("Target spec extends beyond end of parameters");
+		return -EINVAL;
+	}
+
 	if (next % 8) {
 		DMERR("Next target spec (offset %u) is not 8-byte aligned", next);
 		return -EINVAL;
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 03/16] device-mapper: do not allow targets to overlap 'struct dm_ioctl'
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 01/16] device-mapper: Check that target specs are sufficiently aligned Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 02/16] device-mapper: Avoid pointer arithmetic overflow Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 04/16] device-mapper: Better error message for too-short target spec Demi Marie Obenour
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel, stable

This prevents dm_split_args() from corrupting this struct.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
---
 drivers/md/dm-ioctl.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 64e8f16d344c47057de5e2d29e3d63202197dca0..a1d5fe64e1d0d9d3dcb06924249b89fe661944ab 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1444,6 +1444,12 @@ static int populate_table(struct dm_table *table,
 		return -EINVAL;
 	}
 
+	if (next < sizeof(struct dm_ioctl)) {
+		DMERR("%s: first target spec (offset %u) overlaps 'struct dm_ioctl'",
+		      __func__, next);
+		return -EINVAL;
+	}
+
 	for (i = 0; i < param->target_count; i++) {
 
 		r = next_target(spec, next, end, &spec, &target_params);
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 04/16] device-mapper: Better error message for too-short target spec
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (2 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 03/16] device-mapper: do not allow targets to overlap 'struct dm_ioctl' Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 05/16] device-mapper: Target parameters must not overlap next " Demi Marie Obenour
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Previously the error was "unable to find target", which is not helpful.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index a1d5fe64e1d0d9d3dcb06924249b89fe661944ab..9f505abba3dc22bffc6acb335c0bf29fec288fd5 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1423,9 +1423,6 @@ static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 	*spec = (struct dm_target_spec *) ((unsigned char *) last + next);
 	*target_params = (char *) (*spec + 1);
 
-	if (*spec < (last + 1))
-		return -EINVAL;
-
 	return invalid_str(*target_params, end);
 }
 
@@ -1451,6 +1448,11 @@ static int populate_table(struct dm_table *table,
 	}
 
 	for (i = 0; i < param->target_count; i++) {
+		if (next < sizeof(*spec)) {
+			DMERR("%s: next target spec (offset %u) overlaps 'struct dm_target_spec'",
+			      __func__, next);
+			return -EINVAL;
+		}
 
 		r = next_target(spec, next, end, &spec, &target_params);
 		if (r) {
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 05/16] device-mapper: Target parameters must not overlap next target spec
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (3 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 04/16] device-mapper: Better error message for too-short target spec Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 06/16] device-mapper: Avoid double-fetch of version Demi Marie Obenour
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel, stable

The NUL terminator for each target parameter string must preceed the
following 'struct dm_target_spec'.  Otherwise, dm_split_args() might
corrupt this struct.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
---
 drivers/md/dm-ioctl.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 9f505abba3dc22bffc6acb335c0bf29fec288fd5..491ef55b9e8662c3b02a2162b8c93ee086c078a1 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1391,7 +1391,7 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
 	return mode;
 }
 
-static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
+static int next_target(struct dm_target_spec *last, uint32_t next, const char *end,
 		       struct dm_target_spec **spec, char **target_params)
 {
 	static_assert(_Alignof(struct dm_target_spec) <= 8,
@@ -1404,7 +1404,7 @@ static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 	 * sizeof(struct dm_target_spec) or more, as otherwise *last was
 	 * out of bounds already.
 	 */
-	size_t remaining = (char *)end - (char *)last;
+	size_t remaining = end - (char *)last;
 
 	/*
 	 * There must be room for both the next target spec and the
@@ -1423,7 +1423,7 @@ static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 	*spec = (struct dm_target_spec *) ((unsigned char *) last + next);
 	*target_params = (char *) (*spec + 1);
 
-	return invalid_str(*target_params, end);
+	return 0;
 }
 
 static int populate_table(struct dm_table *table,
@@ -1433,24 +1433,21 @@ static int populate_table(struct dm_table *table,
 	unsigned int i = 0;
 	struct dm_target_spec *spec = (struct dm_target_spec *) param;
 	uint32_t next = param->data_start;
-	void *end = (void *) param + param_size;
+	const char *const end = (const char *) param + param_size;
 	char *target_params;
+	size_t min_size = sizeof(struct dm_ioctl);
 
 	if (!param->target_count) {
 		DMERR("%s: no targets specified", __func__);
 		return -EINVAL;
 	}
 
-	if (next < sizeof(struct dm_ioctl)) {
-		DMERR("%s: first target spec (offset %u) overlaps 'struct dm_ioctl'",
-		      __func__, next);
-		return -EINVAL;
-	}
-
 	for (i = 0; i < param->target_count; i++) {
-		if (next < sizeof(*spec)) {
-			DMERR("%s: next target spec (offset %u) overlaps 'struct dm_target_spec'",
-			      __func__, next);
+		const char *nul_terminator;
+
+		if (next < min_size) {
+			DMERR("%s: next target spec (offset %u) overlaps %s",
+			      __func__, next, i ? "previous target" : "'struct dm_ioctl'");
 			return -EINVAL;
 		}
 
@@ -1460,6 +1457,15 @@ static int populate_table(struct dm_table *table,
 			return r;
 		}
 
+		nul_terminator = memchr(target_params, 0, (size_t)(end - target_params));
+		if (nul_terminator == NULL) {
+			DMERR("%s: target parameters not NUL-terminated", __func__);
+			return -EINVAL;
+		}
+
+		/* Add 1 for NUL terminator */
+		min_size = (nul_terminator - (const char *)spec) + 1;
+
 		r = dm_table_add_target(table, spec->target_type,
 					(sector_t) spec->sector_start,
 					(sector_t) spec->length,
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 06/16] device-mapper: Avoid double-fetch of version
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (4 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 05/16] device-mapper: Target parameters must not overlap next " Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 07/16] device-mapper: Allow userspace to opt-in to strict parameter checks Demi Marie Obenour
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel, stable

The version is fetched once in check_version(), which then does some
validation and then overwrites the version in userspace with the API
version supported by the kernel.  copy_params() then fetches the version
from userspace *again*, and this time no validation is done.  The result
is that the kernel's version number is completely controllable by
userspace, provided that userspace can win a race condition.

Fix this flaw by not copying the version back to the kernel the second
time.  This is not exploitable as the version is not further used in the
kernel.  However, it could become a problem if future patches start
relying on the version field.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 491ef55b9e8662c3b02a2162b8c93ee086c078a1..20f452b6c61c1c4d20259fd0fc5443977e4454a0 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1873,12 +1873,13 @@ static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags)
  * As well as checking the version compatibility this always
  * copies the kernel interface version out.
  */
-static int check_version(unsigned int cmd, struct dm_ioctl __user *user)
+static int check_version(unsigned int cmd, struct dm_ioctl __user *user,
+			 struct dm_ioctl *kernel_params)
 {
-	uint32_t version[3];
 	int r = 0;
+	uint32_t *version = kernel_params->version;
 
-	if (copy_from_user(version, user->version, sizeof(version)))
+	if (copy_from_user(version, user->version, sizeof(user->version)))
 		return -EFAULT;
 
 	if ((version[0] != DM_VERSION_MAJOR) ||
@@ -1922,7 +1923,10 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
 	unsigned int noio_flag;
 
-	if (copy_from_user(param_kernel, user, minimum_data_size))
+	/* Version has been copied from userspace already, avoid TOCTOU */
+	if (copy_from_user((char *)param_kernel + sizeof(param_kernel->version),
+			   (char __user *)user + sizeof(param_kernel->version),
+			   minimum_data_size - sizeof(param_kernel->version)))
 		return -EFAULT;
 
 	if (param_kernel->data_size < minimum_data_size) {
@@ -2034,7 +2038,7 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	 * Check the interface version passed in.  This also
 	 * writes out the kernel's interface version.
 	 */
-	r = check_version(cmd, user);
+	r = check_version(cmd, user, &param_kernel);
 	if (r)
 		return r;
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 07/16] device-mapper: Allow userspace to opt-in to strict parameter checks
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (5 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 06/16] device-mapper: Avoid double-fetch of version Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 08/16] device-mapper: Allow userspace to provide expected diskseq Demi Marie Obenour
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Currently, device-mapper ioctls ignore unknown flags.  This makes
adding new flags to a given ioctl risky, as it could potentially break
old userspace.

To solve this problem, allow userspace to pass 5 as the major version to
any ioctl.  This causes the kernel to reject any flags that are not
supported by the ioctl, as well as nonzero padding and names and UUIDs
that are not NUL-terminated.  New flags will only be recognized if major
version 5 is used.  Kernels without this patch return -EINVAL if the
major version is 5, so this is backwards compatible.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c         | 301 ++++++++++++++++++++++++++--------
 include/uapi/linux/dm-ioctl.h |  30 +++-
 2 files changed, 260 insertions(+), 71 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 20f452b6c61c1c4d20259fd0fc5443977e4454a0..cf752e72ef6a2d8f8230e5bd6d1a6dc817a4f597 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -64,7 +64,8 @@ struct vers_iter {
 static struct rb_root name_rb_tree = RB_ROOT;
 static struct rb_root uuid_rb_tree = RB_ROOT;
 
-static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool only_deferred);
+static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool only_deferred,
+			       struct dm_ioctl *param);
 
 /*
  * Guards access to both hash tables.
@@ -78,7 +79,7 @@ static DEFINE_MUTEX(dm_hash_cells_mutex);
 
 static void dm_hash_exit(void)
 {
-	dm_hash_remove_all(false, false, false);
+	dm_hash_remove_all(false, false, false, NULL);
 }
 
 /*
@@ -333,7 +334,8 @@ static struct dm_table *__hash_remove(struct hash_cell *hc)
 	return table;
 }
 
-static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool only_deferred)
+static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool only_deferred,
+			       struct dm_ioctl *param)
 {
 	int dev_skipped;
 	struct rb_node *n;
@@ -367,6 +369,8 @@ static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool
 			dm_table_destroy(t);
 		}
 		dm_ima_measure_on_device_remove(md, true);
+		if (param != NULL && !dm_kobject_uevent(md, KOBJ_REMOVE, param->event_nr, false))
+			param->flags |= DM_UEVENT_GENERATED_FLAG;
 		dm_put(md);
 		if (likely(keep_open_devices))
 			dm_destroy(md);
@@ -513,7 +517,7 @@ static struct mapped_device *dm_hash_rename(struct dm_ioctl *param,
 
 void dm_deferred_remove(void)
 {
-	dm_hash_remove_all(true, false, true);
+	dm_hash_remove_all(true, false, true, NULL);
 }
 
 /*
@@ -529,7 +533,7 @@ typedef int (*ioctl_fn)(struct file *filp, struct dm_ioctl *param, size_t param_
 
 static int remove_all(struct file *filp, struct dm_ioctl *param, size_t param_size)
 {
-	dm_hash_remove_all(true, !!(param->flags & DM_DEFERRED_REMOVE), false);
+	dm_hash_remove_all(true, !!(param->flags & DM_DEFERRED_REMOVE), false, param);
 	param->data_size = 0;
 	return 0;
 }
@@ -892,8 +896,6 @@ static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_si
 		return r;
 	}
 
-	param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
-
 	__dev_status(md, param);
 
 	dm_put(md);
@@ -947,8 +949,6 @@ static struct hash_cell *__find_device_hash_cell(struct dm_ioctl *param)
 
 	if (hc->new_map)
 		param->flags |= DM_INACTIVE_PRESENT_FLAG;
-	else
-		param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
 
 	return hc;
 }
@@ -1161,7 +1161,6 @@ static int do_resume(struct dm_ioctl *param)
 
 	new_map = hc->new_map;
 	hc->new_map = NULL;
-	param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
 
 	up_write(&_hash_lock);
 
@@ -1426,6 +1425,32 @@ static int next_target(struct dm_target_spec *last, uint32_t next, const char *e
 	return 0;
 }
 
+static inline bool sloppy_checks(const struct dm_ioctl *param)
+{
+	return param->version[0] < DM_VERSION_MAJOR_STRICT;
+}
+
+static bool no_non_nul_after_nul(const char *untrusted_str, size_t size,
+				 unsigned int cmd, const char *msg)
+{
+	const char *cursor;
+	const char *endp = untrusted_str + size;
+	const char *nul_terminator = memchr(untrusted_str, '\0', size);
+
+	if (nul_terminator == NULL) {
+		DMERR("%s not NUL-terminated, cmd(%u)", msg, cmd);
+		return false;
+	}
+	for (cursor = nul_terminator; cursor < endp; cursor++) {
+		if (*cursor != 0) {
+			DMERR("%s has non-NUL byte at %zd after NUL byte at %zd, cmd(%u)",
+			      msg, cursor - untrusted_str, nul_terminator - untrusted_str, cmd);
+			return false;
+		}
+	}
+	return true;
+}
+
 static int populate_table(struct dm_table *table,
 			  struct dm_ioctl *param, size_t param_size)
 {
@@ -1436,12 +1461,19 @@ static int populate_table(struct dm_table *table,
 	const char *const end = (const char *) param + param_size;
 	char *target_params;
 	size_t min_size = sizeof(struct dm_ioctl);
+	bool const strict = !sloppy_checks(param);
 
 	if (!param->target_count) {
 		DMERR("%s: no targets specified", __func__);
 		return -EINVAL;
 	}
 
+	if (strict && param_size % 8 != 0) {
+		DMERR("%s: parameter size %zu not multiple of 8",
+		      __func__, param_size);
+		return -EINVAL;
+	}
+
 	for (i = 0; i < param->target_count; i++) {
 		const char *nul_terminator;
 
@@ -1466,6 +1498,18 @@ static int populate_table(struct dm_table *table,
 		/* Add 1 for NUL terminator */
 		min_size = (nul_terminator - (const char *)spec) + 1;
 
+		if (strict) {
+			if (!no_non_nul_after_nul(spec->target_type, sizeof(spec->target_type),
+						  DM_TABLE_LOAD_CMD, "target type"))
+				return -EINVAL;
+
+			if (spec->status) {
+				DMERR("%s: status in target spec must be zero, not %u",
+				      __func__, spec->status);
+				return -EINVAL;
+			}
+		}
+
 		r = dm_table_add_target(table, spec->target_type,
 					(sector_t) spec->sector_start,
 					(sector_t) spec->length,
@@ -1476,6 +1520,32 @@ static int populate_table(struct dm_table *table,
 		}
 
 		next = spec->next;
+
+		if (strict) {
+			uint64_t zero = 0;
+			/*
+			 * param_size is a multiple of 8 so this is still in
+			 * bounds (or 1 past the end).
+			 */
+			size_t expected_next = round_up(min_size, 8);
+
+			if (expected_next != next) {
+				DMERR("%s: in strict mode, expected next to be %zu but it was %u",
+				      __func__, expected_next, next);
+				return -EINVAL;
+			}
+
+			if (memcmp(&zero, nul_terminator, next - min_size + 1) != 0) {
+				DMERR("%s: in strict mode, padding must be zeroed", __func__);
+				return -EINVAL;
+			}
+		}
+	}
+
+	if (strict && next != (size_t)(end - (const char *)spec)) {
+		DMERR("%s: last target size is %u, but %zd bytes remaining in target spec",
+		      __func__, next, end - (const char *)spec);
+		return -EINVAL;
 	}
 
 	return dm_table_complete(table);
@@ -1823,48 +1893,67 @@ static int target_message(struct file *filp, struct dm_ioctl *param, size_t para
  * the ioctl.
  */
 #define IOCTL_FLAGS_NO_PARAMS		1
-#define IOCTL_FLAGS_ISSUE_GLOBAL_EVENT	2
+#define IOCTL_FLAGS_TAKES_EVENT_NR      2
+#define IOCTL_FLAGS_ISSUE_GLOBAL_EVENT	(IOCTL_FLAGS_TAKES_EVENT_NR | 4)
 
 /*
  *---------------------------------------------------------------
  * Implementation of open/close/ioctl on the special char device.
  *---------------------------------------------------------------
  */
-static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags)
+static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags, uint32_t *supported_flags)
 {
 	static const struct {
 		int cmd;
 		int flags;
 		ioctl_fn fn;
+		uint32_t supported_flags;
 	} _ioctls[] = {
-		{DM_VERSION_CMD, 0, NULL}, /* version is dealt with elsewhere */
-		{DM_REMOVE_ALL_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, remove_all},
-		{DM_LIST_DEVICES_CMD, 0, list_devices},
+		/* Macro to make the structure initializers somewhat readable */
+#define I(cmd, flags, fn, supported_flags) {							\
+	(cmd),											\
+	(flags),										\
+	(fn),											\
+	/*											\
+	 * Supported flags in sloppy mode must not include anything in DM_STRICT_ONLY_FLAGS.	\
+	 * Use BUILD_BUG_ON_ZERO to check for that.						\
+	 */											\
+	(supported_flags) | BUILD_BUG_ON_ZERO((supported_flags) & DM_STRICT_ONLY_FLAGS),	\
+}
+		I(DM_VERSION_CMD, 0, NULL, 0), /* version is dealt with elsewhere */
+		I(DM_REMOVE_ALL_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, remove_all,
+		 DM_DEFERRED_REMOVE),
+		I(DM_LIST_DEVICES_CMD, 0, list_devices, DM_UUID_FLAG),
+		I(DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_create,
+		 DM_PERSISTENT_DEV_FLAG),
+		I(DM_DEV_REMOVE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_remove,
+		 DM_DEFERRED_REMOVE),
+		I(DM_DEV_RENAME_CMD, IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_rename,
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_UUID_FLAG),
+		I(DM_DEV_SUSPEND_CMD, IOCTL_FLAGS_NO_PARAMS, dev_suspend,
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_SUSPEND_FLAG | DM_SKIP_LOCKFS_FLAG | DM_NOFLUSH_FLAG),
+		I(DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status, DM_QUERY_INACTIVE_TABLE_FLAG),
+		I(DM_DEV_WAIT_CMD, IOCTL_FLAGS_TAKES_EVENT_NR, dev_wait,
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_STATUS_TABLE_FLAG | DM_NOFLUSH_FLAG),
+		I(DM_TABLE_LOAD_CMD, 0, table_load, DM_QUERY_INACTIVE_TABLE_FLAG | DM_READONLY_FLAG),
+		I(DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear, DM_QUERY_INACTIVE_TABLE_FLAG),
+		I(DM_TABLE_DEPS_CMD, 0, table_deps, DM_QUERY_INACTIVE_TABLE_FLAG),
+		I(DM_TABLE_STATUS_CMD, 0, table_status,
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_STATUS_TABLE_FLAG | DM_NOFLUSH_FLAG),
 
-		{DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_create},
-		{DM_DEV_REMOVE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_remove},
-		{DM_DEV_RENAME_CMD, IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_rename},
-		{DM_DEV_SUSPEND_CMD, IOCTL_FLAGS_NO_PARAMS, dev_suspend},
-		{DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status},
-		{DM_DEV_WAIT_CMD, 0, dev_wait},
+		I(DM_LIST_VERSIONS_CMD, 0, list_versions, 0),
 
-		{DM_TABLE_LOAD_CMD, 0, table_load},
-		{DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear},
-		{DM_TABLE_DEPS_CMD, 0, table_deps},
-		{DM_TABLE_STATUS_CMD, 0, table_status},
-
-		{DM_LIST_VERSIONS_CMD, 0, list_versions},
-
-		{DM_TARGET_MSG_CMD, 0, target_message},
-		{DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry},
-		{DM_DEV_ARM_POLL_CMD, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll},
-		{DM_GET_TARGET_VERSION_CMD, 0, get_target_version},
+		I(DM_TARGET_MSG_CMD, 0, target_message, DM_QUERY_INACTIVE_TABLE_FLAG),
+		I(DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry, 0),
+		I(DM_DEV_ARM_POLL_CMD, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll, 0),
+		I(DM_GET_TARGET_VERSION_CMD, 0, get_target_version, 0),
 	};
 
 	if (unlikely(cmd >= ARRAY_SIZE(_ioctls)))
 		return NULL;
 
 	cmd = array_index_nospec(cmd, ARRAY_SIZE(_ioctls));
+	*supported_flags = _ioctls[cmd].supported_flags;
 	*ioctl_flags = _ioctls[cmd].flags;
 	return _ioctls[cmd].fn;
 }
@@ -1877,27 +1966,34 @@ static int check_version(unsigned int cmd, struct dm_ioctl __user *user,
 			 struct dm_ioctl *kernel_params)
 {
 	int r = 0;
-	uint32_t *version = kernel_params->version;
+	uint32_t expected_major_version = DM_VERSION_MAJOR;
 
-	if (copy_from_user(version, user->version, sizeof(user->version)))
+	if (copy_from_user(kernel_params->version, user->version, sizeof(kernel_params->version)))
 		return -EFAULT;
 
-	if ((version[0] != DM_VERSION_MAJOR) ||
-	    (version[1] > DM_VERSION_MINOR)) {
+	if (kernel_params->version[0] >= DM_VERSION_MAJOR_STRICT)
+		expected_major_version = DM_VERSION_MAJOR_STRICT;
+
+	if ((kernel_params->version[0] != expected_major_version) ||
+	    (kernel_params->version[1] > DM_VERSION_MINOR)) {
 		DMERR("ioctl interface mismatch: kernel(%u.%u.%u), user(%u.%u.%u), cmd(%d)",
-		      DM_VERSION_MAJOR, DM_VERSION_MINOR,
+		      expected_major_version,
+		      DM_VERSION_MINOR,
 		      DM_VERSION_PATCHLEVEL,
-		      version[0], version[1], version[2], cmd);
+		      kernel_params->version[0],
+		      kernel_params->version[1],
+		      kernel_params->version[2],
+		      cmd);
 		r = -EINVAL;
 	}
 
 	/*
 	 * Fill in the kernel version.
 	 */
-	version[0] = DM_VERSION_MAJOR;
-	version[1] = DM_VERSION_MINOR;
-	version[2] = DM_VERSION_PATCHLEVEL;
-	if (copy_to_user(user->version, version, sizeof(version)))
+	kernel_params->version[0] = expected_major_version;
+	kernel_params->version[1] = DM_VERSION_MINOR;
+	kernel_params->version[2] = DM_VERSION_PATCHLEVEL;
+	if (copy_to_user(user->version, kernel_params->version, sizeof(kernel_params->version)))
 		return -EFAULT;
 
 	return r;
@@ -1920,9 +2016,12 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 {
 	struct dm_ioctl *dmi;
 	int secure_data;
-	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
+	const size_t minimum_data_size = sloppy_checks(param_kernel) ?
+		offsetof(struct dm_ioctl, data) : sizeof(struct dm_ioctl);
 	unsigned int noio_flag;
 
+	static_assert(offsetof(struct dm_ioctl, data_size) == sizeof(param_kernel->version));
+	static_assert(offsetof(struct dm_ioctl, data_size) == 12);
 	/* Version has been copied from userspace already, avoid TOCTOU */
 	if (copy_from_user((char *)param_kernel + sizeof(param_kernel->version),
 			   (char __user *)user + sizeof(param_kernel->version),
@@ -1930,12 +2029,13 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 		return -EFAULT;
 
 	if (param_kernel->data_size < minimum_data_size) {
-		DMERR("Invalid data size in the ioctl structure: %u",
-		      param_kernel->data_size);
+		DMERR("Invalid data size in the ioctl structure: %u (minimum %zu)",
+		      param_kernel->data_size, minimum_data_size);
 		return -EINVAL;
 	}
 
 	secure_data = param_kernel->flags & DM_SECURE_DATA_FLAG;
+	param_kernel->flags &= ~DM_SECURE_DATA_FLAG;
 
 	*param_flags = secure_data ? DM_WIPE_BUFFER : 0;
 
@@ -1966,7 +2066,8 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	/* Copy from param_kernel (which was already copied from user) */
 	memcpy(dmi, param_kernel, minimum_data_size);
 
-	if (copy_from_user(&dmi->data, (char __user *)user + minimum_data_size,
+	if (copy_from_user((char *)dmi + minimum_data_size,
+			   (char __user *)user + minimum_data_size,
 			   param_kernel->data_size - minimum_data_size))
 		goto bad;
 data_copied:
@@ -1983,33 +2084,86 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	return -EFAULT;
 }
 
-static int validate_params(uint cmd, struct dm_ioctl *param)
+static int validate_params(uint cmd, struct dm_ioctl *param,
+			   uint32_t ioctl_flags, uint32_t supported_flags)
 {
-	/* Always clear this flag */
-	param->flags &= ~DM_BUFFER_FULL_FLAG;
-	param->flags &= ~DM_UEVENT_GENERATED_FLAG;
-	param->flags &= ~DM_SECURE_DATA_FLAG;
-	param->flags &= ~DM_DATA_OUT_FLAG;
-
-	/* Ignores parameters */
-	if (cmd == DM_REMOVE_ALL_CMD ||
-	    cmd == DM_LIST_DEVICES_CMD ||
-	    cmd == DM_LIST_VERSIONS_CMD)
-		return 0;
+	static_assert(__same_type(param->flags, supported_flags));
+	u64 zero = 0;
 
 	if (cmd == DM_DEV_CREATE_CMD) {
 		if (!*param->name) {
 			DMERR("name not supplied when creating device");
 			return -EINVAL;
 		}
-	} else if (*param->uuid && *param->name) {
-		DMERR("only supply one of name or uuid, cmd(%u)", cmd);
+	} else {
+		if (*param->uuid && *param->name) {
+			DMERR("only supply one of name or uuid, cmd(%u)", cmd);
+			return -EINVAL;
+		}
+	}
+
+	if (sloppy_checks(param)) {
+		/* Ensure strings are terminated */
+		param->name[DM_NAME_LEN - 1] = '\0';
+		param->uuid[DM_UUID_LEN - 1] = '\0';
+		/* Mask off bits that could confuse other code */
+		param->flags &= ~DM_STRICT_ONLY_FLAGS;
+		/* Skip strict checks */
+		return 0;
+	}
+
+	/* Check that strings are terminated */
+	if (!no_non_nul_after_nul(param->name, DM_NAME_LEN, cmd, "Name") ||
+	    !no_non_nul_after_nul(param->uuid, DM_UUID_LEN, cmd, "UUID")) {
 		return -EINVAL;
 	}
 
-	/* Ensure strings are terminated */
-	param->name[DM_NAME_LEN - 1] = '\0';
-	param->uuid[DM_UUID_LEN - 1] = '\0';
+	if (memcmp(param->data, &zero, sizeof(param->data)) != 0) {
+		DMERR("second padding field not zeroed in strict mode (cmd %u)", cmd);
+		return -EINVAL;
+	}
+
+	if (param->flags & ~supported_flags) {
+		DMERR("unsupported flags 0x%x specified, cmd(%u)",
+		      param->flags & ~supported_flags, cmd);
+		return -EINVAL;
+	}
+
+	if (param->padding) {
+		DMERR("padding not zeroed in strict mode (got %u, cmd %u)",
+		      param->padding, cmd);
+		return -EINVAL;
+	}
+
+	if (param->open_count != 0) {
+		DMERR("open_count not zeroed in strict mode (got %d, cmd %u)",
+		      param->open_count, cmd);
+		return -EINVAL;
+	}
+
+	if (param->event_nr != 0 && (ioctl_flags & IOCTL_FLAGS_TAKES_EVENT_NR) == 0) {
+		DMERR("Event number not zeroed for command that does not take one (got %u, cmd %u)",
+		      param->event_nr, cmd);
+		return -EINVAL;
+	}
+
+	if (ioctl_flags & IOCTL_FLAGS_NO_PARAMS) {
+		/* Ignores parameters */
+		if (param->data_size != sizeof(struct dm_ioctl)) {
+			DMERR("command %u must not have parameters", cmd);
+			return -EINVAL;
+		}
+
+		if (param->target_count != 0) {
+			DMERR("command %u must have zero target_count", cmd);
+			return -EINVAL;
+		}
+
+		if (param->data_start) {
+			DMERR("command %u must have zero data_start", cmd);
+			return -EINVAL;
+		}
+	}
 
 	return 0;
 }
@@ -2024,6 +2178,7 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	ioctl_fn fn = NULL;
 	size_t input_param_size;
 	struct dm_ioctl param_kernel;
+	uint32_t supported_flags, old_flags;
 
 	/* only root can play with this */
 	if (!capable(CAP_SYS_ADMIN))
@@ -2039,7 +2194,7 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	 * writes out the kernel's interface version.
 	 */
 	r = check_version(cmd, user, &param_kernel);
-	if (r)
+	if (r != 0)
 		return r;
 
 	/*
@@ -2048,7 +2203,7 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	if (cmd == DM_VERSION_CMD)
 		return 0;
 
-	fn = lookup_ioctl(cmd, &ioctl_flags);
+	fn = lookup_ioctl(cmd, &ioctl_flags, &supported_flags);
 	if (!fn) {
 		DMERR("dm_ctl_ioctl: unknown command 0x%x", command);
 		return -ENOTTY;
@@ -2063,11 +2218,20 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 		return r;
 
 	input_param_size = param->data_size;
-	r = validate_params(cmd, param);
+
+	/*
+	 * In sloppy mode, validate_params will clear some
+	 * flags to ensure other code does not get confused.
+	 * Save the original flags here.
+	 */
+	old_flags = param->flags;
+	r = validate_params(cmd, param, ioctl_flags, supported_flags);
 	if (r)
 		goto out;
+	/* This XOR keeps only the flags validate_params has changed. */
+	old_flags ^= param->flags;
 
-	param->data_size = offsetof(struct dm_ioctl, data);
+	param->data_size = sloppy_checks(param) ? offsetof(struct dm_ioctl, data) : sizeof(struct dm_ioctl);
 	r = fn(file, param, input_param_size);
 
 	if (unlikely(param->flags & DM_BUFFER_FULL_FLAG) &&
@@ -2077,6 +2241,9 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	if (!r && ioctl_flags & IOCTL_FLAGS_ISSUE_GLOBAL_EVENT)
 		dm_issue_global_event();
 
+	/* Resture the flags that validate_params cleared */
+	param->flags |= old_flags;
+
 	/*
 	 * Copy the results back to userland.
 	 */
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index 1990b5700f6948243def314cec22f380926aca2e..81103e1dcdac3015204e9c05d73037191e965d59 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -171,8 +171,11 @@ struct dm_target_spec {
 
 	/*
 	 * Parameter string starts immediately after this object.
-	 * Be careful to add padding after string to ensure correct
-	 * alignment of subsequent dm_target_spec.
+	 * Be careful to add padding after string to ensure 8-byte
+	 * alignment of subsequent dm_target_spec.  If the major version
+	 * is DM_VERSION_MAJOR_STRICT, the padding must be at most 7 bytes,
+	 * (not including the terminating NULt that ends the string) and
+	 * must be zeroed.
 	 */
 };
 
@@ -285,14 +288,25 @@ enum {
 #define DM_TARGET_MSG	 _IOWR(DM_IOCTL, DM_TARGET_MSG_CMD, struct dm_ioctl)
 #define DM_DEV_SET_GEOMETRY	_IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)
 
+/* Legacy major version */
 #define DM_VERSION_MAJOR	4
-#define DM_VERSION_MINOR	48
+/*
+ * New major version.  Enforces strict parameter checks and is required for
+ * using some new features, such as new flags.  Should be used by all new code.
+ *
+ * If one uses DM_VERSION_MAJOR_STRICT, it is possible for the behavior of
+ * ioctls to depend on the minor version passed by userspace.  Userspace must
+ * not pass a minor version greater than the version it was designed for.
+ */
+#define DM_VERSION_MAJOR_STRICT 5
+#define DM_VERSION_MINOR	49
 #define DM_VERSION_PATCHLEVEL	0
 #define DM_VERSION_EXTRA	"-ioctl (2023-03-01)"
 
 /* Status bits */
 #define DM_READONLY_FLAG	(1 << 0) /* In/Out */
 #define DM_SUSPEND_FLAG		(1 << 1) /* In/Out */
+#define DM_EXISTS_FLAG		(1 << 2) /* Not used by kernel, reserved for libdevmapper in userland */
 #define DM_PERSISTENT_DEV_FLAG	(1 << 3) /* In */
 
 /*
@@ -315,7 +329,8 @@ enum {
 #define DM_BUFFER_FULL_FLAG	(1 << 8) /* Out */
 
 /*
- * This flag is now ignored.
+ * This flag is now ignored if DM_VERSION_MAJOR is used, and causes
+ * -EINVAL if DM_VERSION_MAJOR_STRICT is used.
  */
 #define DM_SKIP_BDGET_FLAG	(1 << 9) /* In */
 
@@ -382,4 +397,11 @@ enum {
  */
 #define DM_IMA_MEASUREMENT_FLAG	(1 << 19) /* In */
 
+/*
+ * If DM_VERSION_MAJOR is used, these flags are ignored by the kernel.
+ * If DM_VERSION_MAJOR_STRICT is used, these flags are reserved and
+ * must be zeroed.
+ */
+#define DM_STRICT_ONLY_FLAGS ((__u32)0xFFF00004)
+
 #endif				/* _LINUX_DM_IOCTL_H */
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 08/16] device-mapper: Allow userspace to provide expected diskseq
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (6 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 07/16] device-mapper: Allow userspace to opt-in to strict parameter checks Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 09/16] device-mapper: Allow userspace to suppress uevent generation Demi Marie Obenour
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

This can be used to avoid race conditions in which a device is destroyed
and recreated with the same major/minor, name, or UUID.  diskseqs are
only honored if strict parameter checking is on, to avoid any risk of
breaking old userspace.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c         | 48 ++++++++++++++++++++++++++++-------
 include/uapi/linux/dm-ioctl.h | 33 +++++++++++++++++++++---
 2 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index cf752e72ef6a2d8f8230e5bd6d1a6dc817a4f597..01cdf57bcafbf7f3e1b8304eec28792c6b24642d 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -871,6 +871,9 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
 		}
 		dm_put_live_table(md, srcu_idx);
 	}
+
+	if (param->version[0] >= DM_VERSION_MAJOR_STRICT)
+		dm_set_diskseq(param, disk->diskseq);
 }
 
 static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_size)
@@ -889,6 +892,8 @@ static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_si
 	if (r)
 		return r;
 
+	param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
+
 	r = dm_hash_insert(param->name, *param->uuid ? param->uuid : NULL, md);
 	if (r) {
 		dm_put(md);
@@ -909,6 +914,7 @@ static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_si
 static struct hash_cell *__find_device_hash_cell(struct dm_ioctl *param)
 {
 	struct hash_cell *hc = NULL;
+	static_assert(offsetof(struct dm_ioctl, diskseq_high) == offsetof(struct dm_ioctl, data) + 3);
 
 	if (*param->uuid) {
 		if (*param->name || param->dev) {
@@ -937,6 +943,27 @@ static struct hash_cell *__find_device_hash_cell(struct dm_ioctl *param)
 	} else
 		return NULL;
 
+	if (param->version[0] >= DM_VERSION_MAJOR_STRICT) {
+		u64 expected_diskseq = dm_get_diskseq(param);
+		u64 diskseq;
+		struct mapped_device *md = hc->md;
+
+		if (WARN_ON_ONCE(md->disk == NULL))
+			return NULL;
+		diskseq = md->disk->diskseq;
+		if (WARN_ON_ONCE(diskseq == 0))
+			return NULL;
+		if (expected_diskseq != 0) {
+			if (expected_diskseq != diskseq) {
+				DMERR("Diskseq mismatch: expected %llu actual %llu",
+				      expected_diskseq, diskseq);
+				return NULL;
+			}
+		} else {
+			dm_set_diskseq(param, diskseq);
+		}
+	}
+
 	/*
 	 * Sneakily write in both the name and the uuid
 	 * while we have the cell.
@@ -2088,7 +2115,6 @@ static int validate_params(uint cmd, struct dm_ioctl *param,
 			   uint32_t ioctl_flags, uint32_t supported_flags)
 {
 	static_assert(__same_type(param->flags, supported_flags));
-	u64 zero = 0;
 
 	if (cmd == DM_DEV_CREATE_CMD) {
 		if (!*param->name) {
@@ -2112,14 +2138,24 @@ static int validate_params(uint cmd, struct dm_ioctl *param,
 		return 0;
 	}
 
+	if (param->data_size < sizeof(struct dm_ioctl)) {
+		DMERR("Entire struct dm_ioctl (size %zu) must be valid, but only %u was valid",
+		      sizeof(struct dm_ioctl), param->data_size);
+		return -EINVAL;
+	}
+
 	/* Check that strings are terminated */
 	if (!no_non_nul_after_nul(param->name, DM_NAME_LEN, cmd, "Name") ||
 	    !no_non_nul_after_nul(param->uuid, DM_UUID_LEN, cmd, "UUID")) {
 		return -EINVAL;
 	}
 
-	if (memcmp(param->data, &zero, sizeof(param->data)) != 0) {
-		DMERR("second padding field not zeroed in strict mode (cmd %u)", cmd);
+	/*
+	 * This also reads the NUL terminator of the UUID, but that has already been
+	 * checked to be zero by no_non_nul_after_nul().
+	 */
+	if (*(const u32 *)((const char *)param + sizeof(struct dm_ioctl) - 8) != 0) {
+		DMERR("padding field not zeroed in strict mode (cmd %u)", cmd);
 		return -EINVAL;
 	}
 
@@ -2129,12 +2165,6 @@ static int validate_params(uint cmd, struct dm_ioctl *param,
 		return -EINVAL;
 	}
 
-	if (param->padding) {
-		DMERR("padding not zeroed in strict mode (got %u, cmd %u)",
-		      param->padding, cmd);
-		return -EINVAL;
-	}
-
 	if (param->open_count != 0) {
 		DMERR("open_count not zeroed in strict mode (got %d, cmd %u)",
 		      param->open_count, cmd);
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index 81103e1dcdac3015204e9c05d73037191e965d59..5647b218f24b626f5c1cefe8bec18dc04373c3d0 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -136,16 +136,43 @@ struct dm_ioctl {
 	 * For output, the ioctls return the event number, not the cookie.
 	 */
 	__u32 event_nr;      	/* in/out */
-	__u32 padding;
+
+	union {
+		/* valid if DM_VERSION_MAJOR is used */
+		__u32 padding;		/* padding */
+		/* valid if DM_VERSION_MAJOR_STRICT is used */
+		__u32 diskseq_low;	/* in/out: low 4 bytes of the diskseq */
+	};
 
 	__u64 dev;		/* in/out */
 
 	char name[DM_NAME_LEN];	/* device name */
 	char uuid[DM_UUID_LEN];	/* unique identifier for
 				 * the block device */
-	char data[7];		/* padding or data */
+	union {
+		/* valid if DM_VERSION_MAJOR is used */
+		char data[7];	/* padding or data */
+		/* valid if DM_VERSION_MAJOR_STRICT is used */
+		struct {
+			char _padding[3];	/* padding */
+			__u32 diskseq_high;	/* in/out: high 4 bytes of the diskseq */
+		} __attribute__((packed));
+	};
 };
 
+__attribute__((always_inline)) static inline __u64
+dm_get_diskseq(const struct dm_ioctl *_i)
+{
+	return (__u64)_i->diskseq_high << 32 | (__u64)_i->diskseq_low;
+}
+
+__attribute__((always_inline)) static inline void
+dm_set_diskseq(struct dm_ioctl *_i, __u64 _diskseq)
+{
+	_i->diskseq_low = (__u32)(_diskseq & 0xFFFFFFFFU);
+	_i->diskseq_high = (__u32)(_diskseq >> 32);
+}
+
 /*
  * Used to specify tables.  These structures appear after the
  * dm_ioctl.
@@ -402,6 +429,6 @@ enum {
  * If DM_VERSION_MAJOR_STRICT is used, these flags are reserved and
  * must be zeroed.
  */
-#define DM_STRICT_ONLY_FLAGS ((__u32)0xFFF00004)
+#define DM_STRICT_ONLY_FLAGS ((__u32)(~((1UL << 19) - 1) | 1 << 9 | 1 << 7))
 
 #endif				/* _LINUX_DM_IOCTL_H */
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 09/16] device-mapper: Allow userspace to suppress uevent generation
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (7 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 08/16] device-mapper: Allow userspace to provide expected diskseq Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 10/16] device-mapper: Refuse to create device named "control" Demi Marie Obenour
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Userspace can use this to avoid spamming udev with events that udev
should ignore.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-core.h          |  2 +
 drivers/md/dm-ioctl.c         | 78 ++++++++++++++++++-----------------
 drivers/md/dm.c               |  5 ++-
 include/linux/device-mapper.h |  2 +-
 include/uapi/linux/dm-ioctl.h | 14 +++++--
 5 files changed, 57 insertions(+), 44 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index aecab0c0720f77ae2a0ab048304ea3d1023f9959..a033f85d1a9d9b3d8ec893efd6552fb48d2b3541 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -115,6 +115,8 @@ struct mapped_device {
 
 	/* for blk-mq request-based DM support */
 	bool init_tio_pdu:1;
+	/* If set, do not emit any uevents. */
+	bool disable_uevents:1;
 	struct blk_mq_tag_set *tag_set;
 
 	struct dm_stats stats;
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 01cdf57bcafbf7f3e1b8304eec28792c6b24642d..52aa5505d23b2f3d9c0faf6e8a91b74cd7845581 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -814,6 +814,11 @@ static struct dm_table *dm_get_live_or_inactive_table(struct mapped_device *md,
 		dm_get_inactive_table(md, srcu_idx) : dm_get_live_table(md, srcu_idx);
 }
 
+static inline bool sloppy_checks(const struct dm_ioctl *param)
+{
+	return param->version[0] < DM_VERSION_MAJOR_STRICT;
+}
+
 /*
  * Fills in a dm_ioctl structure, ready for sending back to
  * userland.
@@ -872,7 +877,7 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
 		dm_put_live_table(md, srcu_idx);
 	}
 
-	if (param->version[0] >= DM_VERSION_MAJOR_STRICT)
+	if (!sloppy_checks(param))
 		dm_set_diskseq(param, disk->diskseq);
 }
 
@@ -888,7 +893,7 @@ static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_si
 	if (param->flags & DM_PERSISTENT_DEV_FLAG)
 		m = MINOR(huge_decode_dev(param->dev));
 
-	r = dm_create(m, &md);
+	r = dm_create(m, &md, param->flags & DM_DISABLE_UEVENTS_FLAG);
 	if (r)
 		return r;
 
@@ -1452,11 +1457,6 @@ static int next_target(struct dm_target_spec *last, uint32_t next, const char *e
 	return 0;
 }
 
-static inline bool sloppy_checks(const struct dm_ioctl *param)
-{
-	return param->version[0] < DM_VERSION_MAJOR_STRICT;
-}
-
 static bool no_non_nul_after_nul(const char *untrusted_str, size_t size,
 				 unsigned int cmd, const char *msg)
 {
@@ -1928,59 +1928,61 @@ static int target_message(struct file *filp, struct dm_ioctl *param, size_t para
  * Implementation of open/close/ioctl on the special char device.
  *---------------------------------------------------------------
  */
-static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags, uint32_t *supported_flags)
+static ioctl_fn lookup_ioctl(unsigned int cmd, bool strict, int *ioctl_flags, uint32_t *supported_flags)
 {
 	static const struct {
 		int cmd;
 		int flags;
 		ioctl_fn fn;
 		uint32_t supported_flags;
+		uint32_t strict_flags;
 	} _ioctls[] = {
 		/* Macro to make the structure initializers somewhat readable */
-#define I(cmd, flags, fn, supported_flags) {							\
-	(cmd),											\
-	(flags),										\
-	(fn),											\
-	/*											\
-	 * Supported flags in sloppy mode must not include anything in DM_STRICT_ONLY_FLAGS.	\
-	 * Use BUILD_BUG_ON_ZERO to check for that.						\
-	 */											\
-	(supported_flags) | BUILD_BUG_ON_ZERO((supported_flags) & DM_STRICT_ONLY_FLAGS),	\
+#define I(cmd, flags, fn, supported_flags, strict_flags) {						\
+	(cmd),												\
+	(flags),											\
+	(fn),												\
+	/*												\
+	 * Supported flags in sloppy mode must not include anything in DM_STRICT_ONLY_FLAGS.		\
+	 * Use BUILD_BUG_ON_ZERO to check for that.							\
+	 */												\
+	(supported_flags) | BUILD_BUG_ON_ZERO((supported_flags) & DM_STRICT_ONLY_FLAGS),		\
+	(strict_flags) | (supported_flags) | BUILD_BUG_ON_ZERO((supported_flags) & (strict_flags)),	\
 }
-		I(DM_VERSION_CMD, 0, NULL, 0), /* version is dealt with elsewhere */
+		I(DM_VERSION_CMD, 0, NULL, 0, 0), /* version is dealt with elsewhere */
 		I(DM_REMOVE_ALL_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, remove_all,
-		 DM_DEFERRED_REMOVE),
-		I(DM_LIST_DEVICES_CMD, 0, list_devices, DM_UUID_FLAG),
+		 DM_DEFERRED_REMOVE, 0),
+		I(DM_LIST_DEVICES_CMD, 0, list_devices, DM_UUID_FLAG, 0),
 		I(DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_create,
-		 DM_PERSISTENT_DEV_FLAG),
+		 DM_PERSISTENT_DEV_FLAG, DM_DISABLE_UEVENTS_FLAG),
 		I(DM_DEV_REMOVE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_remove,
-		 DM_DEFERRED_REMOVE),
+		 DM_DEFERRED_REMOVE, 0),
 		I(DM_DEV_RENAME_CMD, IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_rename,
-		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_UUID_FLAG),
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_UUID_FLAG, 0),
 		I(DM_DEV_SUSPEND_CMD, IOCTL_FLAGS_NO_PARAMS, dev_suspend,
-		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_SUSPEND_FLAG | DM_SKIP_LOCKFS_FLAG | DM_NOFLUSH_FLAG),
-		I(DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status, DM_QUERY_INACTIVE_TABLE_FLAG),
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_SUSPEND_FLAG | DM_SKIP_LOCKFS_FLAG | DM_NOFLUSH_FLAG, 0),
+		I(DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status, DM_QUERY_INACTIVE_TABLE_FLAG, 0),
 		I(DM_DEV_WAIT_CMD, IOCTL_FLAGS_TAKES_EVENT_NR, dev_wait,
-		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_STATUS_TABLE_FLAG | DM_NOFLUSH_FLAG),
-		I(DM_TABLE_LOAD_CMD, 0, table_load, DM_QUERY_INACTIVE_TABLE_FLAG | DM_READONLY_FLAG),
-		I(DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear, DM_QUERY_INACTIVE_TABLE_FLAG),
-		I(DM_TABLE_DEPS_CMD, 0, table_deps, DM_QUERY_INACTIVE_TABLE_FLAG),
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_STATUS_TABLE_FLAG | DM_NOFLUSH_FLAG, 0),
+		I(DM_TABLE_LOAD_CMD, 0, table_load, DM_QUERY_INACTIVE_TABLE_FLAG | DM_READONLY_FLAG, 0),
+		I(DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear, DM_QUERY_INACTIVE_TABLE_FLAG, 0),
+		I(DM_TABLE_DEPS_CMD, 0, table_deps, DM_QUERY_INACTIVE_TABLE_FLAG, 0),
 		I(DM_TABLE_STATUS_CMD, 0, table_status,
-		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_STATUS_TABLE_FLAG | DM_NOFLUSH_FLAG),
+		 DM_QUERY_INACTIVE_TABLE_FLAG | DM_STATUS_TABLE_FLAG | DM_NOFLUSH_FLAG, 0),
 
-		I(DM_LIST_VERSIONS_CMD, 0, list_versions, 0),
+		I(DM_LIST_VERSIONS_CMD, 0, list_versions, 0, 0),
 
-		I(DM_TARGET_MSG_CMD, 0, target_message, DM_QUERY_INACTIVE_TABLE_FLAG),
-		I(DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry, 0),
-		I(DM_DEV_ARM_POLL_CMD, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll, 0),
-		I(DM_GET_TARGET_VERSION_CMD, 0, get_target_version, 0),
+		I(DM_TARGET_MSG_CMD, 0, target_message, DM_QUERY_INACTIVE_TABLE_FLAG, 0),
+		I(DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry, 0, 0),
+		I(DM_DEV_ARM_POLL_CMD, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll, 0, 0),
+		I(DM_GET_TARGET_VERSION_CMD, 0, get_target_version, 0, 0),
 	};
 
 	if (unlikely(cmd >= ARRAY_SIZE(_ioctls)))
 		return NULL;
 
 	cmd = array_index_nospec(cmd, ARRAY_SIZE(_ioctls));
-	*supported_flags = _ioctls[cmd].supported_flags;
+	*supported_flags = strict ? _ioctls[cmd].strict_flags : _ioctls[cmd].supported_flags;
 	*ioctl_flags = _ioctls[cmd].flags;
 	return _ioctls[cmd].fn;
 }
@@ -2233,7 +2235,7 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	if (cmd == DM_VERSION_CMD)
 		return 0;
 
-	fn = lookup_ioctl(cmd, &ioctl_flags, &supported_flags);
+	fn = lookup_ioctl(cmd, !sloppy_checks(&param_kernel), &ioctl_flags, &supported_flags);
 	if (!fn) {
 		DMERR("dm_ctl_ioctl: unknown command 0x%x", command);
 		return -ENOTTY;
@@ -2451,7 +2453,7 @@ int __init dm_early_create(struct dm_ioctl *dmi,
 		m = MINOR(huge_decode_dev(dmi->dev));
 
 	/* alloc dm device */
-	r = dm_create(m, &md);
+	r = dm_create(m, &md, false);
 	if (r)
 		return r;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3b694ba3a106e68d4c0d5e64cd9136cf7abce237..efdf70a331cb681a88490f45d26259c29ddac850 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2276,13 +2276,14 @@ static struct dm_table *__unbind(struct mapped_device *md)
 /*
  * Constructor for a new device.
  */
-int dm_create(int minor, struct mapped_device **result)
+int dm_create(int minor, struct mapped_device **result, bool disable_uevents)
 {
 	struct mapped_device *md;
 
 	md = alloc_dev(minor);
 	if (!md)
 		return -ENXIO;
+	md->disable_uevents = disable_uevents;
 
 	dm_ima_reset_data(md);
 
@@ -2999,6 +3000,8 @@ int dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
 	char udev_cookie[DM_COOKIE_LENGTH];
 	char *envp[3] = { NULL, NULL, NULL };
 	char **envpp = envp;
+	if (md->disable_uevents)
+		return 0;
 	if (cookie) {
 		snprintf(udev_cookie, DM_COOKIE_LENGTH, "%s=%u",
 			 DM_COOKIE_ENV_VAR_NAME, cookie);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a52d2b9a68460ac7951ad6ebe76d9a1cfccf7afb..7c8d7a7e8798d20e517e2264c06772ecd8b41ef3 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -463,7 +463,7 @@ void dm_consume_args(struct dm_arg_set *as, unsigned int num_args);
  * DM_ANY_MINOR chooses the next available minor number.
  */
 #define DM_ANY_MINOR (-1)
-int dm_create(int minor, struct mapped_device **md);
+int dm_create(int minor, struct mapped_device **md, bool disable_uevents);
 
 /*
  * Reference counting for md.
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index 5647b218f24b626f5c1cefe8bec18dc04373c3d0..07cc5bbb6944ebaa42ddfec6fd5e0413c535e7ff 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -356,8 +356,16 @@ enum {
 #define DM_BUFFER_FULL_FLAG	(1 << 8) /* Out */
 
 /*
- * This flag is now ignored if DM_VERSION_MAJOR is used, and causes
- * -EINVAL if DM_VERSION_MAJOR_STRICT is used.
+ * This flag is only recognized when DM_VERSION_MAJOR_STRICT is used.
+ * It tells the kernel to not generate any uevents for the newly-created
+ * device.  Using it outside of DM_DEV_CREATE results in -EINVAL.  When
+ * DM_VERSION_MAJOR is used this flag is ignored.
+ */
+#define DM_DISABLE_UEVENTS_FLAG	(1 << 9) /* In */
+
+/*
+ * This flag is now ignored if DM_VERSION_MAJOR is used.  When
+ * DM_VERSION_MAJOR_STRICT is used it is an alias for DM_DISABLE_UEVENT_FLAG.
  */
 #define DM_SKIP_BDGET_FLAG	(1 << 9) /* In */
 
@@ -426,8 +434,6 @@ enum {
 
 /*
  * If DM_VERSION_MAJOR is used, these flags are ignored by the kernel.
- * If DM_VERSION_MAJOR_STRICT is used, these flags are reserved and
- * must be zeroed.
  */
 #define DM_STRICT_ONLY_FLAGS ((__u32)(~((1UL << 19) - 1) | 1 << 9 | 1 << 7))
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 10/16] device-mapper: Refuse to create device named "control"
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (8 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 09/16] device-mapper: Allow userspace to suppress uevent generation Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 11/16] device-mapper: "." and ".." are not valid symlink names Demi Marie Obenour
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Typical userspace setups create a symlink under /dev/mapper with the
name of the device, but /dev/mapper/control is reserved for the control
device.  Therefore, trying to create such a device is almost certain to
be a userspace bug.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 52aa5505d23b2f3d9c0faf6e8a91b74cd7845581..9ae00e3c1a72c19575814cf473774835b364320b 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -771,7 +771,12 @@ static int get_target_version(struct file *filp, struct dm_ioctl *param, size_t
 static int check_name(const char *name)
 {
 	if (strchr(name, '/')) {
-		DMERR("invalid device name");
+		DMERR("device name cannot contain '/'");
+		return -EINVAL;
+	}
+
+	if (strcmp(name, DM_CONTROL_NODE) == 0) {
+		DMERR("device name cannot be \"%s\"", DM_CONTROL_NODE);
 		return -EINVAL;
 	}
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 11/16] device-mapper: "." and ".." are not valid symlink names
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (9 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 10/16] device-mapper: Refuse to create device named "control" Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 12/16] device-mapper: inform caller about already-existing device Demi Marie Obenour
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Using either of these is going to greatly confuse userspace, as they are
not valid symlink names and so creating the usual /dev/mapper/NAME
symlink will not be possible.  As creating a device with either of these
names is almost certainly a userspace bug, just error out.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 9ae00e3c1a72c19575814cf473774835b364320b..17ece816d490b6c40d019da131ade44c9a201dab 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -775,8 +775,10 @@ static int check_name(const char *name)
 		return -EINVAL;
 	}
 
-	if (strcmp(name, DM_CONTROL_NODE) == 0) {
-		DMERR("device name cannot be \"%s\"", DM_CONTROL_NODE);
+	if (strcmp(name, DM_CONTROL_NODE) == 0 ||
+	    strcmp(name, ".") == 0 ||
+	    strcmp(name, "..") == 0) {
+		DMERR("device name cannot be \"%s\", \".\", or \"..\"", DM_CONTROL_NODE);
 		return -EINVAL;
 	}
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 12/16] device-mapper: inform caller about already-existing device
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (10 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 11/16] device-mapper: "." and ".." are not valid symlink names Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 13/16] xen-blkback: Implement diskseq checks Demi Marie Obenour
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Not only is this helpful for debugging, it also saves the caller an
ioctl in the case where a device should be used if it exists or created
otherwise.  To ensure existing userspace is not broken, this feature is
only enabled in strict mode.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/md/dm-ioctl.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 17ece816d490b6c40d019da131ade44c9a201dab..44425093d3b908abf80e05e1fc99a26b17e18a42 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -256,11 +256,13 @@ static void free_cell(struct hash_cell *hc)
 	}
 }
 
+static void __dev_status(struct mapped_device *md, struct dm_ioctl *param);
+
 /*
  * The kdev_t and uuid of a device can never change once it is
  * initially inserted.
  */
-static int dm_hash_insert(const char *name, const char *uuid, struct mapped_device *md)
+static int dm_hash_insert(const char *name, const char *uuid, struct mapped_device *md, struct dm_ioctl *param)
 {
 	struct hash_cell *cell, *hc;
 
@@ -277,6 +279,8 @@ static int dm_hash_insert(const char *name, const char *uuid, struct mapped_devi
 	down_write(&_hash_lock);
 	hc = __get_name_cell(name);
 	if (hc) {
+		if (param)
+			__dev_status(hc->md, param);
 		dm_put(hc->md);
 		goto bad;
 	}
@@ -287,6 +291,8 @@ static int dm_hash_insert(const char *name, const char *uuid, struct mapped_devi
 		hc = __get_uuid_cell(uuid);
 		if (hc) {
 			__unlink_name(cell);
+			if (param)
+				__dev_status(hc->md, param);
 			dm_put(hc->md);
 			goto bad;
 		}
@@ -901,12 +907,14 @@ static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_si
 		m = MINOR(huge_decode_dev(param->dev));
 
 	r = dm_create(m, &md, param->flags & DM_DISABLE_UEVENTS_FLAG);
-	if (r)
+	if (r) {
+		DMERR("Could not create device-mapper device");
 		return r;
+	}
 
 	param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
 
-	r = dm_hash_insert(param->name, *param->uuid ? param->uuid : NULL, md);
+	r = dm_hash_insert(param->name, *param->uuid ? param->uuid : NULL, md, param);
 	if (r) {
 		dm_put(md);
 		dm_destroy(md);
@@ -2269,7 +2277,6 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 		goto out;
 	/* This XOR keeps only the flags validate_params has changed. */
 	old_flags ^= param->flags;
-
 	param->data_size = sloppy_checks(param) ? offsetof(struct dm_ioctl, data) : sizeof(struct dm_ioctl);
 	r = fn(file, param, input_param_size);
 
@@ -2284,9 +2291,14 @@ static int ctl_ioctl(struct file *file, uint command, struct dm_ioctl __user *us
 	param->flags |= old_flags;
 
 	/*
-	 * Copy the results back to userland.
+	 * Copy the results back to userland if either:
+	 *
+	 * - The ioctl succeeded.
+	 * - The ioctl is DM_DEV_CREATE, the return value is -EBUSY,
+	 *   and strict parameter checking is enabled.
 	 */
-	if (!r && copy_to_user(user, param, param->data_size))
+	if ((!r || (!sloppy_checks(param) && cmd == DM_DEV_CREATE_CMD && r == -EBUSY)) &&
+	    copy_to_user(user, param, param->data_size))
 		r = -EFAULT;
 
 out:
@@ -2465,7 +2477,7 @@ int __init dm_early_create(struct dm_ioctl *dmi,
 		return r;
 
 	/* hash insert */
-	r = dm_hash_insert(dmi->name, *dmi->uuid ? dmi->uuid : NULL, md);
+	r = dm_hash_insert(dmi->name, *dmi->uuid ? dmi->uuid : NULL, md, NULL);
 	if (r)
 		goto err_destroy_dm;
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (11 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 12/16] device-mapper: inform caller about already-existing device Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-06-06  8:25   ` Roger Pau Monné
  2023-05-30 20:31 ` [PATCH v2 14/16] block, loop: Increment diskseq when releasing a loop device Demi Marie Obenour
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

This allows specifying a disk sequence number in XenStore.  If it does
not match the disk sequence number of the underlying device, the device
will not be exported and a warning will be logged.  Userspace can use
this to eliminate race conditions due to major/minor number reuse.
Old kernels do not support the new syntax, but a later patch will allow
userspace to discover that the new syntax is supported.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
 1 file changed, 89 insertions(+), 23 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -24,6 +24,7 @@ struct backend_info {
 	struct xenbus_watch	backend_watch;
 	unsigned		major;
 	unsigned		minor;
+	unsigned long long	diskseq;
 	char			*mode;
 };
 
@@ -479,7 +480,7 @@ static void xen_vbd_free(struct xen_vbd *vbd)
 
 static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 			  unsigned major, unsigned minor, int readonly,
-			  int cdrom)
+			  bool cdrom, u64 diskseq)
 {
 	struct xen_vbd *vbd;
 	struct block_device *bdev;
@@ -507,6 +508,26 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
 		xen_vbd_free(vbd);
 		return -ENOENT;
 	}
+
+	if (diskseq) {
+		struct gendisk *disk = bdev->bd_disk;
+
+		if (unlikely(disk == NULL)) {
+			pr_err("%s: device %08x has no gendisk\n",
+			       __func__, vbd->pdevice);
+			xen_vbd_free(vbd);
+			return -EFAULT;
+		}
+
+		if (unlikely(disk->diskseq != diskseq)) {
+			pr_warn("%s: device %08x has incorrect sequence "
+				"number 0x%llx (expected 0x%llx)\n",
+				__func__, vbd->pdevice, disk->diskseq, diskseq);
+			xen_vbd_free(vbd);
+			return -ENODEV;
+		}
+	}
+
 	vbd->size = vbd_sz(vbd);
 
 	if (cdrom || disk_to_cdi(vbd->bdev->bd_disk))
@@ -707,6 +728,9 @@ static void backend_changed(struct xenbus_watch *watch,
 	int cdrom = 0;
 	unsigned long handle;
 	char *device_type;
+	char *diskseq_str = NULL;
+	int diskseq_len;
+	unsigned long long diskseq;
 
 	pr_debug("%s %p %d\n", __func__, dev, dev->otherend_id);
 
@@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
 		return;
 	}
 
-	if (be->major | be->minor) {
-		if (be->major != major || be->minor != minor)
-			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
-				be->major, be->minor, major, minor);
+	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
+	if (IS_ERR(diskseq_str)) {
+		int err = PTR_ERR(diskseq_str);
+		diskseq_str = NULL;
+
+		/*
+		 * If this does not exist, it means legacy userspace that does not
+		 * support diskseq.
+		 */
+		if (unlikely(!XENBUS_EXIST_ERR(err))) {
+			xenbus_dev_fatal(dev, err, "reading diskseq");
+			return;
+		}
+		diskseq = 0;
+	} else if (diskseq_len <= 0) {
+		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
+		goto fail;
+	} else if (diskseq_len > 16) {
+		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
+				 diskseq_len);
+		goto fail;
+	} else if (diskseq_str[0] == '0') {
+		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
+		goto fail;
+	} else {
+		char *diskseq_end;
+		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
+		if (diskseq_end != diskseq_str + diskseq_len) {
+			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
+			goto fail;
+		}
+		kfree(diskseq_str);
+		diskseq_str = NULL;
+	}
+
+	if (be->major | be->minor | be->diskseq) {
+		if (be->major != major || be->minor != minor || be->diskseq != diskseq)
+			pr_warn("changing physical device (from %x:%x:%llx to %x:%x:%llx)"
+				" not supported.\n",
+				be->major, be->minor, be->diskseq, major, minor, diskseq);
 		return;
 	}
 
@@ -756,29 +816,35 @@ static void backend_changed(struct xenbus_watch *watch,
 
 	be->major = major;
 	be->minor = minor;
+	be->diskseq = diskseq;
 
 	err = xen_vbd_create(be->blkif, handle, major, minor,
-			     !strchr(be->mode, 'w'), cdrom);
-
-	if (err)
-		xenbus_dev_fatal(dev, err, "creating vbd structure");
-	else {
-		err = xenvbd_sysfs_addif(dev);
-		if (err) {
-			xen_vbd_free(&be->blkif->vbd);
-			xenbus_dev_fatal(dev, err, "creating sysfs entries");
-		}
-	}
+			     !strchr(be->mode, 'w'), cdrom, diskseq);
 
 	if (err) {
-		kfree(be->mode);
-		be->mode = NULL;
-		be->major = 0;
-		be->minor = 0;
-	} else {
-		/* We're potentially connected now */
-		xen_update_blkif_status(be->blkif);
+		xenbus_dev_fatal(dev, err, "creating vbd structure");
+		goto fail;
 	}
+
+	err = xenvbd_sysfs_addif(dev);
+	if (err) {
+		xenbus_dev_fatal(dev, err, "creating sysfs entries");
+		goto free_vbd;
+	}
+
+	/* We're potentially connected now */
+	xen_update_blkif_status(be->blkif);
+	return;
+
+free_vbd:
+	xen_vbd_free(&be->blkif->vbd);
+fail:
+	kfree(diskseq_str);
+	kfree(be->mode);
+	be->mode = NULL;
+	be->major = 0;
+	be->minor = 0;
+	be->diskseq = 0;
 }
 
 /*
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 14/16] block, loop: Increment diskseq when releasing a loop device
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (12 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 13/16] xen-blkback: Implement diskseq checks Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-05-30 20:31 ` [PATCH v2 15/16] xen-blkback: Minor cleanups Demi Marie Obenour
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

The previous patch for checking diskseq in blkback is not enough to
prevent the following race:

1. Program X opens a loop device
2. Program X gets the diskseq of the loop device.
3. Program X associates a file with the loop device.
4. Program X passes the loop device major, minor, and diskseq to
   something, such as Xen blkback.
5. Program X exits.
6. Program Y detaches the file from the loop device.
7. Program Y attaches a different file to the loop device.
8. Xen blkback finally gets around to opening the loop device and checks
   that the diskseq is what it expects it to be.  Even though the
   diskseq is the expected value, the result is that blkback is
   accessing the wrong file.

To prevent this race condition, increment the diskseq of a loop device
when it is detached from its file descriptor.  This causes blkback (or
any other program, for that matter) to fail at step 8.  Export the
inc_diskseq() function to make this possible.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
I considered destroying the loop device altogether instead of bumping
its diskseq, but was not able to accomplish that.  Suggestions welcome.
---
 block/genhd.c        | 1 +
 drivers/block/loop.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 1cb489b927d50ab06a84a4bfd6913ca8ba7318d4..c0ca2c387732171321555cd57565fbc606768505 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1502,3 +1502,4 @@ void inc_diskseq(struct gendisk *disk)
 {
 	disk->diskseq = atomic64_inc_return(&diskseq);
 }
+EXPORT_SYMBOL(inc_diskseq);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bc31bb7072a2cb7294d32066f5d0aa14130349b4..05ea5fb41508b4106f184dd6b4c37942716bdcac 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1205,6 +1205,12 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
 	if (!part_shift)
 		set_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
 	mutex_lock(&lo->lo_mutex);
+
+	/*
+	 * Increment the disk sequence number, so that userspace knows this
+	 * device now points to something else.
+	 */
+	inc_diskseq(lo->lo_disk);
 	lo->lo_state = Lo_unbound;
 	mutex_unlock(&lo->lo_mutex);
 
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 15/16] xen-blkback: Minor cleanups
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (13 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 14/16] block, loop: Increment diskseq when releasing a loop device Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-06-06  8:36   ` Roger Pau Monné
  2023-05-30 20:31 ` [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened Demi Marie Obenour
  2023-05-31 13:06 ` [dm-devel] [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Christoph Hellwig
  16 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

This adds a couple of BUILD_BUG_ON()s and moves some arithmetic after
the validation code that checks the arithmetic’s preconditions.  The
previous code was correct but could potentially trip sanitizers that
check for unsigned integer wraparound.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/block/xen-blkback/blkback.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c362f4ad80ab07bfb58caff0ed7da37dc1484fc5..ac760a08d559085ab875784f1c58cdf2ead95a43 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1342,6 +1342,8 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
 	nseg = req->operation == BLKIF_OP_INDIRECT ?
 	       req->u.indirect.nr_segments : req->u.rw.nr_segments;
 
+	BUILD_BUG_ON(offsetof(struct blkif_request, u.rw.id) != 8);
+	BUILD_BUG_ON(offsetof(struct blkif_request, u.indirect.id) != 8);
 	if (unlikely(nseg == 0 && operation_flags != REQ_PREFLUSH) ||
 	    unlikely((req->operation != BLKIF_OP_INDIRECT) &&
 		     (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
@@ -1365,13 +1367,13 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
 		preq.sector_number     = req->u.rw.sector_number;
 		for (i = 0; i < nseg; i++) {
 			pages[i]->gref = req->u.rw.seg[i].gref;
-			seg[i].nsec = req->u.rw.seg[i].last_sect -
-				req->u.rw.seg[i].first_sect + 1;
-			seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
 			if ((req->u.rw.seg[i].last_sect >= (XEN_PAGE_SIZE >> 9)) ||
 			    (req->u.rw.seg[i].last_sect <
 			     req->u.rw.seg[i].first_sect))
 				goto fail_response;
+			seg[i].nsec = req->u.rw.seg[i].last_sect -
+				req->u.rw.seg[i].first_sect + 1;
+			seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
 			preq.nr_sects += seg[i].nsec;
 		}
 	} else {
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (14 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 15/16] xen-blkback: Minor cleanups Demi Marie Obenour
@ 2023-05-30 20:31 ` Demi Marie Obenour
  2023-06-06  9:15   ` Roger Pau Monné
  2023-06-08 10:08   ` Roger Pau Monné
  2023-05-31 13:06 ` [dm-devel] [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Christoph Hellwig
  16 siblings, 2 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-05-30 20:31 UTC (permalink / raw)
  To: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Demi Marie Obenour, Marek Marczykowski-Górecki, linux-block,
	linux-kernel, xen-devel

Set "opened" to "0" before the hotplug script is called.  Once the
device node has been opened, set "opened" to "1".

"opened" is used exclusively by userspace.  It serves two purposes:

1. It tells userspace that the diskseq Xenstore entry is supported.

2. It tells userspace that it can wait for "opened" to be set to 1.
   Once "opened" is 1, blkback has a reference to the device, so
   userspace doesn't need to keep one.

Together, these changes allow userspace to use block devices with
delete-on-close behavior, such as loop devices with the autoclear flag
set or device-mapper devices with the deferred-remove flag set.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/block/xen-blkback/xenbus.c | 35 ++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 9c3eb148fbd802c74e626c3d7bcd69dcb09bd921..519a78aa9073d1faa1dce5c1b36e95ae58da534b 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -3,6 +3,20 @@
     Copyright (C) 2005 Rusty Russell <rusty@rustcorp.com.au>
     Copyright (C) 2005 XenSource Ltd
 
+In addition to the Xenstore nodes required by the Xen block device
+specification, this implementation of blkback uses a new Xenstore
+node: "opened".  blkback sets "opened" to "0" before the hotplug script
+is called.  Once the device node has been opened, blkback sets "opened"
+to "1".
+
+"opened" is read exclusively by userspace.  It serves two purposes:
+
+1. It tells userspace that diskseq@major:minor syntax for "physical-device" is
+   supported.
+
+2. It tells userspace that it can wait for "opened" to be set to 1 after writing
+   "physical-device".  Once "opened" is 1, blkback has a reference to the
+   device, so userspace doesn't need to keep one.
 
 */
 
@@ -699,6 +713,14 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
 	if (err)
 		pr_warn("%s write out 'max-ring-page-order' failed\n", __func__);
 
+	/*
+	 * This informs userspace that the "opened" node will be set to "1" when
+	 * the device has been opened successfully.
+	 */
+	err = xenbus_write(XBT_NIL, dev->nodename, "opened", "0");
+	if (err)
+		goto fail;
+
 	err = xenbus_switch_state(dev, XenbusStateInitWait);
 	if (err)
 		goto fail;
@@ -826,6 +848,19 @@ static void backend_changed(struct xenbus_watch *watch,
 		goto fail;
 	}
 
+	/*
+	 * Tell userspace that the device has been opened and that blkback has a
+	 * reference to it.  Userspace can then close the device or mark it as
+	 * delete-on-close, knowing that blkback will keep the device open as
+	 * long as necessary.
+	 */
+	err = xenbus_write(XBT_NIL, dev->nodename, "opened", "1");
+	if (err) {
+		xenbus_dev_fatal(dev, err, "%s: notifying userspace device has been opened",
+				 dev->nodename);
+		goto free_vbd;
+	}
+
 	err = xenvbd_sysfs_addif(dev);
 	if (err) {
 		xenbus_dev_fatal(dev, err, "creating sysfs entries");
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [dm-devel] [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback
  2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
                   ` (15 preceding siblings ...)
  2023-05-30 20:31 ` [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened Demi Marie Obenour
@ 2023-05-31 13:06 ` Christoph Hellwig
  16 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-31 13:06 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-block, xen-devel,
	Marek Marczykowski-Górecki, linux-kernel

On Tue, May 30, 2023 at 04:31:00PM -0400, Demi Marie Obenour wrote:
> This work aims to allow userspace to create and destroy block devices
> in a race-free way, and to allow them to be exposed to other Xen VMs via
> blkback without races.
> 
> Changes since v1:
> 
> - Several device-mapper fixes added.

Let's get these reviewed by the DM maintainers independently.  This
series is mixing up way too many things.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-05-30 20:31 ` [PATCH v2 13/16] xen-blkback: Implement diskseq checks Demi Marie Obenour
@ 2023-06-06  8:25   ` Roger Pau Monné
  2023-06-06 17:01     ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-06  8:25 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> This allows specifying a disk sequence number in XenStore.  If it does
> not match the disk sequence number of the underlying device, the device
> will not be exported and a warning will be logged.  Userspace can use
> this to eliminate race conditions due to major/minor number reuse.
> Old kernels do not support the new syntax, but a later patch will allow
> userspace to discover that the new syntax is supported.
> 
> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> ---
>  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
>  1 file changed, 89 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -24,6 +24,7 @@ struct backend_info {
>  	struct xenbus_watch	backend_watch;
>  	unsigned		major;
>  	unsigned		minor;
> +	unsigned long long	diskseq;

Since diskseq is declared as u64 in gendisk, better use the same type
here too?

>  	char			*mode;
>  };
>  
> @@ -479,7 +480,7 @@ static void xen_vbd_free(struct xen_vbd *vbd)
>  
>  static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  			  unsigned major, unsigned minor, int readonly,
> -			  int cdrom)
> +			  bool cdrom, u64 diskseq)
>  {
>  	struct xen_vbd *vbd;
>  	struct block_device *bdev;
> @@ -507,6 +508,26 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  		xen_vbd_free(vbd);
>  		return -ENOENT;
>  	}
> +
> +	if (diskseq) {
> +		struct gendisk *disk = bdev->bd_disk;

const.

> +
> +		if (unlikely(disk == NULL)) {
> +			pr_err("%s: device %08x has no gendisk\n",
> +			       __func__, vbd->pdevice);
> +			xen_vbd_free(vbd);
> +			return -EFAULT;

ENODEV or ENOENT might be more accurate IMO.

> +		}
> +
> +		if (unlikely(disk->diskseq != diskseq)) {
> +			pr_warn("%s: device %08x has incorrect sequence "
> +				"number 0x%llx (expected 0x%llx)\n",

I prefer %#llx, and likely pr_err like above.  Also I think it's now
preferred to not split printed lines, so that `grep "has incorrect
sequence number" ...` can find the instance.

> +				__func__, vbd->pdevice, disk->diskseq, diskseq);
> +			xen_vbd_free(vbd);
> +			return -ENODEV;
> +		}
> +	}
> +
>  	vbd->size = vbd_sz(vbd);
>  
>  	if (cdrom || disk_to_cdi(vbd->bdev->bd_disk))
> @@ -707,6 +728,9 @@ static void backend_changed(struct xenbus_watch *watch,
>  	int cdrom = 0;
>  	unsigned long handle;
>  	char *device_type;
> +	char *diskseq_str = NULL;

const, and I think there's no need to init to NULL.

> +	int diskseq_len;

unsigned int

> +	unsigned long long diskseq;

u64

>  
>  	pr_debug("%s %p %d\n", __func__, dev, dev->otherend_id);
>  
> @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
>  		return;
>  	}
>  
> -	if (be->major | be->minor) {
> -		if (be->major != major || be->minor != minor)
> -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> -				be->major, be->minor, major, minor);
> +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> +	if (IS_ERR(diskseq_str)) {
> +		int err = PTR_ERR(diskseq_str);
> +		diskseq_str = NULL;
> +
> +		/*
> +		 * If this does not exist, it means legacy userspace that does not
> +		 * support diskseq.
> +		 */
> +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> +			xenbus_dev_fatal(dev, err, "reading diskseq");
> +			return;
> +		}
> +		diskseq = 0;
> +	} else if (diskseq_len <= 0) {
> +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> +		goto fail;
> +	} else if (diskseq_len > 16) {
> +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> +				 diskseq_len);
> +		goto fail;
> +	} else if (diskseq_str[0] == '0') {
> +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> +		goto fail;
> +	} else {
> +		char *diskseq_end;
> +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> +		if (diskseq_end != diskseq_str + diskseq_len) {
> +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> +			goto fail;
> +		}
> +		kfree(diskseq_str);
> +		diskseq_str = NULL;
> +	}

Won't it be simpler to use xenbus_scanf() with %llx formatter?

Also, we might want to fetch "physical-device" and "diskseq" inside
the same xenstore transaction.

Also, you tie this logic to the "physical-device" watch, which
strictly implies that the "diskseq" node must be written to xenstore
before the "physical-device" node.  This seems fragile, but I don't
see much better optiono since the "diskseq" is optional.

The node and its behaviour should be documented in blkif.h.

> +	if (be->major | be->minor | be->diskseq) {
> +		if (be->major != major || be->minor != minor || be->diskseq != diskseq)
> +			pr_warn("changing physical device (from %x:%x:%llx to %x:%x:%llx)"
> +				" not supported.\n",
> +				be->major, be->minor, be->diskseq, major, minor, diskseq);
>  		return;

You are leaking diskseq_str here, and in all the error cases between
here and up to the call to xen_vbd_create().

It might be better to simnply free diskseq_str once you are done with
the processing, and have set diskseq.

Otherwise see my suggestion of using xenbus_scanf().

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 15/16] xen-blkback: Minor cleanups
  2023-05-30 20:31 ` [PATCH v2 15/16] xen-blkback: Minor cleanups Demi Marie Obenour
@ 2023-06-06  8:36   ` Roger Pau Monné
  0 siblings, 0 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-06  8:36 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, May 30, 2023 at 04:31:15PM -0400, Demi Marie Obenour wrote:
> This adds a couple of BUILD_BUG_ON()s and moves some arithmetic after
> the validation code that checks the arithmetic’s preconditions.  The
> previous code was correct but could potentially trip sanitizers that
> check for unsigned integer wraparound.
> 
> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> ---
>  drivers/block/xen-blkback/blkback.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index c362f4ad80ab07bfb58caff0ed7da37dc1484fc5..ac760a08d559085ab875784f1c58cdf2ead95a43 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1342,6 +1342,8 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
>  	nseg = req->operation == BLKIF_OP_INDIRECT ?
>  	       req->u.indirect.nr_segments : req->u.rw.nr_segments;
>  
> +	BUILD_BUG_ON(offsetof(struct blkif_request, u.rw.id) != 8);
> +	BUILD_BUG_ON(offsetof(struct blkif_request, u.indirect.id) != 8);

Won't it be clearer as:

offsetof(struct blkif_request, u.rw.id) != offsetof(struct blkif_request, u.indirect.id)

We don't really care about the specific offset value, but both layouts
must match.

Also, you likely want to check for all requests types, not just rw and
indirect (discard and other).

>  	if (unlikely(nseg == 0 && operation_flags != REQ_PREFLUSH) ||
>  	    unlikely((req->operation != BLKIF_OP_INDIRECT) &&
>  		     (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
> @@ -1365,13 +1367,13 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
>  		preq.sector_number     = req->u.rw.sector_number;
>  		for (i = 0; i < nseg; i++) {
>  			pages[i]->gref = req->u.rw.seg[i].gref;
> -			seg[i].nsec = req->u.rw.seg[i].last_sect -
> -				req->u.rw.seg[i].first_sect + 1;
> -			seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
>  			if ((req->u.rw.seg[i].last_sect >= (XEN_PAGE_SIZE >> 9)) ||
>  			    (req->u.rw.seg[i].last_sect <
>  			     req->u.rw.seg[i].first_sect))
>  				goto fail_response;
> +			seg[i].nsec = req->u.rw.seg[i].last_sect -
> +				req->u.rw.seg[i].first_sect + 1;
> +			seg[i].offset = (req->u.rw.seg[i].first_sect << 9);

Parentheses here are unneeded.  If we do that move, we might as well
move the assignation of pages[i]->gref as well, just to avoid
assigning to gref to take the failure path.

I do think however wraparound is not an issue here, as we will discard
the result.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-05-30 20:31 ` [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened Demi Marie Obenour
@ 2023-06-06  9:15   ` Roger Pau Monné
  2023-06-06 17:31     ` Demi Marie Obenour
  2023-06-08 10:08   ` Roger Pau Monné
  1 sibling, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-06  9:15 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> Set "opened" to "0" before the hotplug script is called.  Once the
> device node has been opened, set "opened" to "1".
> 
> "opened" is used exclusively by userspace.  It serves two purposes:
> 
> 1. It tells userspace that the diskseq Xenstore entry is supported.
> 
> 2. It tells userspace that it can wait for "opened" to be set to 1.
>    Once "opened" is 1, blkback has a reference to the device, so
>    userspace doesn't need to keep one.
> 
> Together, these changes allow userspace to use block devices with
> delete-on-close behavior, such as loop devices with the autoclear flag
> set or device-mapper devices with the deferred-remove flag set.

There was some work in the past to allow reloading blkback as a
module, it's clear that using delete-on-close won't work if attempting
to reload blkback.

Isn't there some existing way to check whether a device is opened?
(stat syscall maybe?).

I would like to avoid adding more xenstore blkback state if such
information can be fetched from other methods.

> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> ---
>  drivers/block/xen-blkback/xenbus.c | 35 ++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 9c3eb148fbd802c74e626c3d7bcd69dcb09bd921..519a78aa9073d1faa1dce5c1b36e95ae58da534b 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -3,6 +3,20 @@
>      Copyright (C) 2005 Rusty Russell <rusty@rustcorp.com.au>
>      Copyright (C) 2005 XenSource Ltd
>  
> +In addition to the Xenstore nodes required by the Xen block device
> +specification, this implementation of blkback uses a new Xenstore
> +node: "opened".  blkback sets "opened" to "0" before the hotplug script
> +is called.  Once the device node has been opened, blkback sets "opened"
> +to "1".
> +
> +"opened" is read exclusively by userspace.  It serves two purposes:
> +
> +1. It tells userspace that diskseq@major:minor syntax for "physical-device" is
> +   supported.
> +
> +2. It tells userspace that it can wait for "opened" to be set to 1 after writing
> +   "physical-device".  Once "opened" is 1, blkback has a reference to the
> +   device, so userspace doesn't need to keep one.
>  
>  */
>  
> @@ -699,6 +713,14 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
>  	if (err)
>  		pr_warn("%s write out 'max-ring-page-order' failed\n", __func__);
>  
> +	/*
> +	 * This informs userspace that the "opened" node will be set to "1" when
> +	 * the device has been opened successfully.
> +	 */
> +	err = xenbus_write(XBT_NIL, dev->nodename, "opened", "0");
> +	if (err)
> +		goto fail;
> +

You would need to set "opened" before registering the xenstore backend
watch AFAICT, or else it could be racy.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-06  8:25   ` Roger Pau Monné
@ 2023-06-06 17:01     ` Demi Marie Obenour
  2023-06-07  8:20       ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-06 17:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 6483 bytes --]

On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > This allows specifying a disk sequence number in XenStore.  If it does
> > not match the disk sequence number of the underlying device, the device
> > will not be exported and a warning will be logged.  Userspace can use
> > this to eliminate race conditions due to major/minor number reuse.
> > Old kernels do not support the new syntax, but a later patch will allow
> > userspace to discover that the new syntax is supported.
> > 
> > Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> > ---
> >  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
> >  1 file changed, 89 insertions(+), 23 deletions(-)
> > 
> > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> > --- a/drivers/block/xen-blkback/xenbus.c
> > +++ b/drivers/block/xen-blkback/xenbus.c
> > @@ -24,6 +24,7 @@ struct backend_info {
> >  	struct xenbus_watch	backend_watch;
> >  	unsigned		major;
> >  	unsigned		minor;
> > +	unsigned long long	diskseq;
> 
> Since diskseq is declared as u64 in gendisk, better use the same type
> here too?

simple_strtoull() returns an unsigned long long, and C permits unsigned
long long to be larger than 64 bits.

> >  	char			*mode;
> >  };
> >  
> > @@ -479,7 +480,7 @@ static void xen_vbd_free(struct xen_vbd *vbd)
> >  
> >  static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
> >  			  unsigned major, unsigned minor, int readonly,
> > -			  int cdrom)
> > +			  bool cdrom, u64 diskseq)
> >  {
> >  	struct xen_vbd *vbd;
> >  	struct block_device *bdev;
> > @@ -507,6 +508,26 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
> >  		xen_vbd_free(vbd);
> >  		return -ENOENT;
> >  	}
> > +
> > +	if (diskseq) {
> > +		struct gendisk *disk = bdev->bd_disk;
> 
> const.
> 
> > +
> > +		if (unlikely(disk == NULL)) {
> > +			pr_err("%s: device %08x has no gendisk\n",
> > +			       __func__, vbd->pdevice);
> > +			xen_vbd_free(vbd);
> > +			return -EFAULT;
> 
> ENODEV or ENOENT might be more accurate IMO.

I will drop it, as this turns out to be unreachable code.

> > +		}
> > +
> > +		if (unlikely(disk->diskseq != diskseq)) {
> > +			pr_warn("%s: device %08x has incorrect sequence "
> > +				"number 0x%llx (expected 0x%llx)\n",
> 
> I prefer %#llx, and likely pr_err like above.  Also I think it's now
> preferred to not split printed lines, so that `grep "has incorrect
> sequence number" ...` can find the instance.

Ah, so _that_ is why I got a warning from checkpatch!

> > +				__func__, vbd->pdevice, disk->diskseq, diskseq);
> > +			xen_vbd_free(vbd);
> > +			return -ENODEV;
> > +		}
> > +	}
> > +
> >  	vbd->size = vbd_sz(vbd);
> >  
> >  	if (cdrom || disk_to_cdi(vbd->bdev->bd_disk))
> > @@ -707,6 +728,9 @@ static void backend_changed(struct xenbus_watch *watch,
> >  	int cdrom = 0;
> >  	unsigned long handle;
> >  	char *device_type;
> > +	char *diskseq_str = NULL;
> 
> const, and I think there's no need to init to NULL.
> 
> > +	int diskseq_len;
> 
> unsigned int
> 
> > +	unsigned long long diskseq;
> 
> u64
> 
> >  
> >  	pr_debug("%s %p %d\n", __func__, dev, dev->otherend_id);
> >  
> > @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
> >  		return;
> >  	}
> >  
> > -	if (be->major | be->minor) {
> > -		if (be->major != major || be->minor != minor)
> > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > -				be->major, be->minor, major, minor);
> > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > +	if (IS_ERR(diskseq_str)) {
> > +		int err = PTR_ERR(diskseq_str);
> > +		diskseq_str = NULL;
> > +
> > +		/*
> > +		 * If this does not exist, it means legacy userspace that does not
> > +		 * support diskseq.
> > +		 */
> > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > +			return;
> > +		}
> > +		diskseq = 0;
> > +	} else if (diskseq_len <= 0) {
> > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > +		goto fail;
> > +	} else if (diskseq_len > 16) {
> > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > +				 diskseq_len);
> > +		goto fail;
> > +	} else if (diskseq_str[0] == '0') {
> > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > +		goto fail;
> > +	} else {
> > +		char *diskseq_end;
> > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > +			goto fail;
> > +		}
> > +		kfree(diskseq_str);
> > +		diskseq_str = NULL;
> > +	}
> 
> Won't it be simpler to use xenbus_scanf() with %llx formatter?

xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
really should not.  Should this be fixed in xenbus_scanf()?

> Also, we might want to fetch "physical-device" and "diskseq" inside
> the same xenstore transaction.

Should the rest of the xenstore reads be included in the same
transaction?

> Also, you tie this logic to the "physical-device" watch, which
> strictly implies that the "diskseq" node must be written to xenstore
> before the "physical-device" node.  This seems fragile, but I don't
> see much better optiono since the "diskseq" is optional.

What about including the diskseq in the "physical-device" node?  Perhaps
use diskseq@major:minor syntax?

> The node and its behaviour should be documented in blkif.h.

Indeed so.

> > +	if (be->major | be->minor | be->diskseq) {
> > +		if (be->major != major || be->minor != minor || be->diskseq != diskseq)
> > +			pr_warn("changing physical device (from %x:%x:%llx to %x:%x:%llx)"
> > +				" not supported.\n",
> > +				be->major, be->minor, be->diskseq, major, minor, diskseq);
> >  		return;
> 
> You are leaking diskseq_str here, and in all the error cases between
> here and up to the call to xen_vbd_create().

I will fix this by moving the diskseq reading code into its own
function.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-06-06  9:15   ` Roger Pau Monné
@ 2023-06-06 17:31     ` Demi Marie Obenour
  2023-06-07  8:44       ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-06 17:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 2747 bytes --]

On Tue, Jun 06, 2023 at 11:15:37AM +0200, Roger Pau Monné wrote:
> On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> > Set "opened" to "0" before the hotplug script is called.  Once the
> > device node has been opened, set "opened" to "1".
> > 
> > "opened" is used exclusively by userspace.  It serves two purposes:
> > 
> > 1. It tells userspace that the diskseq Xenstore entry is supported.
> > 
> > 2. It tells userspace that it can wait for "opened" to be set to 1.
> >    Once "opened" is 1, blkback has a reference to the device, so
> >    userspace doesn't need to keep one.
> > 
> > Together, these changes allow userspace to use block devices with
> > delete-on-close behavior, such as loop devices with the autoclear flag
> > set or device-mapper devices with the deferred-remove flag set.
> 
> There was some work in the past to allow reloading blkback as a
> module, it's clear that using delete-on-close won't work if attempting
> to reload blkback.

Should blkback stop itself from being unloaded if delete-on-close is in
use?

> Isn't there some existing way to check whether a device is opened?
> (stat syscall maybe?).

Knowing that the device has been opened isn’t enough.  The block script
needs to be able to wait for blkback (and not something else) to open
the device.  Otherwise it will be confused if the device is opened by
e.g. udev.

> I would like to avoid adding more xenstore blkback state if such
> information can be fetched from other methods.

I don’t think it can be, unless the information is passed via a
completely different method.  Maybe netlink(7) or ioctl(2)?  Arguably
this information should not be stored in Xenstore at all, as it exposes
backend implementation details to the frontend.

> > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > index 9c3eb148fbd802c74e626c3d7bcd69dcb09bd921..519a78aa9073d1faa1dce5c1b36e95ae58da534b 100644
> > --- a/drivers/block/xen-blkback/xenbus.c
> > +++ b/drivers/block/xen-blkback/xenbus.c
> > @@ -699,6 +713,14 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
> >  	if (err)
> >  		pr_warn("%s write out 'max-ring-page-order' failed\n", __func__);
> >  
> > +	/*
> > +	 * This informs userspace that the "opened" node will be set to "1" when
> > +	 * the device has been opened successfully.
> > +	 */
> > +	err = xenbus_write(XBT_NIL, dev->nodename, "opened", "0");
> > +	if (err)
> > +		goto fail;
> > +
> 
> You would need to set "opened" before registering the xenstore backend
> watch AFAICT, or else it could be racy.

Will fix in the next version.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-06 17:01     ` Demi Marie Obenour
@ 2023-06-07  8:20       ` Roger Pau Monné
  2023-06-07 16:14         ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-07  8:20 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > This allows specifying a disk sequence number in XenStore.  If it does
> > > not match the disk sequence number of the underlying device, the device
> > > will not be exported and a warning will be logged.  Userspace can use
> > > this to eliminate race conditions due to major/minor number reuse.
> > > Old kernels do not support the new syntax, but a later patch will allow
> > > userspace to discover that the new syntax is supported.
> > > 
> > > Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> > > ---
> > >  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
> > >  1 file changed, 89 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > > index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> > > --- a/drivers/block/xen-blkback/xenbus.c
> > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > @@ -24,6 +24,7 @@ struct backend_info {
> > >  	struct xenbus_watch	backend_watch;
> > >  	unsigned		major;
> > >  	unsigned		minor;
> > > +	unsigned long long	diskseq;
> > 
> > Since diskseq is declared as u64 in gendisk, better use the same type
> > here too?
> 
> simple_strtoull() returns an unsigned long long, and C permits unsigned
> long long to be larger than 64 bits.

Right, but the type of gendisk is u64.  It's fine if you want to store
the result of simple_strtoull() into an unsigned long long and do
whatever checks to assert it matches the format expected by gendisk,
but ultimately the field type would better use u64 for consistency IMO.

> > > @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
> > >  		return;
> > >  	}
> > >  
> > > -	if (be->major | be->minor) {
> > > -		if (be->major != major || be->minor != minor)
> > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > -				be->major, be->minor, major, minor);
> > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > +	if (IS_ERR(diskseq_str)) {
> > > +		int err = PTR_ERR(diskseq_str);
> > > +		diskseq_str = NULL;
> > > +
> > > +		/*
> > > +		 * If this does not exist, it means legacy userspace that does not
> > > +		 * support diskseq.
> > > +		 */
> > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > +			return;
> > > +		}
> > > +		diskseq = 0;
> > > +	} else if (diskseq_len <= 0) {
> > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > +		goto fail;
> > > +	} else if (diskseq_len > 16) {
> > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > +				 diskseq_len);
> > > +		goto fail;
> > > +	} else if (diskseq_str[0] == '0') {
> > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > +		goto fail;
> > > +	} else {
> > > +		char *diskseq_end;
> > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > +			goto fail;
> > > +		}
> > > +		kfree(diskseq_str);
> > > +		diskseq_str = NULL;
> > > +	}
> > 
> > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> 
> xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> really should not.  Should this be fixed in xenbus_scanf()?

That would be my preference, so that you can use it here instead of
kind of open-coding it.

> > Also, we might want to fetch "physical-device" and "diskseq" inside
> > the same xenstore transaction.
> 
> Should the rest of the xenstore reads be included in the same
> transaction?

I guess it would make the code simpler to indeed fetch everything
inside the same transaction.

> > Also, you tie this logic to the "physical-device" watch, which
> > strictly implies that the "diskseq" node must be written to xenstore
> > before the "physical-device" node.  This seems fragile, but I don't
> > see much better optiono since the "diskseq" is optional.
> 
> What about including the diskseq in the "physical-device" node?  Perhaps
> use diskseq@major:minor syntax?

Hm, how would you know whether the blkback instance in the kernel
supports the diskseq syntax in physical-device?

Can you fetch a disk using a diskseq identifier?

Why I understand that this is an extra safety check in order to assert
blkback is opening the intended device, is this attempting to fix some
existing issue?

I'm not sure I see how the major:minor numbers would point to a
different device than the one specified by the toolstack unless the
admin explicitly messes with the devices before blkback has got time
to open them.  But then the admin can already do pretty much
everything it wants with the system.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-06-06 17:31     ` Demi Marie Obenour
@ 2023-06-07  8:44       ` Roger Pau Monné
  2023-06-07 16:29         ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-07  8:44 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, Jun 06, 2023 at 01:31:25PM -0400, Demi Marie Obenour wrote:
> On Tue, Jun 06, 2023 at 11:15:37AM +0200, Roger Pau Monné wrote:
> > On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> > > Set "opened" to "0" before the hotplug script is called.  Once the
> > > device node has been opened, set "opened" to "1".
> > > 
> > > "opened" is used exclusively by userspace.  It serves two purposes:
> > > 
> > > 1. It tells userspace that the diskseq Xenstore entry is supported.
> > > 
> > > 2. It tells userspace that it can wait for "opened" to be set to 1.
> > >    Once "opened" is 1, blkback has a reference to the device, so
> > >    userspace doesn't need to keep one.
> > > 
> > > Together, these changes allow userspace to use block devices with
> > > delete-on-close behavior, such as loop devices with the autoclear flag
> > > set or device-mapper devices with the deferred-remove flag set.
> > 
> > There was some work in the past to allow reloading blkback as a
> > module, it's clear that using delete-on-close won't work if attempting
> > to reload blkback.
> 
> Should blkback stop itself from being unloaded if delete-on-close is in
> use?

Hm, maybe.  I guess that's the best we can do right now.

> > Isn't there some existing way to check whether a device is opened?
> > (stat syscall maybe?).
> 
> Knowing that the device has been opened isn’t enough.  The block script
> needs to be able to wait for blkback (and not something else) to open
> the device.  Otherwise it will be confused if the device is opened by
> e.g. udev.

Urg, no, the block script cannot wait indefinitely for blkback to open
the device, as it has an execution timeout.  blkback is free to only
open the device upon guest frontend connection, and that (when using
libxl) requires the hotplug scripts execution to be finished so the
guest can be started.

> > I would like to avoid adding more xenstore blkback state if such
> > information can be fetched from other methods.
> 
> I don’t think it can be, unless the information is passed via a
> completely different method.  Maybe netlink(7) or ioctl(2)?  Arguably
> this information should not be stored in Xenstore at all, as it exposes
> backend implementation details to the frontend.

Could you maybe use sysfs for this information?

We have all sorts of crap in xenstore, but it would be best if we can
see of placing stuff like this in another interface.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-07  8:20       ` Roger Pau Monné
@ 2023-06-07 16:14         ` Demi Marie Obenour
  2023-06-08  8:29           ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-07 16:14 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 6522 bytes --]

On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > This allows specifying a disk sequence number in XenStore.  If it does
> > > > not match the disk sequence number of the underlying device, the device
> > > > will not be exported and a warning will be logged.  Userspace can use
> > > > this to eliminate race conditions due to major/minor number reuse.
> > > > Old kernels do not support the new syntax, but a later patch will allow
> > > > userspace to discover that the new syntax is supported.
> > > > 
> > > > Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> > > > ---
> > > >  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
> > > >  1 file changed, 89 insertions(+), 23 deletions(-)
> > > > 
> > > > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > > > index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> > > > --- a/drivers/block/xen-blkback/xenbus.c
> > > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > > @@ -24,6 +24,7 @@ struct backend_info {
> > > >  	struct xenbus_watch	backend_watch;
> > > >  	unsigned		major;
> > > >  	unsigned		minor;
> > > > +	unsigned long long	diskseq;
> > > 
> > > Since diskseq is declared as u64 in gendisk, better use the same type
> > > here too?
> > 
> > simple_strtoull() returns an unsigned long long, and C permits unsigned
> > long long to be larger than 64 bits.
> 
> Right, but the type of gendisk is u64.  It's fine if you want to store
> the result of simple_strtoull() into an unsigned long long and do
> whatever checks to assert it matches the format expected by gendisk,
> but ultimately the field type would better use u64 for consistency IMO.

I changed my mind on this, not least because the 16-byte length limit
means that the value is limited to UINT64_MAX anyway.

> > > > @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
> > > >  		return;
> > > >  	}
> > > >  
> > > > -	if (be->major | be->minor) {
> > > > -		if (be->major != major || be->minor != minor)
> > > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > > -				be->major, be->minor, major, minor);
> > > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > > +	if (IS_ERR(diskseq_str)) {
> > > > +		int err = PTR_ERR(diskseq_str);
> > > > +		diskseq_str = NULL;
> > > > +
> > > > +		/*
> > > > +		 * If this does not exist, it means legacy userspace that does not
> > > > +		 * support diskseq.
> > > > +		 */
> > > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > > +			return;
> > > > +		}
> > > > +		diskseq = 0;
> > > > +	} else if (diskseq_len <= 0) {
> > > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > > +		goto fail;
> > > > +	} else if (diskseq_len > 16) {
> > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > > +				 diskseq_len);
> > > > +		goto fail;
> > > > +	} else if (diskseq_str[0] == '0') {
> > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > > +		goto fail;
> > > > +	} else {
> > > > +		char *diskseq_end;
> > > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > > +			goto fail;
> > > > +		}
> > > > +		kfree(diskseq_str);
> > > > +		diskseq_str = NULL;
> > > > +	}
> > > 
> > > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> > 
> > xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> > really should not.  Should this be fixed in xenbus_scanf()?
> 
> That would be my preference, so that you can use it here instead of
> kind of open-coding it.

This winds up being a much more invasive patch as it requires changing
sscanf().  It also has a risk (probably mostly theoretical) of breaking
buggy userspace that passes garbage values here.

> > > Also, we might want to fetch "physical-device" and "diskseq" inside
> > > the same xenstore transaction.
> > 
> > Should the rest of the xenstore reads be included in the same
> > transaction?
> 
> I guess it would make the code simpler to indeed fetch everything
> inside the same transaction.

Okay, will change in v3.

> > > Also, you tie this logic to the "physical-device" watch, which
> > > strictly implies that the "diskseq" node must be written to xenstore
> > > before the "physical-device" node.  This seems fragile, but I don't
> > > see much better optiono since the "diskseq" is optional.
> > 
> > What about including the diskseq in the "physical-device" node?  Perhaps
> > use diskseq@major:minor syntax?
> 
> Hm, how would you know whether the blkback instance in the kernel
> supports the diskseq syntax in physical-device?

That’s what the next patch is for 🙂.

> Can you fetch a disk using a diskseq identifier?

Not yet, although I have considered adding this ability.  It would be
one step towards a “diskseqfs” that userspace could use to open a device
by diskseq.

> Why I understand that this is an extra safety check in order to assert
> blkback is opening the intended device, is this attempting to fix some
> existing issue?

Yes, it is.  I have a block script (written in C) that validates the
device it has opened before passing the information to blkback.  It uses
the diskseq to do this, but for that protection to be complete, blkback
must also be aware of it.

> I'm not sure I see how the major:minor numbers would point to a
> different device than the one specified by the toolstack unless the
> admin explicitly messes with the devices before blkback has got time
> to open them.  But then the admin can already do pretty much
> everything it wants with the system.

Admins typically refer to e.g. device-mapper devices by name, not by
major:minor number.  If a device is destroyed and recreated right as the
block script is running, this race condition can occur.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-06-07  8:44       ` Roger Pau Monné
@ 2023-06-07 16:29         ` Demi Marie Obenour
  2023-06-08  9:11           ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-07 16:29 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 3076 bytes --]

On Wed, Jun 07, 2023 at 10:44:48AM +0200, Roger Pau Monné wrote:
> On Tue, Jun 06, 2023 at 01:31:25PM -0400, Demi Marie Obenour wrote:
> > On Tue, Jun 06, 2023 at 11:15:37AM +0200, Roger Pau Monné wrote:
> > > On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> > > > Set "opened" to "0" before the hotplug script is called.  Once the
> > > > device node has been opened, set "opened" to "1".
> > > > 
> > > > "opened" is used exclusively by userspace.  It serves two purposes:
> > > > 
> > > > 1. It tells userspace that the diskseq Xenstore entry is supported.
> > > > 
> > > > 2. It tells userspace that it can wait for "opened" to be set to 1.
> > > >    Once "opened" is 1, blkback has a reference to the device, so
> > > >    userspace doesn't need to keep one.
> > > > 
> > > > Together, these changes allow userspace to use block devices with
> > > > delete-on-close behavior, such as loop devices with the autoclear flag
> > > > set or device-mapper devices with the deferred-remove flag set.
> > > 
> > > There was some work in the past to allow reloading blkback as a
> > > module, it's clear that using delete-on-close won't work if attempting
> > > to reload blkback.
> > 
> > Should blkback stop itself from being unloaded if delete-on-close is in
> > use?
> 
> Hm, maybe.  I guess that's the best we can do right now.

I’ll implement this.

> > > Isn't there some existing way to check whether a device is opened?
> > > (stat syscall maybe?).
> > 
> > Knowing that the device has been opened isn’t enough.  The block script
> > needs to be able to wait for blkback (and not something else) to open
> > the device.  Otherwise it will be confused if the device is opened by
> > e.g. udev.
> 
> Urg, no, the block script cannot wait indefinitely for blkback to open
> the device, as it has an execution timeout.  blkback is free to only
> open the device upon guest frontend connection, and that (when using
> libxl) requires the hotplug scripts execution to be finished so the
> guest can be started.

I’m a bit confused here.  My understanding is that blkdev_get_by_dev()
already opens the device, and that happens in the xenstore watch
handler.  I have tested this with delete-on-close device-mapper devices,
and it does work.

> > > I would like to avoid adding more xenstore blkback state if such
> > > information can be fetched from other methods.
> > 
> > I don’t think it can be, unless the information is passed via a
> > completely different method.  Maybe netlink(7) or ioctl(2)?  Arguably
> > this information should not be stored in Xenstore at all, as it exposes
> > backend implementation details to the frontend.
> 
> Could you maybe use sysfs for this information?

Probably?  This would involve adding a new file in sysfs.

> We have all sorts of crap in xenstore, but it would be best if we can
> see of placing stuff like this in another interface.

Fair.

> Thanks, Roger.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-07 16:14         ` Demi Marie Obenour
@ 2023-06-08  8:29           ` Roger Pau Monné
  2023-06-08 15:33             ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-08  8:29 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > -	if (be->major | be->minor) {
> > > > > -		if (be->major != major || be->minor != minor)
> > > > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > > > -				be->major, be->minor, major, minor);
> > > > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > > > +	if (IS_ERR(diskseq_str)) {
> > > > > +		int err = PTR_ERR(diskseq_str);
> > > > > +		diskseq_str = NULL;
> > > > > +
> > > > > +		/*
> > > > > +		 * If this does not exist, it means legacy userspace that does not
> > > > > +		 * support diskseq.
> > > > > +		 */
> > > > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > > > +			return;
> > > > > +		}
> > > > > +		diskseq = 0;
> > > > > +	} else if (diskseq_len <= 0) {
> > > > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > > > +		goto fail;
> > > > > +	} else if (diskseq_len > 16) {
> > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > > > +				 diskseq_len);
> > > > > +		goto fail;
> > > > > +	} else if (diskseq_str[0] == '0') {
> > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > > > +		goto fail;
> > > > > +	} else {
> > > > > +		char *diskseq_end;
> > > > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > > > +			goto fail;
> > > > > +		}
> > > > > +		kfree(diskseq_str);
> > > > > +		diskseq_str = NULL;
> > > > > +	}
> > > > 
> > > > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> > > 
> > > xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> > > really should not.  Should this be fixed in xenbus_scanf()?
> > 
> > That would be my preference, so that you can use it here instead of
> > kind of open-coding it.
> 
> This winds up being a much more invasive patch as it requires changing
> sscanf().  It also has a risk (probably mostly theoretical) of breaking
> buggy userspace that passes garbage values here.

Well, if the current function is not suitable for your purposes it
would be better to fix it rather than open-code what you need.  Mostly
because further usages would then also need to open-code whatever
required.

> > > > Also, you tie this logic to the "physical-device" watch, which
> > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > see much better optiono since the "diskseq" is optional.
> > > 
> > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > use diskseq@major:minor syntax?
> > 
> > Hm, how would you know whether the blkback instance in the kernel
> > supports the diskseq syntax in physical-device?
> 
> That’s what the next patch is for 🙂.

Hm, I think we should separate diskseq support from the notify open
stuff: it's possible a different (non-Linux) backend wants to
implement open notify support but doesn't have diskseq.

> > Can you fetch a disk using a diskseq identifier?
> 
> Not yet, although I have considered adding this ability.  It would be
> one step towards a “diskseqfs” that userspace could use to open a device
> by diskseq.
> 
> > Why I understand that this is an extra safety check in order to assert
> > blkback is opening the intended device, is this attempting to fix some
> > existing issue?
> 
> Yes, it is.  I have a block script (written in C) that validates the
> device it has opened before passing the information to blkback.  It uses
> the diskseq to do this, but for that protection to be complete, blkback
> must also be aware of it.

But if your block script opens the device, and keeps it open until
blkback has also taken a reference to it, there's no way such device
could be removed and recreated in the window you point out above, as
there's always a reference on it taken?

> > I'm not sure I see how the major:minor numbers would point to a
> > different device than the one specified by the toolstack unless the
> > admin explicitly messes with the devices before blkback has got time
> > to open them.  But then the admin can already do pretty much
> > everything it wants with the system.
> 
> Admins typically refer to e.g. device-mapper devices by name, not by
> major:minor number.  If a device is destroyed and recreated right as the
> block script is running, this race condition can occur.

Right, but what about this device recreation happening after the admin
has written the guest config file but before the call to (lib)xl
happens?  blkback would also end up using a different device than
indented, and your proposed approach doesn't fix this.  The only way to
solve this would be to reference devices by UUID (iow: diskseq)
directly in the guest config file.

Then the block script will open the device by diskseq and pass the
major:minor numbers to blkback.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-06-07 16:29         ` Demi Marie Obenour
@ 2023-06-08  9:11           ` Roger Pau Monné
  2023-06-08 15:23             ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-08  9:11 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Wed, Jun 07, 2023 at 12:29:26PM -0400, Demi Marie Obenour wrote:
> On Wed, Jun 07, 2023 at 10:44:48AM +0200, Roger Pau Monné wrote:
> > On Tue, Jun 06, 2023 at 01:31:25PM -0400, Demi Marie Obenour wrote:
> > > On Tue, Jun 06, 2023 at 11:15:37AM +0200, Roger Pau Monné wrote:
> > > > On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> > > > > Set "opened" to "0" before the hotplug script is called.  Once the
> > > > > device node has been opened, set "opened" to "1".
> > > > > 
> > > > > "opened" is used exclusively by userspace.  It serves two purposes:
> > > > > 
> > > > > 1. It tells userspace that the diskseq Xenstore entry is supported.
> > > > > 
> > > > > 2. It tells userspace that it can wait for "opened" to be set to 1.
> > > > >    Once "opened" is 1, blkback has a reference to the device, so
> > > > >    userspace doesn't need to keep one.
> > > > > 
> > > > > Together, these changes allow userspace to use block devices with
> > > > > delete-on-close behavior, such as loop devices with the autoclear flag
> > > > > set or device-mapper devices with the deferred-remove flag set.
> > > > 
> > > > There was some work in the past to allow reloading blkback as a
> > > > module, it's clear that using delete-on-close won't work if attempting
> > > > to reload blkback.
> > > 
> > > Should blkback stop itself from being unloaded if delete-on-close is in
> > > use?
> > 
> > Hm, maybe.  I guess that's the best we can do right now.
> 
> I’ll implement this.

Let's make this a separate patch.

> > > > Isn't there some existing way to check whether a device is opened?
> > > > (stat syscall maybe?).
> > > 
> > > Knowing that the device has been opened isn’t enough.  The block script
> > > needs to be able to wait for blkback (and not something else) to open
> > > the device.  Otherwise it will be confused if the device is opened by
> > > e.g. udev.
> > 
> > Urg, no, the block script cannot wait indefinitely for blkback to open
> > the device, as it has an execution timeout.  blkback is free to only
> > open the device upon guest frontend connection, and that (when using
> > libxl) requires the hotplug scripts execution to be finished so the
> > guest can be started.
> 
> I’m a bit confused here.  My understanding is that blkdev_get_by_dev()
> already opens the device, and that happens in the xenstore watch
> handler.  I have tested this with delete-on-close device-mapper devices,
> and it does work.

Right, but on a very contended system there's no guarantee of when
blkback will pick up the update to "physical-device" and open the
device, so far the block script only writes the physical-device node
and exits.  With the proposed change the block script will also wait
for blkback to react to the physcal-device write, hence making VM
creation slower.

> > > > I would like to avoid adding more xenstore blkback state if such
> > > > information can be fetched from other methods.
> > > 
> > > I don’t think it can be, unless the information is passed via a
> > > completely different method.  Maybe netlink(7) or ioctl(2)?  Arguably
> > > this information should not be stored in Xenstore at all, as it exposes
> > > backend implementation details to the frontend.
> > 
> > Could you maybe use sysfs for this information?
> 
> Probably?  This would involve adding a new file in sysfs.
> 
> > We have all sorts of crap in xenstore, but it would be best if we can
> > see of placing stuff like this in another interface.
> 
> Fair.

Let's see if that's a suitable approach, and we can avoid having to
add an extra node to xenstore.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-05-30 20:31 ` [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened Demi Marie Obenour
  2023-06-06  9:15   ` Roger Pau Monné
@ 2023-06-08 10:08   ` Roger Pau Monné
  2023-06-08 15:24     ` Demi Marie Obenour
  1 sibling, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-08 10:08 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> Set "opened" to "0" before the hotplug script is called.  Once the
> device node has been opened, set "opened" to "1".
> 
> "opened" is used exclusively by userspace.  It serves two purposes:
> 
> 1. It tells userspace that the diskseq Xenstore entry is supported.
> 
> 2. It tells userspace that it can wait for "opened" to be set to 1.
>    Once "opened" is 1, blkback has a reference to the device, so
>    userspace doesn't need to keep one.
> 
> Together, these changes allow userspace to use block devices with
> delete-on-close behavior, such as loop devices with the autoclear flag
> set or device-mapper devices with the deferred-remove flag set.

Now that I think a bit more about this, how are you planning to handle
reboot with such devices?  It's fine for loop (because those get
instantiated by the block script), but likely not with other block
devices, as on reboot the toolstack will find the block device is
gone.

I guess the delete-on-close is only intended to be used for loop
devices? (or in general block devices that are instantiated by the
block script itself)

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-06-08  9:11           ` Roger Pau Monné
@ 2023-06-08 15:23             ` Demi Marie Obenour
  0 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-08 15:23 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 4967 bytes --]

On Thu, Jun 08, 2023 at 11:11:44AM +0200, Roger Pau Monné wrote:
> On Wed, Jun 07, 2023 at 12:29:26PM -0400, Demi Marie Obenour wrote:
> > On Wed, Jun 07, 2023 at 10:44:48AM +0200, Roger Pau Monné wrote:
> > > On Tue, Jun 06, 2023 at 01:31:25PM -0400, Demi Marie Obenour wrote:
> > > > On Tue, Jun 06, 2023 at 11:15:37AM +0200, Roger Pau Monné wrote:
> > > > > On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> > > > > > Set "opened" to "0" before the hotplug script is called.  Once the
> > > > > > device node has been opened, set "opened" to "1".
> > > > > > 
> > > > > > "opened" is used exclusively by userspace.  It serves two purposes:
> > > > > > 
> > > > > > 1. It tells userspace that the diskseq Xenstore entry is supported.
> > > > > > 
> > > > > > 2. It tells userspace that it can wait for "opened" to be set to 1.
> > > > > >    Once "opened" is 1, blkback has a reference to the device, so
> > > > > >    userspace doesn't need to keep one.
> > > > > > 
> > > > > > Together, these changes allow userspace to use block devices with
> > > > > > delete-on-close behavior, such as loop devices with the autoclear flag
> > > > > > set or device-mapper devices with the deferred-remove flag set.
> > > > > 
> > > > > There was some work in the past to allow reloading blkback as a
> > > > > module, it's clear that using delete-on-close won't work if attempting
> > > > > to reload blkback.
> > > > 
> > > > Should blkback stop itself from being unloaded if delete-on-close is in
> > > > use?
> > > 
> > > Hm, maybe.  I guess that's the best we can do right now.
> > 
> > I’ll implement this.
> 
> Let's make this a separate patch.

Good idea.

> > > > > Isn't there some existing way to check whether a device is opened?
> > > > > (stat syscall maybe?).
> > > > 
> > > > Knowing that the device has been opened isn’t enough.  The block script
> > > > needs to be able to wait for blkback (and not something else) to open
> > > > the device.  Otherwise it will be confused if the device is opened by
> > > > e.g. udev.
> > > 
> > > Urg, no, the block script cannot wait indefinitely for blkback to open
> > > the device, as it has an execution timeout.  blkback is free to only
> > > open the device upon guest frontend connection, and that (when using
> > > libxl) requires the hotplug scripts execution to be finished so the
> > > guest can be started.
> > 
> > I’m a bit confused here.  My understanding is that blkdev_get_by_dev()
> > already opens the device, and that happens in the xenstore watch
> > handler.  I have tested this with delete-on-close device-mapper devices,
> > and it does work.
> 
> Right, but on a very contended system there's no guarantee of when
> blkback will pick up the update to "physical-device" and open the
> device, so far the block script only writes the physical-device node
> and exits.  With the proposed change the block script will also wait
> for blkback to react to the physcal-device write, hence making VM
> creation slower.

Only block scripts that choose to wait for device open suffer
this performance penalty.  My current plan is to only do so for
delete-on-close devices which are managed by the block script
itself.  Other devices will not suffer a performance hit.

In the long term, I would like to solve this problem entirely by using
an ioctl to configure blkback.  The ioctl would take a file descriptor
argument, avoiding the need for a round-trip through xenstore.  This
also solves a security annoyance with the current design, which is that
the device is opened by a kernel thread and so the security context of
whoever requested the device to be opened is lost.

> > > > > I would like to avoid adding more xenstore blkback state if such
> > > > > information can be fetched from other methods.
> > > > 
> > > > I don’t think it can be, unless the information is passed via a
> > > > completely different method.  Maybe netlink(7) or ioctl(2)?  Arguably
> > > > this information should not be stored in Xenstore at all, as it exposes
> > > > backend implementation details to the frontend.
> > > 
> > > Could you maybe use sysfs for this information?
> > 
> > Probably?  This would involve adding a new file in sysfs.
> > 
> > > We have all sorts of crap in xenstore, but it would be best if we can
> > > see of placing stuff like this in another interface.
> > 
> > Fair.
> 
> Let's see if that's a suitable approach, and we can avoid having to
> add an extra node to xenstore.

I thought about this some more and realized that in Qubes OS, we might
want to include the diskseq in the information dom0 gets about each
exported block device.  This would allow dom0 to write the xenstore node
itself, but it would require some way for dom0 to be informed about
blkback having this feature.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened
  2023-06-08 10:08   ` Roger Pau Monné
@ 2023-06-08 15:24     ` Demi Marie Obenour
  0 siblings, 0 replies; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-08 15:24 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 1401 bytes --]

On Thu, Jun 08, 2023 at 12:08:55PM +0200, Roger Pau Monné wrote:
> On Tue, May 30, 2023 at 04:31:16PM -0400, Demi Marie Obenour wrote:
> > Set "opened" to "0" before the hotplug script is called.  Once the
> > device node has been opened, set "opened" to "1".
> > 
> > "opened" is used exclusively by userspace.  It serves two purposes:
> > 
> > 1. It tells userspace that the diskseq Xenstore entry is supported.
> > 
> > 2. It tells userspace that it can wait for "opened" to be set to 1.
> >    Once "opened" is 1, blkback has a reference to the device, so
> >    userspace doesn't need to keep one.
> > 
> > Together, these changes allow userspace to use block devices with
> > delete-on-close behavior, such as loop devices with the autoclear flag
> > set or device-mapper devices with the deferred-remove flag set.
> 
> Now that I think a bit more about this, how are you planning to handle
> reboot with such devices?  It's fine for loop (because those get
> instantiated by the block script), but likely not with other block
> devices, as on reboot the toolstack will find the block device is
> gone.
> 
> I guess the delete-on-close is only intended to be used for loop
> devices? (or in general block devices that are instantiated by the
> block script itself)

You understand correctly.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-08  8:29           ` Roger Pau Monné
@ 2023-06-08 15:33             ` Demi Marie Obenour
  2023-06-09 15:13               ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-08 15:33 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 6816 bytes --]

On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > > -	if (be->major | be->minor) {
> > > > > > -		if (be->major != major || be->minor != minor)
> > > > > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > > > > -				be->major, be->minor, major, minor);
> > > > > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > > > > +	if (IS_ERR(diskseq_str)) {
> > > > > > +		int err = PTR_ERR(diskseq_str);
> > > > > > +		diskseq_str = NULL;
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * If this does not exist, it means legacy userspace that does not
> > > > > > +		 * support diskseq.
> > > > > > +		 */
> > > > > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > > > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > > > > +			return;
> > > > > > +		}
> > > > > > +		diskseq = 0;
> > > > > > +	} else if (diskseq_len <= 0) {
> > > > > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > > > > +		goto fail;
> > > > > > +	} else if (diskseq_len > 16) {
> > > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > > > > +				 diskseq_len);
> > > > > > +		goto fail;
> > > > > > +	} else if (diskseq_str[0] == '0') {
> > > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > > > > +		goto fail;
> > > > > > +	} else {
> > > > > > +		char *diskseq_end;
> > > > > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > > > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > > > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > > > > +			goto fail;
> > > > > > +		}
> > > > > > +		kfree(diskseq_str);
> > > > > > +		diskseq_str = NULL;
> > > > > > +	}
> > > > > 
> > > > > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> > > > 
> > > > xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> > > > really should not.  Should this be fixed in xenbus_scanf()?
> > > 
> > > That would be my preference, so that you can use it here instead of
> > > kind of open-coding it.
> > 
> > This winds up being a much more invasive patch as it requires changing
> > sscanf().  It also has a risk (probably mostly theoretical) of breaking
> > buggy userspace that passes garbage values here.
> 
> Well, if the current function is not suitable for your purposes it
> would be better to fix it rather than open-code what you need.  Mostly
> because further usages would then also need to open-code whatever
> required.

That is fair.

> > > > > Also, you tie this logic to the "physical-device" watch, which
> > > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > > see much better optiono since the "diskseq" is optional.
> > > > 
> > > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > > use diskseq@major:minor syntax?
> > > 
> > > Hm, how would you know whether the blkback instance in the kernel
> > > supports the diskseq syntax in physical-device?
> > 
> > That’s what the next patch is for 🙂.
> 
> Hm, I think we should separate diskseq support from the notify open
> stuff: it's possible a different (non-Linux) backend wants to
> implement open notify support but doesn't have diskseq.

I like this idea!  What about having blkback set diskseq to zero?
Userspace could then replace it with the actual value.

> > > Can you fetch a disk using a diskseq identifier?
> > 
> > Not yet, although I have considered adding this ability.  It would be
> > one step towards a “diskseqfs” that userspace could use to open a device
> > by diskseq.
> > 
> > > Why I understand that this is an extra safety check in order to assert
> > > blkback is opening the intended device, is this attempting to fix some
> > > existing issue?
> > 
> > Yes, it is.  I have a block script (written in C) that validates the
> > device it has opened before passing the information to blkback.  It uses
> > the diskseq to do this, but for that protection to be complete, blkback
> > must also be aware of it.
> 
> But if your block script opens the device, and keeps it open until
> blkback has also taken a reference to it, there's no way such device
> could be removed and recreated in the window you point out above, as
> there's always a reference on it taken?

This assumes that the block script is not killed in the meantime,
which is not a safe assumption due to timeouts and the OOM killer.

> > > I'm not sure I see how the major:minor numbers would point to a
> > > different device than the one specified by the toolstack unless the
> > > admin explicitly messes with the devices before blkback has got time
> > > to open them.  But then the admin can already do pretty much
> > > everything it wants with the system.
> > 
> > Admins typically refer to e.g. device-mapper devices by name, not by
> > major:minor number.  If a device is destroyed and recreated right as the
> > block script is running, this race condition can occur.
> 
> Right, but what about this device recreation happening after the admin
> has written the guest config file but before the call to (lib)xl
> happens?  blkback would also end up using a different device than
> indented, and your proposed approach doesn't fix this.  The only way to
> solve this would be to reference devices by UUID (iow: diskseq)
> directly in the guest config file.

That would be a good idea, but it is orthogonal to this patch.  My
script opens the device and uses various means to check that it did
open the correct device.  It then passes the diskseq to blkback.

> Then the block script will open the device by diskseq and pass the
> major:minor numbers to blkback.

Alternatively, the toolstack could write both the diskseq and
major:minor numbers and be confident that it is referring to the
correct device, no matter how long ago it got that information.
This could be quite useful for e.g. one VM exporting a device to
another VM by calling losetup(8) and expecting a human to make a
decision based on various properties about the device.  In this
case there is no upper bound on the race window.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-08 15:33             ` Demi Marie Obenour
@ 2023-06-09 15:13               ` Roger Pau Monné
  2023-06-09 16:55                 ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-09 15:13 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > > Also, you tie this logic to the "physical-device" watch, which
> > > > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > > > see much better optiono since the "diskseq" is optional.
> > > > > 
> > > > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > > > use diskseq@major:minor syntax?
> > > > 
> > > > Hm, how would you know whether the blkback instance in the kernel
> > > > supports the diskseq syntax in physical-device?
> > > 
> > > That’s what the next patch is for 🙂.
> > 
> > Hm, I think we should separate diskseq support from the notify open
> > stuff: it's possible a different (non-Linux) backend wants to
> > implement open notify support but doesn't have diskseq.
> 
> I like this idea!  What about having blkback set diskseq to zero?
> Userspace could then replace it with the actual value.

I think it would be better if we used a sysfs node, because blkfront
has no business is knowing whether diskseq is supported by the
backend or not.

xenstore should be reserved to features exposed between blkfront and
blkback if possible.  I know we haven't been very good at this
however.

> > > > Can you fetch a disk using a diskseq identifier?
> > > 
> > > Not yet, although I have considered adding this ability.  It would be
> > > one step towards a “diskseqfs” that userspace could use to open a device
> > > by diskseq.
> > > 
> > > > Why I understand that this is an extra safety check in order to assert
> > > > blkback is opening the intended device, is this attempting to fix some
> > > > existing issue?
> > > 
> > > Yes, it is.  I have a block script (written in C) that validates the
> > > device it has opened before passing the information to blkback.  It uses
> > > the diskseq to do this, but for that protection to be complete, blkback
> > > must also be aware of it.
> > 
> > But if your block script opens the device, and keeps it open until
> > blkback has also taken a reference to it, there's no way such device
> > could be removed and recreated in the window you point out above, as
> > there's always a reference on it taken?
> 
> This assumes that the block script is not killed in the meantime,
> which is not a safe assumption due to timeouts and the OOM killer.

Doesn't seem very reliable to use with delete-on-close either then.

> > > > I'm not sure I see how the major:minor numbers would point to a
> > > > different device than the one specified by the toolstack unless the
> > > > admin explicitly messes with the devices before blkback has got time
> > > > to open them.  But then the admin can already do pretty much
> > > > everything it wants with the system.
> > > 
> > > Admins typically refer to e.g. device-mapper devices by name, not by
> > > major:minor number.  If a device is destroyed and recreated right as the
> > > block script is running, this race condition can occur.
> > 
> > Right, but what about this device recreation happening after the admin
> > has written the guest config file but before the call to (lib)xl
> > happens?  blkback would also end up using a different device than
> > indented, and your proposed approach doesn't fix this.  The only way to
> > solve this would be to reference devices by UUID (iow: diskseq)
> > directly in the guest config file.
> 
> That would be a good idea, but it is orthogonal to this patch.  My
> script opens the device and uses various means to check that it did
> open the correct device.  It then passes the diskseq to blkback.

How you do this with losetup?  I guess there's an atomic way to setup
a loop device and get its diskseq?

> > Then the block script will open the device by diskseq and pass the
> > major:minor numbers to blkback.
> 
> Alternatively, the toolstack could write both the diskseq and
> major:minor numbers and be confident that it is referring to the
> correct device, no matter how long ago it got that information.
> This could be quite useful for e.g. one VM exporting a device to
> another VM by calling losetup(8) and expecting a human to make a
> decision based on various properties about the device.  In this
> case there is no upper bound on the race window.

Instead of playing with xenstore nodes, it might be better to simply
have blkback export on sysfs the diskseq of the opened device, and let
the block script check whether that's correct or not.  That implies
less code in the kernel side, and doesn't pollute xenstore.

Regards, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-09 15:13               ` Roger Pau Monné
@ 2023-06-09 16:55                 ` Demi Marie Obenour
  2023-06-12  8:09                   ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-09 16:55 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 6178 bytes --]

On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > > > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > > > Also, you tie this logic to the "physical-device" watch, which
> > > > > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > > > > see much better optiono since the "diskseq" is optional.
> > > > > > 
> > > > > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > > > > use diskseq@major:minor syntax?
> > > > > 
> > > > > Hm, how would you know whether the blkback instance in the kernel
> > > > > supports the diskseq syntax in physical-device?
> > > > 
> > > > That’s what the next patch is for 🙂.
> > > 
> > > Hm, I think we should separate diskseq support from the notify open
> > > stuff: it's possible a different (non-Linux) backend wants to
> > > implement open notify support but doesn't have diskseq.
> > 
> > I like this idea!  What about having blkback set diskseq to zero?
> > Userspace could then replace it with the actual value.
> 
> I think it would be better if we used a sysfs node, because blkfront
> has no business is knowing whether diskseq is supported by the
> backend or not.
> 
> xenstore should be reserved to features exposed between blkfront and
> blkback if possible.  I know we haven't been very good at this
> however.
> 
> > > > > Can you fetch a disk using a diskseq identifier?
> > > > 
> > > > Not yet, although I have considered adding this ability.  It would be
> > > > one step towards a “diskseqfs” that userspace could use to open a device
> > > > by diskseq.
> > > > 
> > > > > Why I understand that this is an extra safety check in order to assert
> > > > > blkback is opening the intended device, is this attempting to fix some
> > > > > existing issue?
> > > > 
> > > > Yes, it is.  I have a block script (written in C) that validates the
> > > > device it has opened before passing the information to blkback.  It uses
> > > > the diskseq to do this, but for that protection to be complete, blkback
> > > > must also be aware of it.
> > > 
> > > But if your block script opens the device, and keeps it open until
> > > blkback has also taken a reference to it, there's no way such device
> > > could be removed and recreated in the window you point out above, as
> > > there's always a reference on it taken?
> > 
> > This assumes that the block script is not killed in the meantime,
> > which is not a safe assumption due to timeouts and the OOM killer.
> 
> Doesn't seem very reliable to use with delete-on-close either then.

That’s actually the purpose of delete-on-close!  It ensures that if the
block script gets killed, the device is automatically cleaned up.

> > > > > I'm not sure I see how the major:minor numbers would point to a
> > > > > different device than the one specified by the toolstack unless the
> > > > > admin explicitly messes with the devices before blkback has got time
> > > > > to open them.  But then the admin can already do pretty much
> > > > > everything it wants with the system.
> > > > 
> > > > Admins typically refer to e.g. device-mapper devices by name, not by
> > > > major:minor number.  If a device is destroyed and recreated right as the
> > > > block script is running, this race condition can occur.
> > > 
> > > Right, but what about this device recreation happening after the admin
> > > has written the guest config file but before the call to (lib)xl
> > > happens?  blkback would also end up using a different device than
> > > indented, and your proposed approach doesn't fix this.  The only way to
> > > solve this would be to reference devices by UUID (iow: diskseq)
> > > directly in the guest config file.
> > 
> > That would be a good idea, but it is orthogonal to this patch.  My
> > script opens the device and uses various means to check that it did
> > open the correct device.  It then passes the diskseq to blkback.
> 
> How you do this with losetup?  I guess there's an atomic way to setup
> a loop device and get its diskseq?

It can’t be done with losetup.  I use a C program that directly
issues ioctls to /dev/loop-control and /dev/loop*.  Doing this with
device-mapper requires kernel patches that have been submitted but are
not yet upstream.

> > > Then the block script will open the device by diskseq and pass the
> > > major:minor numbers to blkback.
> > 
> > Alternatively, the toolstack could write both the diskseq and
> > major:minor numbers and be confident that it is referring to the
> > correct device, no matter how long ago it got that information.
> > This could be quite useful for e.g. one VM exporting a device to
> > another VM by calling losetup(8) and expecting a human to make a
> > decision based on various properties about the device.  In this
> > case there is no upper bound on the race window.
> 
> Instead of playing with xenstore nodes, it might be better to simply
> have blkback export on sysfs the diskseq of the opened device, and let
> the block script check whether that's correct or not.  That implies
> less code in the kernel side, and doesn't pollute xenstore.

This would require that blkback delay exposing the device to the
frontend until the block script has checked that the diskseq is correct.
Much simpler for the block script to provide the diskseq in xenstore.
If you want to avoid an extra xenstore node, I can make the diskseq part
of the physical-device node.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-09 16:55                 ` Demi Marie Obenour
@ 2023-06-12  8:09                   ` Roger Pau Monné
  2023-06-21  1:14                     ` Demi Marie Obenour
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-12  8:09 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Fri, Jun 09, 2023 at 12:55:39PM -0400, Demi Marie Obenour wrote:
> On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> > On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > > Can you fetch a disk using a diskseq identifier?
> > > > > 
> > > > > Not yet, although I have considered adding this ability.  It would be
> > > > > one step towards a “diskseqfs” that userspace could use to open a device
> > > > > by diskseq.
> > > > > 
> > > > > > Why I understand that this is an extra safety check in order to assert
> > > > > > blkback is opening the intended device, is this attempting to fix some
> > > > > > existing issue?
> > > > > 
> > > > > Yes, it is.  I have a block script (written in C) that validates the
> > > > > device it has opened before passing the information to blkback.  It uses
> > > > > the diskseq to do this, but for that protection to be complete, blkback
> > > > > must also be aware of it.
> > > > 
> > > > But if your block script opens the device, and keeps it open until
> > > > blkback has also taken a reference to it, there's no way such device
> > > > could be removed and recreated in the window you point out above, as
> > > > there's always a reference on it taken?
> > > 
> > > This assumes that the block script is not killed in the meantime,
> > > which is not a safe assumption due to timeouts and the OOM killer.
> > 
> > Doesn't seem very reliable to use with delete-on-close either then.
> 
> That’s actually the purpose of delete-on-close!  It ensures that if the
> block script gets killed, the device is automatically cleaned up.

Block script attach getting killed shouldn't prevent the toolstack
from performing domain destruction, and thus removing the stale block
device.

OTOH if your toolstack gets killed then there's not much that can be
done, and the system will need intervention in order to get back into
a sane state.

Hitting OOM in your control domain however is unlikely to be handled
gracefully, even with delete-on-close.

> > > > Then the block script will open the device by diskseq and pass the
> > > > major:minor numbers to blkback.
> > > 
> > > Alternatively, the toolstack could write both the diskseq and
> > > major:minor numbers and be confident that it is referring to the
> > > correct device, no matter how long ago it got that information.
> > > This could be quite useful for e.g. one VM exporting a device to
> > > another VM by calling losetup(8) and expecting a human to make a
> > > decision based on various properties about the device.  In this
> > > case there is no upper bound on the race window.
> > 
> > Instead of playing with xenstore nodes, it might be better to simply
> > have blkback export on sysfs the diskseq of the opened device, and let
> > the block script check whether that's correct or not.  That implies
> > less code in the kernel side, and doesn't pollute xenstore.
> 
> This would require that blkback delay exposing the device to the
> frontend until the block script has checked that the diskseq is correct.

This depends on your toolstack implementation.  libxl won't start the
domain until block scripts have finished execution, and hence the
block script waiting for the sysfs node to appear and check it against
the expected value would be enough.

> Much simpler for the block script to provide the diskseq in xenstore.
> If you want to avoid an extra xenstore node, I can make the diskseq part
> of the physical-device node.

I'm thinking that we might want to introduce a "physical-device-uuid"
node and use that to provide the diskseq to the backened.  Toolstacks
(or block scripts) would need to be sure the "physical-device-uuid"
node is populated before setting "physical-device", as writes to
that node would still trigger blkback watch.  I think using two
distinct watches would just make the logic in blkback too
complicated.

My preference would be for the kernel to have a function for opening a
device identified by a diskseq (as fetched from
"physical-device-uuid"), so that we don't have to open using
major:minor and then check the diskseq.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-12  8:09                   ` Roger Pau Monné
@ 2023-06-21  1:14                     ` Demi Marie Obenour
  2023-06-21 10:07                       ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Demi Marie Obenour @ 2023-06-21  1:14 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 5216 bytes --]

On Mon, Jun 12, 2023 at 10:09:39AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 09, 2023 at 12:55:39PM -0400, Demi Marie Obenour wrote:
> > On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> > > On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > > > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > > > Can you fetch a disk using a diskseq identifier?
> > > > > > 
> > > > > > Not yet, although I have considered adding this ability.  It would be
> > > > > > one step towards a “diskseqfs” that userspace could use to open a device
> > > > > > by diskseq.
> > > > > > 
> > > > > > > Why I understand that this is an extra safety check in order to assert
> > > > > > > blkback is opening the intended device, is this attempting to fix some
> > > > > > > existing issue?
> > > > > > 
> > > > > > Yes, it is.  I have a block script (written in C) that validates the
> > > > > > device it has opened before passing the information to blkback.  It uses
> > > > > > the diskseq to do this, but for that protection to be complete, blkback
> > > > > > must also be aware of it.
> > > > > 
> > > > > But if your block script opens the device, and keeps it open until
> > > > > blkback has also taken a reference to it, there's no way such device
> > > > > could be removed and recreated in the window you point out above, as
> > > > > there's always a reference on it taken?
> > > > 
> > > > This assumes that the block script is not killed in the meantime,
> > > > which is not a safe assumption due to timeouts and the OOM killer.
> > > 
> > > Doesn't seem very reliable to use with delete-on-close either then.
> > 
> > That’s actually the purpose of delete-on-close!  It ensures that if the
> > block script gets killed, the device is automatically cleaned up.
> 
> Block script attach getting killed shouldn't prevent the toolstack
> from performing domain destruction, and thus removing the stale block
> device.
> 
> OTOH if your toolstack gets killed then there's not much that can be
> done, and the system will need intervention in order to get back into
> a sane state.
> 
> Hitting OOM in your control domain however is unlikely to be handled
> gracefully, even with delete-on-close.

I agree, _but_ we should not make it any harder than necessary.

> > > > > Then the block script will open the device by diskseq and pass the
> > > > > major:minor numbers to blkback.
> > > > 
> > > > Alternatively, the toolstack could write both the diskseq and
> > > > major:minor numbers and be confident that it is referring to the
> > > > correct device, no matter how long ago it got that information.
> > > > This could be quite useful for e.g. one VM exporting a device to
> > > > another VM by calling losetup(8) and expecting a human to make a
> > > > decision based on various properties about the device.  In this
> > > > case there is no upper bound on the race window.
> > > 
> > > Instead of playing with xenstore nodes, it might be better to simply
> > > have blkback export on sysfs the diskseq of the opened device, and let
> > > the block script check whether that's correct or not.  That implies
> > > less code in the kernel side, and doesn't pollute xenstore.
> > 
> > This would require that blkback delay exposing the device to the
> > frontend until the block script has checked that the diskseq is correct.
> 
> This depends on your toolstack implementation.  libxl won't start the
> domain until block scripts have finished execution, and hence the
> block script waiting for the sysfs node to appear and check it against
> the expected value would be enough.

True, but we cannot assume that everyone is using libxl.

> > Much simpler for the block script to provide the diskseq in xenstore.
> > If you want to avoid an extra xenstore node, I can make the diskseq part
> > of the physical-device node.
> 
> I'm thinking that we might want to introduce a "physical-device-uuid"
> node and use that to provide the diskseq to the backened.  Toolstacks
> (or block scripts) would need to be sure the "physical-device-uuid"
> node is populated before setting "physical-device", as writes to
> that node would still trigger blkback watch.  I think using two
> distinct watches would just make the logic in blkback too
> complicated.
> 
> My preference would be for the kernel to have a function for opening a
> device identified by a diskseq (as fetched from
> "physical-device-uuid"), so that we don't have to open using
> major:minor and then check the diskseq.

In theory I agree, but in practice it would be a significantly more
complex patch and given that it does not impact the uAPI I would prefer
the less-invasive option.  Is there anything more that needs to be done
here, other than replacing the "diskseq" name?  I prefer
"physical-device-luid" because the ID is only valid in one particular
VM.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 13/16] xen-blkback: Implement diskseq checks
  2023-06-21  1:14                     ` Demi Marie Obenour
@ 2023-06-21 10:07                       ` Roger Pau Monné
  0 siblings, 0 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-21 10:07 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Jens Axboe, Alasdair Kergon, Mike Snitzer, dm-devel,
	Marek Marczykowski-Górecki, linux-block, linux-kernel,
	xen-devel

On Tue, Jun 20, 2023 at 09:14:25PM -0400, Demi Marie Obenour wrote:
> On Mon, Jun 12, 2023 at 10:09:39AM +0200, Roger Pau Monné wrote:
> > On Fri, Jun 09, 2023 at 12:55:39PM -0400, Demi Marie Obenour wrote:
> > > On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> > > > On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > > > > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > > > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > > Then the block script will open the device by diskseq and pass the
> > > > > > major:minor numbers to blkback.
> > > > > 
> > > > > Alternatively, the toolstack could write both the diskseq and
> > > > > major:minor numbers and be confident that it is referring to the
> > > > > correct device, no matter how long ago it got that information.
> > > > > This could be quite useful for e.g. one VM exporting a device to
> > > > > another VM by calling losetup(8) and expecting a human to make a
> > > > > decision based on various properties about the device.  In this
> > > > > case there is no upper bound on the race window.
> > > > 
> > > > Instead of playing with xenstore nodes, it might be better to simply
> > > > have blkback export on sysfs the diskseq of the opened device, and let
> > > > the block script check whether that's correct or not.  That implies
> > > > less code in the kernel side, and doesn't pollute xenstore.
> > > 
> > > This would require that blkback delay exposing the device to the
> > > frontend until the block script has checked that the diskseq is correct.
> > 
> > This depends on your toolstack implementation.  libxl won't start the
> > domain until block scripts have finished execution, and hence the
> > block script waiting for the sysfs node to appear and check it against
> > the expected value would be enough.
> 
> True, but we cannot assume that everyone is using libxl.

Right, for the udev case this won't be good, since the domain could be
already running, and hence could potentially attach to the backend
before the hotplug script realized the opened device is wrong.
Likewise for hot add disks.

> > > Much simpler for the block script to provide the diskseq in xenstore.
> > > If you want to avoid an extra xenstore node, I can make the diskseq part
> > > of the physical-device node.
> > 
> > I'm thinking that we might want to introduce a "physical-device-uuid"
> > node and use that to provide the diskseq to the backened.  Toolstacks
> > (or block scripts) would need to be sure the "physical-device-uuid"
> > node is populated before setting "physical-device", as writes to
> > that node would still trigger blkback watch.  I think using two
> > distinct watches would just make the logic in blkback too
> > complicated.
> > 
> > My preference would be for the kernel to have a function for opening a
> > device identified by a diskseq (as fetched from
> > "physical-device-uuid"), so that we don't have to open using
> > major:minor and then check the diskseq.
> 
> In theory I agree, but in practice it would be a significantly more
> complex patch and given that it does not impact the uAPI I would prefer
> the less-invasive option.

From a blkback point of view I don't see that option as more invasive,
it's actually the other way around IMO.  On blkback we would use
blkdev_get_by_diskseq() (or equivalent) instead of
blkdev_get_by_dev(), so it would result in an overall simpler
change (because the check against diskseq won't be needed anymore).

> Is there anything more that needs to be done
> here, other than replacing the "diskseq" name?

I think we also spoke about using sscanf to parse the option.

The patch to Xen blkif.h needs to be accepted before the Linux side
can progress.


> I prefer
> "physical-device-luid" because the ID is only valid in one particular
> VM.

"physical-device-uid" then maybe?

Thanks, Roger.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2023-06-21 10:09 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-30 20:31 [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 01/16] device-mapper: Check that target specs are sufficiently aligned Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 02/16] device-mapper: Avoid pointer arithmetic overflow Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 03/16] device-mapper: do not allow targets to overlap 'struct dm_ioctl' Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 04/16] device-mapper: Better error message for too-short target spec Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 05/16] device-mapper: Target parameters must not overlap next " Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 06/16] device-mapper: Avoid double-fetch of version Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 07/16] device-mapper: Allow userspace to opt-in to strict parameter checks Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 08/16] device-mapper: Allow userspace to provide expected diskseq Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 09/16] device-mapper: Allow userspace to suppress uevent generation Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 10/16] device-mapper: Refuse to create device named "control" Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 11/16] device-mapper: "." and ".." are not valid symlink names Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 12/16] device-mapper: inform caller about already-existing device Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 13/16] xen-blkback: Implement diskseq checks Demi Marie Obenour
2023-06-06  8:25   ` Roger Pau Monné
2023-06-06 17:01     ` Demi Marie Obenour
2023-06-07  8:20       ` Roger Pau Monné
2023-06-07 16:14         ` Demi Marie Obenour
2023-06-08  8:29           ` Roger Pau Monné
2023-06-08 15:33             ` Demi Marie Obenour
2023-06-09 15:13               ` Roger Pau Monné
2023-06-09 16:55                 ` Demi Marie Obenour
2023-06-12  8:09                   ` Roger Pau Monné
2023-06-21  1:14                     ` Demi Marie Obenour
2023-06-21 10:07                       ` Roger Pau Monné
2023-05-30 20:31 ` [PATCH v2 14/16] block, loop: Increment diskseq when releasing a loop device Demi Marie Obenour
2023-05-30 20:31 ` [PATCH v2 15/16] xen-blkback: Minor cleanups Demi Marie Obenour
2023-06-06  8:36   ` Roger Pau Monné
2023-05-30 20:31 ` [PATCH v2 16/16] xen-blkback: Inform userspace that device has been opened Demi Marie Obenour
2023-06-06  9:15   ` Roger Pau Monné
2023-06-06 17:31     ` Demi Marie Obenour
2023-06-07  8:44       ` Roger Pau Monné
2023-06-07 16:29         ` Demi Marie Obenour
2023-06-08  9:11           ` Roger Pau Monné
2023-06-08 15:23             ` Demi Marie Obenour
2023-06-08 10:08   ` Roger Pau Monné
2023-06-08 15:24     ` Demi Marie Obenour
2023-05-31 13:06 ` [dm-devel] [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).