* [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
@ 2020-12-03 21:58 ` Palmer Dabbelt
  0 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, song, Palmer Dabbelt, shuah, linux-doc, linux-kernel,
	linux-raid, linux-kselftest, kernel-team

This patch set contains dm-user, a device mapper target that proxies incoming
BIOs to userspace via a misc device.  Essentially it's FUSE, but for block
devices.  There's more information in the documentation patch and in a handful
of comments, so I'm just going to avoid duplicating that here.  I don't really
think there's any fundamental functionality that dm-user enables, as one could
use something along the lines of nbd/iscsi, but dm-user does result in
extremely simple userspace daemons -- so simple that when I tried to write a
helper userspace library for dm-user I just ended up with nothing.

I talked about this a bit at Plumbers and was hoping to send patches a bit
earlier on in the process, but got tied up with a few things.  As a result this
is actually quite far along: it's at the point where we're starting to run this
on real devices as part of an updated Android OTA update flow, where we're
using this to provide an Android-specific compressed backing store for
dm-snap-persistent.  The bulk of that project is scattered throughout the
various Android trees, so I've included kselftests and a (somewhat bare for now)
Documentation entry with the intent of making this a self-contained
contribution.  There's a lot to the Android userspace daemon, but it doesn't
interact with dm-user in a very complex manner.

This is still in a somewhat early stage, but it's at the point where things
largely function.  I'm certainly not ready to commit to the user ABI
implemented here and there are a bunch of FIXMEs scattered throughout the code,
but I do think that it's far along enough to begin a more concrete discussion
of where folks would like to go with something like this.  While I intend to
sort that stuff out, I'd like to at least get a feel for whether this is
a path worth pursuing before spending a bunch more time on it.

I haven't done much in the way of performance analysis for dm-user.  Earlier on
I did some simple throughput tests and found that dm-user/ext4 ran at more than
half the speed of tmpfs, which is way out of the realm of being an issue for
our use case (decompressing blocks out of a phone's storage).  The design of
dm-user does preclude an extremely high performance implementation, where I
assume one would want an explicit ring buffer and zero copy, but I feel like
users who want that degree of performance are probably better served writing a
proper kernel driver.  I wouldn't be opposed to pushing on performance (ideally
without a major design change), but for now I feel like time is better spent
fortifying the user ABI and fixing the various issues with the implementation.

The patches follow as usual, but in case it's easier I've published a tree as
well:

    git://git.kernel.org/pub/scm/linux/kernel/git/palmer/dm-user.git -b dm-user-v1



^ permalink raw reply	[flat|nested] 43+ messages in thread


* [PATCH v1 1/5] Documentation: Describe dm-user
  2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-03 21:58   ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, song, Palmer Dabbelt, shuah, linux-doc, linux-kernel,
	linux-raid, linux-kselftest, kernel-team

From: Palmer Dabbelt <palmerdabbelt@google.com>

I started by patterning this after the FUSE documentation, which is
located at Documentation/filesystems/fuse.rst.  There's not a whole lot of that
left, though.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

---

This is a work in progress, but nothing in there should be incorrect.
---
 Documentation/block/dm-user.rst | 99 +++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)
 create mode 100644 Documentation/block/dm-user.rst

diff --git a/Documentation/block/dm-user.rst b/Documentation/block/dm-user.rst
new file mode 100644
index 000000000000..5eb3120f3fd5
--- /dev/null
+++ b/Documentation/block/dm-user.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+dm-user
+=======
+
+What is dm-user?
+================
+
+dm-user is a device mapper target that allows block accesses to be satisfied by
+an otherwise unprivileged daemon running in userspace.  Conceptually it is
+FUSE, but for block devices as opposed to file systems.
+
+Creating a dm-user Target
+=========================
+
+dm-user is implemented as a Device Mapper target, which allows for various
+device management tasks.  In general dm-user targets function in the same
+fashion as other device-mapper targets, with the exception that dm-user targets
+handle requests via a userspace daemon as opposed to one of various in-kernel
+mechanisms.  As such there is little difference between creating a dm-user
+target and any other device mapper target: the standard device mapper control
+device and ioctl() calls are used to create a table with at least one target of
+the "user" type.  Like all other targets this table entry needs a start/size
+pair.  The additional required argument is the name of the control device that
+will be associated with this target.  Specifically:
+
+::
+
+    user <start sector> <number of sectors> <path to control device>
+
+As a concrete example, the following `dmsetup` invocation will create a new
+device mapper block device available at `/dev/mapper/blk`, consisting entirely
+of a single target which can be controlled via a stream of messages passed over
+`/dev/dm-user/ctl`.
+
+::
+
+    dmsetup create blk <<EOF
+    0 1024 user 0 1024 ctl
+    EOF
+    dmsetup resume blk
+
+Userspace is expected to spin up a daemon to handle those block requests, which
+will block in the meantime.  The various Device Mapper operations should all
+function as they do for other targets.  If the userspace daemon terminates (or
+otherwise closes the control file descriptor) then the kernel will continue to
+queue up BIOs until userspace either starts a new daemon or destroys the
+target.
+
+Userspace may open each control device multiple times, in which case the kernel
+will distribute messages among each file instance.
+
+Writing a dm-user Daemon
+========================
+
+tools/testing/selftests/dm-user contains a handful of test daemons.
+functional/simple-read-all.c is also suitable as an example.
+
+Kernel - userspace interface
+****************************
+
+FIXME: A description of `struct dm_user_message`
+
+Kernel Implementation
+=====================
+
+BIO Lifecycle
+*************
+
+| "dd if=/dev/mapper/user ...             " | dm-user block server
+|                                           |
+|                                           | >sys_read()
+|                                           |  >dev_read()
+|                                           |   [sleep on c->wq]
+|                                           |
+| >sys_read()                               |
+|  [... block and DM layer ... ]            |
+|  >user_map()                              |
+|   [enqueue message]                       |
+|   [wake up c->wq]                         |
+|  <user_map()                              |   [woken up]
+| [sleep on BIO completion]                 |   [copy message to user]
+|                                           |  <dev_read()
+|                                           | <sys_read()
+|                                           |
+|                                           | [obtain request data]
+|                                           |
+|                                           | >sys_write()
+|                                           |  >dev_write()
+|                                           |   [copy message from user]
+|                                           |   [complete BIO]
+|  [woken up on BIO completion]             |  <dev_write()
+| <sys_read()                               | <sys_write()
+|                                           |
+| [write and loop]                          | [loop for more messages]
+
+Locking Scheme
+**************
-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply related	[flat|nested] 43+ messages in thread


* [PATCH v1 2/5] uapi: Add dm-user structure definition
  2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-03 21:58   ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, song, Palmer Dabbelt, shuah, linux-doc, linux-kernel,
	linux-raid, linux-kselftest, kernel-team

From: Palmer Dabbelt <palmerdabbelt@google.com>

dm-user is a device mapper target that allows a userspace process to
handle each incoming BIO.  Communication with userspace consists of a
stream of messages proxied over a misc device; the structure of each
message is defined in this header.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

---

As it currently stands this isn't really sufficient to be a stable user
ABI.  These are probably best discussed in the context of the dm-user
implementation, though, where they're largely called out as FIXMEs.
---
 include/uapi/linux/dm-user.h | 67 ++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 include/uapi/linux/dm-user.h

diff --git a/include/uapi/linux/dm-user.h b/include/uapi/linux/dm-user.h
new file mode 100644
index 000000000000..1806109518f4
--- /dev/null
+++ b/include/uapi/linux/dm-user.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: LGPL-2.0+ WITH Linux-syscall-note */
+/*
+ * Copyright (C) 2020 Palmer Dabbelt <palmerdabbelt@google.com>
+ */
+
+#ifndef _LINUX_DM_USER_H
+#define _LINUX_DM_USER_H
+
+#include <linux/types.h>
+
+/*
+ * dm-user proxies device mapper ops between the kernel and userspace.  It's
+ * essentially just an RPC mechanism: all kernel calls create a request,
+ * userspace handles that with a response.  Userspace obtains requests via
+ * read() and provides responses via write().
+ *
+ * See Documentation/block/dm-user.rst for more information.
+ */
+
+#define DM_USER_REQ_MAP_READ 0
+#define DM_USER_REQ_MAP_WRITE 1
+#define DM_USER_REQ_MAP_FLUSH 2
+#define DM_USER_REQ_MAP_DISCARD 3
+#define DM_USER_REQ_MAP_SECURE_ERASE 4
+#define DM_USER_REQ_MAP_WRITE_SAME 5
+#define DM_USER_REQ_MAP_WRITE_ZEROES 6
+#define DM_USER_REQ_MAP_ZONE_OPEN 7
+#define DM_USER_REQ_MAP_ZONE_CLOSE 8
+#define DM_USER_REQ_MAP_ZONE_FINISH 9
+#define DM_USER_REQ_MAP_ZONE_APPEND 10
+#define DM_USER_REQ_MAP_ZONE_RESET 11
+#define DM_USER_REQ_MAP_ZONE_RESET_ALL 12
+
+#define DM_USER_REQ_MAP_FLAG_FAILFAST_DEV 0x00001
+#define DM_USER_REQ_MAP_FLAG_FAILFAST_TRANSPORT 0x00002
+#define DM_USER_REQ_MAP_FLAG_FAILFAST_DRIVER 0x00004
+#define DM_USER_REQ_MAP_FLAG_SYNC 0x00008
+#define DM_USER_REQ_MAP_FLAG_META 0x00010
+#define DM_USER_REQ_MAP_FLAG_PRIO 0x00020
+#define DM_USER_REQ_MAP_FLAG_NOMERGE 0x00040
+#define DM_USER_REQ_MAP_FLAG_IDLE 0x00080
+#define DM_USER_REQ_MAP_FLAG_INTEGRITY 0x00100
+#define DM_USER_REQ_MAP_FLAG_FUA 0x00200
+#define DM_USER_REQ_MAP_FLAG_PREFLUSH 0x00400
+#define DM_USER_REQ_MAP_FLAG_RAHEAD 0x00800
+#define DM_USER_REQ_MAP_FLAG_BACKGROUND 0x01000
+#define DM_USER_REQ_MAP_FLAG_NOWAIT 0x02000
+#define DM_USER_REQ_MAP_FLAG_CGROUP_PUNT 0x04000
+#define DM_USER_REQ_MAP_FLAG_NOUNMAP 0x08000
+#define DM_USER_REQ_MAP_FLAG_HIPRI 0x10000
+#define DM_USER_REQ_MAP_FLAG_DRV 0x20000
+#define DM_USER_REQ_MAP_FLAG_SWAP 0x40000
+
+#define DM_USER_RESP_SUCCESS 0
+#define DM_USER_RESP_ERROR 1
+#define DM_USER_RESP_UNSUPPORTED 2
+
+struct dm_user_message {
+	__u64 seq;
+	__u64 type;
+	__u64 flags;
+	__u64 sector;
+	__u64 len;
+	__u8 buf[];
+};
+
+#endif
-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply related	[flat|nested] 43+ messages in thread


* [PATCH v1 3/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-03 21:58   ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, song, Palmer Dabbelt, shuah, linux-doc, linux-kernel,
	linux-raid, linux-kselftest, kernel-team

From: Palmer Dabbelt <palmerdabbelt@google.com>

dm-user is a device mapper target that allows a userspace process to
handle each incoming BIO -- essentially it's FUSE, but for the block
layer.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

---

There are numerous issues with this, which I've enumerated via FIXMEs
scattered throughout the code.  While it's obviously in no shape to be
merged, this does at least function at a basic level (the next patch has
some tests).  Many of the FIXMEs are simply missing functionality, but I
wanted to send this out earlier rather than later as I have some higher
level questions:

* Does it even make sense to have this within device mapper?  There's no
  fundamental reason for this to be a device mapper target (i.e., it could
  just be its own block device), but being one does allow us to
  piggyback on existing mechanisms to handle the device lifecycle.
* Is dm-user (in cooperation with the userspace daemon) responsible for
  ordering flush-related BIOs with any other BIOs, or is that handled
  elsewhere within the kernel?
* Is my shared target mutex legal?
* Is there any benefit to returning DM_MAPIO_KILLED as opposed to later
  terminating the BIO with an IO error after it has been submitted?

Each of the above is discussed in more detail in the code.
---
 drivers/md/Kconfig   |   13 +
 drivers/md/Makefile  |    1 +
 drivers/md/dm-user.c | 1227 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1241 insertions(+)
 create mode 100644 drivers/md/dm-user.c

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 30ba3573626c..bcafca0e571d 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -617,4 +617,17 @@ config DM_ZONED
 
 	  If unsure, say N.
 
+config DM_USER
+	tristate "Block device in userspace"
+	depends on BLK_DEV_DM
+	help
+	  This device-mapper target allows a userspace daemon to provide the
+	  contents of a block device.  See
+	  <file:Documentation/block/dm-user.rst> for more information.
+
+	  To compile this code as a module, choose M here: the module will be
+	  called dm-user.
+
+	  If unsure, say N.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 6d3e234dc46a..82ae3d496a00 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
 obj-$(CONFIG_DM_UNSTRIPED)	+= dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO)		+= dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON)	+= dm-bio-prison.o
+obj-$(CONFIG_DM_USER)		+= dm-user.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
diff --git a/drivers/md/dm-user.c b/drivers/md/dm-user.c
new file mode 100644
index 000000000000..0aaa8f39f18a
--- /dev/null
+++ b/drivers/md/dm-user.c
@@ -0,0 +1,1227 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2020 Palmer Dabbelt <palmerdabbelt@google.com>
+ */
+
+#include <linux/device-mapper.h>
+#include <uapi/linux/dm-user.h>
+
+#include <linux/bio.h>
+#include <linux/init.h>
+#include <linux/mempool.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/uio.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "user"
+
+#define MAX_OUTSTANDING_MESSAGES 128
+
+/*
+ * dm-user uses four structures:
+ *
+ *  - "struct target", the outermost structure, corresponds to a single device
+ *    mapper target.  This contains the set of outstanding BIOs that have been
+ *    provided by DM and are not actively being processed by the user, along
+ *    with a misc device that userspace can open to communicate with the
+ *    kernel.  Each time userspace opens the misc device a new channel is
+ *    created.
+ *  - "struct channel", which represents a single active communication channel
+ *    with userspace.  Userspace may choose arbitrary read/write sizes to use
+ *    when processing messages; channels form these into logical accesses.
+ *    When userspace responds to a full message the channel completes the BIO
+ *    and obtains a new message to process from the target.
+ *  - "struct message", which wraps a BIO with the additional information
+ *    required by the kernel to sort out what to do with BIOs when they return
+ *    from userspace.
+ *  - "struct dm_user_message", which is the exact message format that
+ *    userspace sees.
+ *
+ * The hot path contains three distinct operations:
+ *
+ *  - user_map(), which is provided a BIO from device mapper that is queued
+ *    into the target.  This allocates and enqueues a new message.
+ *  - dev_read(), which dequeues a message and copies it to userspace.
+ *  - dev_write(), which looks up a message (keyed by sequence number) and
+ *    completes the corresponding BIO.
+ *
+ * Lock ordering (outer to inner)
+ *
+ * 1) miscdevice's global lock.  This is held around dev_open, so it has to be
+ *    the outermost lock.
+ * 2) target->lock
+ * 3) channel->lock
+ */
+
+struct message {
+	/*
+	 * Messages themselves do not need a lock, they're protected by either
+	 * the target or channel's lock, depending on which can reference them
+	 * directly.
+	 */
+	struct dm_user_message msg;
+	struct bio *bio;
+	size_t posn_to_user;
+	size_t total_to_user;
+	size_t posn_from_user;
+	size_t total_from_user;
+
+	struct list_head from_user;
+	struct list_head to_user;
+
+	/*
+	 * These are written back from the user.  They live in the same spot in
+	 * the message, but we need to either keep the old values around or
+	 * call a bunch more BIO helpers.  These are only valid after write has
+	 * adopted the message.
+	 */
+	u64 return_type;
+	u64 return_flags;
+};
+
+struct target {
+	/*
+	 * A target has a single lock, which protects everything in the target
+	 * (but does not protect the channels associated with a target).
+	 */
+	struct mutex lock;
+
+	/*
+	 * There is only one point at which anything blocks: userspace blocks
+	 * reading a new message, which is woken up by device mapper providing
+	 * a new BIO to process (or tearing down the target).  The
+	 * corresponding write side doesn't block, instead we treat userspace's
+	 * response containing a message that has yet to be mapped as an
+	 * invalid operation.
+	 */
+	struct wait_queue_head wq;
+
+	/*
+	 * Messages are delivered to userspace in order, but may be returned
+	 * out of order.  This allows userspace to schedule IO if it wants to.
+	 */
+	mempool_t message_pool;
+	u64 next_seq_to_map;
+	u64 next_seq_to_user;
+	struct list_head to_user;
+
+	/*
+	 * There is a misc device per target.  The name is selected by
+	 * userspace (via a DM create ioctl argument), and each ends up in
+	 * /dev/dm-user/.  It looks like a better way to do this may be to have
+	 * a filesystem to manage these, but this was more expedient.  The
+	 * current mechanism is functional, but does result in an arbitrary
+	 * number of dynamically created misc devices.
+	 */
+	struct miscdevice miscdev;
+
+	/*
+	 * Device mapper's target destructor triggers tearing this all down,
+	 * but we can't actually free until every channel associated with this
+	 * target has been destroyed.  Channels each have a reference to their
+	 * target, and there is an additional single reference that corresponds
+	 * to both DM and the misc device (both of which are destroyed by DM).
+	 *
+	 * In the common case userspace will be asleep waiting for a new
+	 * message when device mapper decides to destroy the target, which
+	 * means no new messages will appear.  The destroyed flag triggers a
+	 * wakeup, which will end up removing the reference.
+	 */
+	struct kref references;
+	int dm_destroyed;
+};
+
+struct channel {
+	struct target *target;
+
+	/*
+	 * A channel has a single lock, which prevents multiple reads (or
+	 * multiple writes) from conflicting with each other.
+	 */
+	struct mutex lock;
+
+	struct message *cur_to_user;
+	struct message *cur_from_user;
+	ssize_t to_user_error;
+	ssize_t from_user_error;
+
+	/*
+	 * Once a message has been forwarded to userspace on a channel it must
+	 * be responded to on the same channel.  This allows us to error out
+	 * the messages that have not yet been responded to by a channel when
+	 * that channel closes, which makes handling errors more reasonable for
+	 * fault-tolerant userspace daemons.  It also happens to make avoiding
+	 * shared locks between user_map() and dev_read() a lot easier.
+	 *
+	 * This does preclude a multi-threaded work stealing userspace
+	 * implementation (or at least, force a degree of head-of-line blocking
+	 * on the response path).
+	 */
+	struct list_head from_user;
+
+	/*
+	 * Responses from userspace can arrive in arbitrarily small chunks.
+	 * We need some place to buffer one up until we can find the
+	 * corresponding kernel-side message to continue processing, so instead
+	 * of allocating them we just keep one off to the side here.  This can
+	 * only ever be pointed to by from_user_cur, and will never have a BIO.
+	 */
+	struct message scratch_message_from_user;
+};
+
+static inline struct target *target_from_target(struct dm_target *target)
+{
+	WARN_ON(target->private == NULL);
+	return target->private;
+}
+
+static inline struct target *target_from_miscdev(struct miscdevice *miscdev)
+{
+	return container_of(miscdev, struct target, miscdev);
+}
+
+static inline struct channel *channel_from_file(struct file *file)
+{
+	WARN_ON(file->private_data == NULL);
+	return file->private_data;
+}
+
+static inline struct target *target_from_channel(struct channel *c)
+{
+	WARN_ON(c->target == NULL);
+	return c->target;
+}
+
+static inline size_t bio_size(struct bio *bio)
+{
+	struct bio_vec bvec;
+	struct bvec_iter iter;
+	size_t out = 0;
+
+	bio_for_each_segment(bvec, bio, iter)
+		out += bio_iter_len(bio, iter);
+	return out;
+}
+
+static inline size_t bio_bytes_needed_to_user(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_WRITE:
+		return sizeof(struct dm_user_message) + bio_size(bio);
+	case REQ_OP_READ:
+	case REQ_OP_FLUSH:
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_APPEND:
+	case REQ_OP_ZONE_RESET:
+		return sizeof(struct dm_user_message);
+
+	/*
+	 * These ops are not passed to userspace under the assumption that
+	 * they're not going to be particularly useful in that context.
+	 */
+	case REQ_OP_SCSI_IN:
+	case REQ_OP_SCSI_OUT:
+	case REQ_OP_DRV_IN:
+	case REQ_OP_DRV_OUT:
+	/* Anything new isn't supported, at least not yet. */
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static inline size_t bio_bytes_needed_from_user(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_READ:
+		return sizeof(struct dm_user_message) + bio_size(bio);
+	case REQ_OP_WRITE:
+	case REQ_OP_FLUSH:
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_APPEND:
+	case REQ_OP_ZONE_RESET:
+		return sizeof(struct dm_user_message);
+
+	/*
+	 * These ops are not passed to userspace under the assumption that
+	 * they're not going to be particularly useful in that context.
+	 */
+	case REQ_OP_SCSI_IN:
+	case REQ_OP_SCSI_OUT:
+	case REQ_OP_DRV_IN:
+	case REQ_OP_DRV_OUT:
+	/* Anything new isn't supported, at least not yet. */
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static inline long bio_type_to_user_type(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_READ:
+		return DM_USER_REQ_MAP_READ;
+	case REQ_OP_WRITE:
+		return DM_USER_REQ_MAP_WRITE;
+	case REQ_OP_FLUSH:
+		return DM_USER_REQ_MAP_FLUSH;
+	case REQ_OP_DISCARD:
+		return DM_USER_REQ_MAP_DISCARD;
+	case REQ_OP_SECURE_ERASE:
+		return DM_USER_REQ_MAP_SECURE_ERASE;
+	case REQ_OP_WRITE_SAME:
+		return DM_USER_REQ_MAP_WRITE_SAME;
+	case REQ_OP_WRITE_ZEROES:
+		return DM_USER_REQ_MAP_WRITE_ZEROES;
+	case REQ_OP_ZONE_OPEN:
+		return DM_USER_REQ_MAP_ZONE_OPEN;
+	case REQ_OP_ZONE_CLOSE:
+		return DM_USER_REQ_MAP_ZONE_CLOSE;
+	case REQ_OP_ZONE_FINISH:
+		return DM_USER_REQ_MAP_ZONE_FINISH;
+	case REQ_OP_ZONE_APPEND:
+		return DM_USER_REQ_MAP_ZONE_APPEND;
+	case REQ_OP_ZONE_RESET:
+		return DM_USER_REQ_MAP_ZONE_RESET;
+
+	/*
+	 * These ops are not passed to userspace under the assumption that
+	 * they're not going to be particularly useful in that context.
+	 */
+	case REQ_OP_SCSI_IN:
+	case REQ_OP_SCSI_OUT:
+	case REQ_OP_DRV_IN:
+	case REQ_OP_DRV_OUT:
+	/* Anything new isn't supported, at least not yet. */
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static inline long bio_flags_to_user_flags(struct bio *bio)
+{
+	u64 out = 0;
+	typeof(bio->bi_opf) opf = bio->bi_opf & ~REQ_OP_MASK;
+
+	if (opf & REQ_FAILFAST_DEV) {
+		opf &= ~REQ_FAILFAST_DEV;
+		out |= DM_USER_REQ_MAP_FLAG_FAILFAST_DEV;
+	}
+
+	if (opf & REQ_FAILFAST_TRANSPORT) {
+		opf &= ~REQ_FAILFAST_TRANSPORT;
+		out |= DM_USER_REQ_MAP_FLAG_FAILFAST_TRANSPORT;
+	}
+
+	if (opf & REQ_FAILFAST_DRIVER) {
+		opf &= ~REQ_FAILFAST_DRIVER;
+		out |= DM_USER_REQ_MAP_FLAG_FAILFAST_DRIVER;
+	}
+
+	if (opf & REQ_SYNC) {
+		opf &= ~REQ_SYNC;
+		out |= DM_USER_REQ_MAP_FLAG_SYNC;
+	}
+
+	if (opf & REQ_META) {
+		opf &= ~REQ_META;
+		out |= DM_USER_REQ_MAP_FLAG_META;
+	}
+
+	if (opf & REQ_PRIO) {
+		opf &= ~REQ_PRIO;
+		out |= DM_USER_REQ_MAP_FLAG_PRIO;
+	}
+
+	if (opf & REQ_NOMERGE) {
+		opf &= ~REQ_NOMERGE;
+		out |= DM_USER_REQ_MAP_FLAG_NOMERGE;
+	}
+
+	if (opf & REQ_IDLE) {
+		opf &= ~REQ_IDLE;
+		out |= DM_USER_REQ_MAP_FLAG_IDLE;
+	}
+
+	if (opf & REQ_INTEGRITY) {
+		opf &= ~REQ_INTEGRITY;
+		out |= DM_USER_REQ_MAP_FLAG_INTEGRITY;
+	}
+
+	if (opf & REQ_FUA) {
+		opf &= ~REQ_FUA;
+		out |= DM_USER_REQ_MAP_FLAG_FUA;
+	}
+
+	if (opf & REQ_PREFLUSH) {
+		opf &= ~REQ_PREFLUSH;
+		out |= DM_USER_REQ_MAP_FLAG_PREFLUSH;
+	}
+
+	if (opf & REQ_RAHEAD) {
+		opf &= ~REQ_RAHEAD;
+		out |= DM_USER_REQ_MAP_FLAG_RAHEAD;
+	}
+
+	if (opf & REQ_BACKGROUND) {
+		opf &= ~REQ_BACKGROUND;
+		out |= DM_USER_REQ_MAP_FLAG_BACKGROUND;
+	}
+
+	if (opf & REQ_NOWAIT) {
+		opf &= ~REQ_NOWAIT;
+		out |= DM_USER_REQ_MAP_FLAG_NOWAIT;
+	}
+
+	if (opf & REQ_CGROUP_PUNT) {
+		opf &= ~REQ_CGROUP_PUNT;
+		out |= DM_USER_REQ_MAP_FLAG_CGROUP_PUNT;
+	}
+
+	if (opf & REQ_NOUNMAP) {
+		opf &= ~REQ_NOUNMAP;
+		out |= DM_USER_REQ_MAP_FLAG_NOUNMAP;
+	}
+
+	if (opf & REQ_HIPRI) {
+		opf &= ~REQ_HIPRI;
+		out |= DM_USER_REQ_MAP_FLAG_HIPRI;
+	}
+
+	if (unlikely(opf)) {
+		pr_warn("unsupported BIO type %x\n", opf);
+		return -EOPNOTSUPP;
+	}
+	WARN_ON((long)out < 0);
+	return out;
+}
+
+/*
+ * Not quite what's in blk-map.c, but instead what I thought the functions in
+ * blk-map did.  This one seems more generally useful and I think we could
+ * write the blk-map version in terms of this one.  The differences are that
+ * this has a return value that counts bytes copied, and blk-map uses the BIO
+ * _all iterators.  Neither version advances the BIO iterator, but both
+ * advance the IOV iterator, which is a bit asymmetric.
+ */
+static ssize_t bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
+{
+	struct bio_vec bvec;
+	struct bvec_iter biter;
+	ssize_t out = 0;
+
+	bio_for_each_segment(bvec, bio, biter) {
+		ssize_t ret;
+
+		ret = copy_page_from_iter(bvec.bv_page, bvec.bv_offset,
+					  bvec.bv_len, iter);
+
+		/*
+		 * FIXME: I thought that IOV copies had a mechanism for
+		 * terminating early, if for example a signal came in while
+		 * sleeping waiting for a page to be mapped, but I don't see
+		 * where that would happen.
+		 */
+		WARN_ON(ret < 0);
+		out += ret;
+
+		if (!iov_iter_count(iter))
+			break;
+
+		if (ret < bvec.bv_len)
+			return ret;
+	}
+
+	return out;
+}
+
+static ssize_t bio_copy_to_iter(struct bio *bio, struct iov_iter *iter)
+{
+	struct bio_vec bvec;
+	struct bvec_iter biter;
+	ssize_t out = 0;
+
+	bio_for_each_segment(bvec, bio, biter) {
+		ssize_t ret;
+
+		ret = copy_page_to_iter(bvec.bv_page, bvec.bv_offset,
+					bvec.bv_len, iter);
+
+		/* as above */
+		WARN_ON(ret < 0);
+		out += ret;
+
+		if (!iov_iter_count(iter))
+			break;
+
+		if (ret < bvec.bv_len)
+			return ret;
+	}
+
+	return out;
+}
+
+static ssize_t msg_copy_to_iov(struct message *msg, struct iov_iter *to)
+{
+	ssize_t copied = 0;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	if (msg->posn_to_user < sizeof(msg->msg)) {
+		copied = copy_to_iter((char *)(&msg->msg) + msg->posn_to_user,
+				      sizeof(msg->msg) - msg->posn_to_user, to);
+	} else {
+		copied = bio_copy_to_iter(msg->bio, to);
+		if (copied > 0)
+			bio_advance(msg->bio, copied);
+	}
+
+	if (copied < 0)
+		return copied;
+
+	msg->posn_to_user += copied;
+	return copied;
+}
+
+static ssize_t msg_copy_from_iov(struct message *msg, struct iov_iter *from)
+{
+	ssize_t copied = 0;
+
+	if (!iov_iter_count(from))
+		return 0;
+
+	if (msg->posn_from_user < sizeof(msg->msg)) {
+		copied = copy_from_iter(
+			(char *)(&msg->msg) + msg->posn_from_user,
+			sizeof(msg->msg) - msg->posn_from_user, from);
+	} else {
+		copied = bio_copy_from_iter(msg->bio, from);
+		if (copied > 0)
+			bio_advance(msg->bio, copied);
+	}
+
+	if (copied < 0)
+		return copied;
+
+	msg->posn_from_user += copied;
+	return copied;
+}
+
+static struct message *msg_get_map(struct target *t)
+{
+	struct message *m;
+
+	lockdep_assert_held(&t->lock);
+
+	m = mempool_alloc(&t->message_pool, GFP_NOIO);
+	m->msg.seq = t->next_seq_to_map++;
+	INIT_LIST_HEAD(&m->to_user);
+	INIT_LIST_HEAD(&m->from_user);
+	return m;
+}
+
+static struct message *msg_get_to_user(struct target *t)
+{
+	struct message *m;
+
+	lockdep_assert_held(&t->lock);
+
+	if (list_empty(&t->to_user))
+		return NULL;
+
+	m = list_first_entry(&t->to_user, struct message, to_user);
+	list_del(&m->to_user);
+	return m;
+}
+
+static struct message *msg_get_from_user(struct channel *c, u64 seq)
+{
+	struct message *m;
+	struct list_head *cur;
+
+	lockdep_assert_held(&c->lock);
+
+	list_for_each(cur, &c->from_user) {
+		m = list_entry(cur, struct message, from_user);
+		if (m->msg.seq == seq) {
+			list_del(&m->from_user);
+			return m;
+		}
+	}
+
+	return NULL;
+}
+
+static void message_kill(struct message *m, mempool_t *pool)
+{
+	m->bio->bi_status = BLK_STS_IOERR;
+	bio_endio(m->bio);
+	bio_put(m->bio);
+	mempool_free(m, pool);
+}
+
+/*
+ * Returns 0 when there is no work left to do.  This must be callable without
+ * holding the target lock, as it is part of the waitqueue's check expression.
+ * When called without the lock it may spuriously indicate there is remaining
+ * work, but when called with the lock it must be accurate.
+ */
+static int target_poll(struct target *t)
+{
+	return !list_empty(&t->to_user) || t->dm_destroyed;
+}
+
+static void target_release(struct kref *ref)
+{
+	struct target *t = container_of(ref, struct target, references);
+	struct message *m, *tmp;
+
+	/*
+	 * There may be outstanding BIOs that have not yet been given to
+	 * userspace.  At this point there's nothing we can do about them, as
+	 * there are and will never be any channels.  message_kill() frees each
+	 * message, so walk the list with the _safe iterator.
+	 */
+	list_for_each_entry_safe(m, tmp, &t->to_user, to_user)
+		message_kill(m, &t->message_pool);
+
+	mempool_exit(&t->message_pool);
+	mutex_unlock(&t->lock);
+	mutex_destroy(&t->lock);
+	kfree(t);
+}
+
+static void target_put(struct target *t)
+{
+	/*
+	 * This releases both a reference to the target and the lock.  We leave
+	 * it up to the caller to hold the lock, as they probably needed it for
+	 * something else.
+	 */
+	lockdep_assert_held(&t->lock);
+
+	if (!kref_put(&t->references, target_release))
+		mutex_unlock(&t->lock);
+}
+
+static struct channel *channel_alloc(struct target *t)
+{
+	struct channel *c;
+
+	lockdep_assert_held(&t->lock);
+
+	c = kzalloc(sizeof(*c), GFP_KERNEL);
+	if (c == NULL)
+		return NULL;
+
+	kref_get(&t->references);
+	c->target = t;
+	c->cur_from_user = &c->scratch_message_from_user;
+	mutex_init(&c->lock);
+	INIT_LIST_HEAD(&c->from_user);
+	return c;
+}
+
+static void channel_free(struct channel *c)
+{
+	struct message *m, *tmp;
+
+	lockdep_assert_held(&c->lock);
+
+	/*
+	 * There may be outstanding BIOs that have been given to userspace but
+	 * have not yet been completed.  The channel has been shut down so
+	 * there's no way to process the rest of those messages, so we just go
+	 * ahead and error out the BIOs.  Hopefully whatever's on the other end
+	 * can handle the errors.  One could imagine splitting the BIOs and
+	 * completing as much as we got, but that seems like overkill here.
+	 *
+	 * Our only other options would be to let the BIO hang around (which
+	 * seems way worse) or to resubmit it to userspace in the hope there's
+	 * another channel.  I don't really like the idea of submitting a
+	 * message twice.
+	 */
+	if (c->cur_to_user != NULL)
+		message_kill(c->cur_to_user, &c->target->message_pool);
+	if (c->cur_from_user != &c->scratch_message_from_user)
+		message_kill(c->cur_from_user, &c->target->message_pool);
+	list_for_each_entry_safe(m, tmp, &c->from_user, from_user)
+		message_kill(m, &c->target->message_pool);
+
+	mutex_lock(&c->target->lock);
+	target_put(c->target);
+	mutex_unlock(&c->lock);
+	mutex_destroy(&c->lock);
+	kfree(c);
+}
+
+static int dev_open(struct inode *inode, struct file *file)
+{
+	struct channel *c;
+	struct target *t;
+
+	/*
+	 * This is called by miscdev, which sets private_data to point to the
+	 * struct miscdevice that was opened.  The rest of our file operations
+	 * want to refer to the channel that's been opened, so we swap that
+	 * pointer out with a fresh channel.
+	 *
+	 * This is called with the miscdev lock held, which is also held while
+	 * registering/unregistering the miscdev.  The miscdev must be
+	 * registered for this to get called, which means there must be an
+	 * outstanding reference to the target, which means it cannot be freed
+	 * out from under us despite us not holding a reference yet.
+	 */
+	t = container_of(file->private_data, struct target, miscdev);
+	mutex_lock(&t->lock);
+	file->private_data = c = channel_alloc(t);
+
+	if (c == NULL) {
+		mutex_unlock(&t->lock);
+		return -ENOMEM;
+	}
+
+	mutex_unlock(&t->lock);
+	return 0;
+}
+
+static ssize_t dev_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct channel *c = channel_from_file(iocb->ki_filp);
+	ssize_t total_processed = 0;
+	ssize_t processed;
+
+	mutex_lock(&c->lock);
+
+	if (unlikely(c->to_user_error)) {
+		total_processed = c->to_user_error;
+		goto cleanup_unlock;
+	}
+
+	if (c->cur_to_user == NULL) {
+		struct target *t = target_from_channel(c);
+
+		mutex_lock(&t->lock);
+
+		while (!target_poll(t)) {
+			int e;
+
+			mutex_unlock(&t->lock);
+			mutex_unlock(&c->lock);
+			e = wait_event_interruptible(t->wq, target_poll(t));
+			mutex_lock(&c->lock);
+			mutex_lock(&t->lock);
+
+			if (unlikely(e != 0)) {
+				/*
+				 * We haven't processed any bytes in either the
+				 * BIO or the IOV, so we can just terminate
+				 * right now.  Code elsewhere in the kernel
+				 * handles restarting the syscall when
+				 * appropriate.
+				 */
+				total_processed = e;
+				mutex_unlock(&t->lock);
+				goto cleanup_unlock;
+			}
+		}
+
+		if (unlikely(t->dm_destroyed)) {
+			/*
+			 * DM has destroyed this target, so just lock
+			 * the user out.  There's really nothing else
+			 * we can do here.  Note that we don't actually
+			 * tear any thing down until userspace has
+			 * closed the FD, as there may still be
+			 * outstanding BIOs.
+			 *
+			 * This is kind of a wacky error code to
+			 * return.  My goal was really just to try and
+			 * find something that wasn't likely to be
+			 * returned by anything else in the miscdev
+			 * path.  The message "block device required"
+			 * seems like a somewhat reasonable thing to
+			 * say when the target has disappeared out from
+			 * under us, but "not block" isn't sensible.
+			 */
+			c->to_user_error = total_processed = -ENOTBLK;
+			mutex_unlock(&t->lock);
+			goto cleanup_unlock;
+		}
+
+		/*
+		 * Ensures that accesses to the message data are not ordered
+		 * before the remote accesses that produce that message data.
+		 *
+		 * This pairs with the barrier in user_map(), via the
+		 * conditional within the while loop above. Also see the lack
+		 * of barrier in user_dtr(), which is why this can be after the
+		 * destroyed check.
+		 */
+		smp_rmb();
+
+		c->cur_to_user = msg_get_to_user(t);
+		WARN_ON(c->cur_to_user == NULL);
+		mutex_unlock(&t->lock);
+	}
+
+	processed = msg_copy_to_iov(c->cur_to_user, to);
+	total_processed += processed;
+
+	WARN_ON(c->cur_to_user->posn_to_user > c->cur_to_user->total_to_user);
+	if (c->cur_to_user->posn_to_user == c->cur_to_user->total_to_user) {
+		struct message *m = c->cur_to_user;
+
+		c->cur_to_user = NULL;
+		list_add_tail(&m->from_user, &c->from_user);
+	}
+
+cleanup_unlock:
+	mutex_unlock(&c->lock);
+	return total_processed;
+}
+
+static ssize_t dev_splice_read(struct file *in, loff_t *ppos,
+			       struct pipe_inode_info *pipe, size_t len,
+			       unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t dev_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct channel *c = channel_from_file(iocb->ki_filp);
+	ssize_t total_processed = 0;
+	ssize_t processed;
+
+	mutex_lock(&c->lock);
+
+	if (unlikely(c->from_user_error)) {
+		total_processed = c->from_user_error;
+		goto cleanup_unlock;
+	}
+
+	/*
+	 * cur_from_user can never be NULL.  If there's no real message it must
+	 * point to the scratch space.
+	 */
+	WARN_ON(c->cur_from_user == NULL);
+	if (c->cur_from_user->posn_from_user < sizeof(struct dm_user_message)) {
+		struct message *msg, *old;
+
+		processed = msg_copy_from_iov(c->cur_from_user, from);
+		if (processed <= 0) {
+			pr_warn("msg_copy_from_iov() returned %zd\n",
+				processed);
+			c->from_user_error = -EINVAL;
+			goto cleanup_unlock;
+		}
+		total_processed += processed;
+
+		/*
+		 * In the unlikely event the user has provided us a very short
+		 * write, not even big enough to fill a message, just succeed.
+		 * We'll eventually build up enough bytes to do something.
+		 */
+		if (unlikely(c->cur_from_user->posn_from_user <
+			     sizeof(struct dm_user_message)))
+			goto cleanup_unlock;
+
+		old = c->cur_from_user;
+		mutex_lock(&c->target->lock);
+		msg = msg_get_from_user(c, c->cur_from_user->msg.seq);
+		if (msg == NULL) {
+			pr_info("user provided an invalid message seq of %llx\n",
+				old->msg.seq);
+			mutex_unlock(&c->target->lock);
+			c->from_user_error = -EINVAL;
+			goto cleanup_unlock;
+		}
+		mutex_unlock(&c->target->lock);
+
+		WARN_ON(old->posn_from_user != sizeof(struct dm_user_message));
+		msg->posn_from_user = sizeof(struct dm_user_message);
+		msg->return_type = old->msg.type;
+		msg->return_flags = old->msg.flags;
+		WARN_ON(msg->posn_from_user > msg->total_from_user);
+		c->cur_from_user = msg;
+		WARN_ON(old != &c->scratch_message_from_user);
+	}
+
+	/*
+	 * Userspace can signal an error for single requests by overwriting the
+	 * seq field.
+	 */
+	switch (c->cur_from_user->return_type) {
+	case DM_USER_RESP_SUCCESS:
+		c->cur_from_user->bio->bi_status = BLK_STS_OK;
+		break;
+	case DM_USER_RESP_ERROR:
+	case DM_USER_RESP_UNSUPPORTED:
+	default:
+		c->cur_from_user->bio->bi_status = BLK_STS_IOERR;
+		goto finish_bio;
+	}
+
+	/*
+	 * The op was a success as far as userspace is concerned, so process
+	 * whatever data may come along with it.  The user may provide the BIO
+	 * data in multiple chunks, in which case we don't need to finish the
+	 * BIO.
+	 */
+	processed = msg_copy_from_iov(c->cur_from_user, from);
+	total_processed += processed;
+
+	if (c->cur_from_user->posn_from_user <
+	    c->cur_from_user->total_from_user)
+		goto cleanup_unlock;
+
+finish_bio:
+	/*
+	 * When we set up this message the BIO's size matched the
+	 * message size, if that's not still the case then something
+	 * has gone off the rails.
+	 */
+	WARN_ON(bio_size(c->cur_from_user->bio) != 0);
+	bio_endio(c->cur_from_user->bio);
+	bio_put(c->cur_from_user->bio);
+
+	/*
+	 * We don't actually need to take the target lock here, as all
+	 * we're doing is freeing the message and mempools have their
+	 * own lock.  Each channel has its own scratch message.
+	 */
+	WARN_ON(c->cur_from_user == &c->scratch_message_from_user);
+	mempool_free(c->cur_from_user, &c->target->message_pool);
+	c->scratch_message_from_user.posn_from_user = 0;
+	c->cur_from_user = &c->scratch_message_from_user;
+
+cleanup_unlock:
+	mutex_unlock(&c->lock);
+	return total_processed;
+}
+
+static ssize_t dev_splice_write(struct pipe_inode_info *pipe, struct file *out,
+				loff_t *ppos, size_t len, unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static __poll_t dev_poll(struct file *file, poll_table *wait)
+{
+	return -EOPNOTSUPP;
+}
+
+static int dev_release(struct inode *inode, struct file *file)
+{
+	struct channel *c;
+
+	c = channel_from_file(file);
+	mutex_lock(&c->lock);
+	channel_free(c);
+
+	return 0;
+}
+
+static int dev_fasync(int fd, struct file *file, int on)
+{
+	return -EOPNOTSUPP;
+}
+
+static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct file_operations file_operations = {
+	.owner = THIS_MODULE,
+	.open = dev_open,
+	.llseek = no_llseek,
+	.read_iter = dev_read,
+	.splice_read = dev_splice_read,
+	.write_iter = dev_write,
+	.splice_write = dev_splice_write,
+	.poll = dev_poll,
+	.release = dev_release,
+	.fasync = dev_fasync,
+	.unlocked_ioctl = dev_ioctl,
+};
+
+static int user_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct target *t;
+	int r;
+
+	if (argc != 3) {
+		ti->error = "Invalid argument count";
+		r = -EINVAL;
+		goto cleanup_none;
+	}
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (t == NULL) {
+		r = -ENOMEM;
+		goto cleanup_none;
+	}
+	ti->private = t;
+
+	/*
+	 * We begin with a single reference to the target, which is miscdev's
+	 * reference.  This ensures that the target won't be freed
+	 * until after the miscdev has been unregistered and all extant
+	 * channels have been closed.
+	 */
+	kref_init(&t->references);
+	kref_get(&t->references);
+
+	mutex_init(&t->lock);
+	init_waitqueue_head(&t->wq);
+	INIT_LIST_HEAD(&t->to_user);
+	mempool_init_kmalloc_pool(&t->message_pool, MAX_OUTSTANDING_MESSAGES,
+				  sizeof(struct message));
+
+	t->miscdev.minor = MISC_DYNAMIC_MINOR;
+	t->miscdev.fops = &file_operations;
+	t->miscdev.name = kasprintf(GFP_KERNEL, "dm-user/%s", argv[2]);
+	if (t->miscdev.name == NULL) {
+		r = -ENOMEM;
+		goto cleanup_message_pool;
+	}
+
+	/*
+	 * Once the miscdev is registered it can be opened and therefore
+	 * concurrent references to the channel can happen.  Holding the target
+	 * lock during misc_register() could deadlock.  If registration
+	 * succeeds then we will not access the target again so we just stick a
+	 * barrier here, which pairs with taking the target lock everywhere
+	 * else the target is accessed.
+	 *
+	 * I forgot where we ended up on the RCpc/RCsc locks.  IIUC, RCsc locks
+	 * would mean that we could take the target lock earlier and release it
+	 * here instead of the memory barrier.  I'm not sure that's any better,
+	 * though, and this isn't on a hot path so it probably doesn't matter
+	 * either way.
+	 */
+	smp_mb();
+
+	r = misc_register(&t->miscdev);
+	if (r) {
+		DMERR("Unable to register miscdev %s for dm-user",
+		      t->miscdev.name);
+		goto cleanup_misc_name;
+	}
+
+	return 0;
+
+cleanup_misc_name:
+	kfree(t->miscdev.name);
+cleanup_message_pool:
+	mempool_exit(&t->message_pool);
+	kfree(t);
+cleanup_none:
+	return r;
+}
+
+static void user_dtr(struct dm_target *ti)
+{
+	struct target *t = target_from_target(ti);
+
+	/*
+	 * Removes the miscdev.  This must be called without the target lock
+	 * held to avoid a possible deadlock because our open implementation is
+	 * called holding the miscdev lock and must later take the target lock.
+	 *
+	 * There is no race here because only DM can register/unregister the
+	 * miscdev, and DM ensures that doesn't happen twice.  The internal
+	 * miscdev lock is sufficient to ensure there are no races between
+	 * deregistering the miscdev and open.
+	 */
+	misc_deregister(&t->miscdev);
+
+	/*
+	 * We are now free to take the target's lock and drop our reference to
+	 * the target.  There are almost certainly tasks sleeping in read on at
+	 * least one of the channels associated with this target; this
+	 * explicitly wakes them up and terminates the read.
+	 */
+	mutex_lock(&t->lock);
+	/*
+	 * No barrier here, as wait/wake ensures that the flag visibility is
+	 * correct WRT the wake/sleep state of the target tasks.
+	 */
+	t->dm_destroyed = true;
+	wake_up_all(&t->wq);
+	target_put(t);
+}
+
+/*
+ * Consumes a BIO from device mapper, queueing it up for userspace.
+ */
+static int user_map(struct dm_target *ti, struct bio *bio)
+{
+	struct target *t;
+	struct message *entry;
+
+	t = target_from_target(ti);
+	/*
+	 * FIXME
+	 *
+	 * This seems like a bad idea.  Specifically, here we're
+	 * directly on the IO path when we take the target lock, which may also
+	 * be taken from a user context.  The user context doesn't actively
+	 * trigger anything that may sleep while holding the lock, but this
+	 * still seems like a bad idea.
+	 *
+	 * The obvious way to fix this would be to use a proper queue, which
+	 * would result in no shared locks between the direct IO path and user
+	 * tasks.  I had a version that did this, but the head-of-line blocking
+	 * from the circular buffer resulted in us needing a fairly large
+	 * allocation in order to avoid situations in which the queue fills up
+	 * and everything goes off the rails.
+	 *
+	 * I could jump through some hoops to avoid a shared lock while still
+	 * allowing for a large queue, but I'm not actually sure that allowing
+	 * for very large queues is the right thing to do here.  Intuitively it
+	 * seems better to keep the queues small in here (essentially sized to
+	 * the user latency for performance reasons only) and signal up the
+	 * stack to start throttling IOs.  I don't see a way to do that
+	 * (returning DM_MAPIO_REQUEUE seems like it'd work, but doesn't do
+	 * that).
+	 *
+	 * The best way I could come up with to fix this would be to use a
+	 * two-lock concurrent queue that's of infinite size (ie, linked list
+	 * based), which would get rid of the explicit shared lock.  The
+	 * mempool spinlock would still be shared, but I could just defer the
+	 * free from dev_write to user_map (and probably a worker).
+	 */
+	mutex_lock(&t->lock);
+
+	/*
+	 * FIXME
+	 *
+	 * The assumption here is that there's no benefit to returning
+	 * DM_MAPIO_KILL as opposed to just erroring out the BIO, but I'm not
+	 * sure that's actually true -- for example, I could imagine users
+	 * expecting that submitted BIOs are unlikely to fail and therefore
+	 * relying on submission failure to indicate an unsupported type.
+	 *
+	 * There are two ways I can think of to fix this:
+	 *   - Add DM arguments that are parsed during the constructor that
+	 *     allow various dm_target flags to be set that indicate the op
+	 *     types supported by this target.  This may make sense for things
+	 *     like discard, where DM can already transform the BIOs to a form
+	 *     that's likely to be supported.
+	 *   - Some sort of pre-filter that allows userspace to hook in here
+	 *     and kill BIOs before marking them as submitted.  My guess would
+	 *     be that a userspace round trip is a bad idea here, but a BPF
+	 *     call seems reasonable.
+	 *
+	 * My guess is that we'd likely want to do both.  The first one is easy
+	 * and gives DM the proper info, so it seems better.  The BPF call
+	 * seems overly complex for just this, but one could imagine wanting to
+	 * sometimes return _MAPPED and a BPF filter would be the way to do
+	 * that.
+	 *
+	 * For example, in Android we have an in-kernel DM device called
+	 * "dm-bow" that takes advantage of some portion of the space that has
+	 * been discarded on a device to provide opportunistic block-level
+	 * backups.  While one could imagine just implementing this entirely in
+	 * userspace, that would come with an appreciable performance penalty.
+	 * Instead one could keep a BPF program that forwards most accesses
+	 * directly to the backing block device while informing a userspace
+	 * daemon of any discarded space and on writes to blocks that are to be
+	 * backed up.
+	 */
+	if (unlikely((bio_type_to_user_type(bio) < 0)
+		     || (bio_flags_to_user_flags(bio) < 0))) {
+		mutex_unlock(&t->lock);
+		pr_warn("dm-user: unsupported bio_op() %d\n", bio_op(bio));
+		return DM_MAPIO_KILL;
+	}
+
+	entry = msg_get_map(t);
+	if (unlikely(entry == NULL)) {
+		mutex_unlock(&t->lock);
+		pr_warn("dm-user: unable to allocate message\n");
+		return DM_MAPIO_KILL;
+	}
+
+	bio_get(bio);
+	entry->msg.type = bio_type_to_user_type(bio);
+	entry->msg.flags = bio_flags_to_user_flags(bio);
+	entry->msg.sector = bio->bi_iter.bi_sector;
+	entry->msg.len = bio_size(bio);
+	entry->bio = bio;
+	entry->posn_to_user = 0;
+	entry->total_to_user = bio_bytes_needed_to_user(bio);
+	entry->posn_from_user = 0;
+	entry->total_from_user = bio_bytes_needed_from_user(bio);
+	/* Pairs with the barrier in dev_read() */
+	smp_wmb();
+	list_add_tail(&entry->to_user, &t->to_user);
+	wake_up_interruptible(&t->wq);
+	mutex_unlock(&t->lock);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static struct target_type user_target = {
+	.name = "user",
+	.version = { 1, 0, 0 },
+	.module = THIS_MODULE,
+	.ctr = user_ctr,
+	.dtr = user_dtr,
+	.map = user_map,
+};
+
+static int __init dm_user_init(void)
+{
+	int r;
+
+	r = dm_register_target(&user_target);
+	if (r) {
+		DMERR("register failed %d", r);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	return r;
+}
+
+static void __exit dm_user_exit(void)
+{
+	dm_unregister_target(&user_target);
+}
+
+module_init(dm_user_init);
+module_exit(dm_user_exit);
+MODULE_AUTHOR("Palmer Dabbelt <palmerdabbelt@google.com>");
+MODULE_DESCRIPTION(DM_NAME " target returning blocks from userspace");
+MODULE_LICENSE("GPL");
-- 
2.29.2.454.gaff20da3a2-goog



* [dm-devel] [PATCH v1 3/5] dm: dm-user: New target that proxies BIOs to userspace
@ 2020-12-03 21:58   ` Palmer Dabbelt
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, linux-doc, kernel-team, Palmer Dabbelt, linux-kernel,
	linux-raid, song, linux-kselftest, shuah

From: Palmer Dabbelt <palmerdabbelt@google.com>

dm-user is a device mapper target that allows a userspace process to
handle each incoming BIO -- essentially it's FUSE, but for the block
layer.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

---

There are numerous issues with this, which I've enumerated via FIXMEs
scattered throughout the code.  While it's obviously in no shape to be
merged, this does at least function at a basic level (the next patch has
some tests).  Many of the FIXMEs are simply missing functionality, but I
wanted to send this out earlier rather than later as I have some higher
level questions:

* Does it even make sense to have this within device mapper?  There's no
  fundamental reason for this to be a device mapper target (i.e., it could
  just be its own block device), but being one does allow us to
  piggyback on existing mechanisms to handle the device lifecycle.
* Is dm-user (in cooperation with the userspace daemon) responsible for
  ordering flush-related BIOs with any other BIOs, or is that handled
  elsewhere within the kernel?
* Is my shared target mutex legal?
* Is there any benefit to returning DM_MAPIO_KILLED as opposed to later
  terminating the BIO with an IO error after it has been submitted?

Each of the above is discussed in more detail in the code.
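For reviewers who want to poke at the protocol without the Android daemon,
here is roughly what the userspace side of one READ looks like.  This is an
illustrative sketch only: the struct layout and constant values below are
stand-ins (the authoritative definitions live in the uapi header added
earlier in this series), and the miscdev read()/write() calls are replaced
by in-memory buffers so the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative stand-in for struct dm_user_message: the field names follow
 * the driver's use of msg.seq/msg.type/msg.flags, but the sector/len fields
 * and the numeric constant values are assumptions, not the real uapi
 * definitions.
 */
struct dm_user_message {
	uint64_t seq;
	uint64_t type;
	uint64_t flags;
	uint64_t sector;
	uint64_t len;
	uint8_t buf[];
};

enum {
	DM_USER_REQ_MAP_READ = 0,	/* assumed values */
	DM_USER_REQ_MAP_WRITE = 1,
	DM_USER_RESP_SUCCESS = 0,
	DM_USER_RESP_ERROR = 1,
};

/*
 * Handle one READ request against a flat in-memory backing store,
 * producing the response the daemon would write() back to the miscdev.
 * Returns the number of response bytes.  Responses carry the request's
 * sequence number so the kernel can match them to the outstanding BIO;
 * they may be written back out of order.
 */
static size_t serve_read(const struct dm_user_message *req,
			 const uint8_t *store,
			 struct dm_user_message *resp)
{
	resp->seq = req->seq;
	resp->type = DM_USER_RESP_SUCCESS;
	resp->flags = 0;
	resp->sector = req->sector;
	resp->len = req->len;
	memcpy(resp->buf, store + req->sector * 512, req->len);
	return sizeof(*resp) + req->len;
}
```

In the real daemon the request comes from read()ing /dev/dm-user/<name> and
the response goes back via write() on the same fd, possibly split across
several calls, since the channel code reassembles partial reads and writes.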
---
 drivers/md/Kconfig   |   13 +
 drivers/md/Makefile  |    1 +
 drivers/md/dm-user.c | 1227 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1241 insertions(+)
 create mode 100644 drivers/md/dm-user.c
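One stylistic note on the patch below: bio_flags_to_user_flags() clears each
flag bit with a hand-written if block, which makes it easy for an entry to
be duplicated or missed as new REQ_* flags appear.  The same mapping can be
written table-driven; this is just a sketch of the idea, with stand-in bit
values (the real REQ_* and DM_USER_REQ_MAP_FLAG_* definitions come from the
kernel and uapi headers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in bit values; the real constants come from linux/blk_types.h and
 * the uapi dm-user header. */
#define REQ_SYNC			(1u << 0)
#define REQ_META			(1u << 1)
#define REQ_FUA				(1u << 2)
#define REQ_PREFLUSH			(1u << 3)

#define DM_USER_REQ_MAP_FLAG_SYNC	(1ull << 0)
#define DM_USER_REQ_MAP_FLAG_META	(1ull << 1)
#define DM_USER_REQ_MAP_FLAG_FUA	(1ull << 2)
#define DM_USER_REQ_MAP_FLAG_PREFLUSH	(1ull << 3)

static const struct {
	uint32_t kernel;	/* REQ_* bit */
	uint64_t user;		/* DM_USER_REQ_MAP_FLAG_* bit */
} flag_map[] = {
	{ REQ_SYNC,	DM_USER_REQ_MAP_FLAG_SYNC },
	{ REQ_META,	DM_USER_REQ_MAP_FLAG_META },
	{ REQ_FUA,	DM_USER_REQ_MAP_FLAG_FUA },
	{ REQ_PREFLUSH,	DM_USER_REQ_MAP_FLAG_PREFLUSH },
};

/*
 * Translate kernel flag bits into the userspace ABI bits.  Returns -1 if
 * any bit is left over, mirroring the patch's -EOPNOTSUPP path.  Each
 * table entry is consulted exactly once, so a duplicated clause can't
 * silently creep in.
 */
static int64_t flags_to_user_flags(uint32_t opf)
{
	uint64_t out = 0;

	for (size_t i = 0; i < sizeof(flag_map) / sizeof(flag_map[0]); i++) {
		if (opf & flag_map[i].kernel) {
			opf &= ~flag_map[i].kernel;
			out |= flag_map[i].user;
		}
	}
	return opf ? -1 : (int64_t)out;
}
```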

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 30ba3573626c..bcafca0e571d 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -617,4 +617,17 @@ config DM_ZONED
 
 	  If unsure, say N.
 
+config DM_USER
+	tristate "Block device in userspace"
+	depends on BLK_DEV_DM
+	help
+	  This device-mapper target allows a userspace daemon to provide the
+	  contents of a block device.  See
+	  <file:Documentation/block/dm-user.rst> for more information.
+
+	  To compile this code as a module, choose M here: the module will be
+	  called dm-user.
+
+	  If unsure, say N.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 6d3e234dc46a..82ae3d496a00 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
 obj-$(CONFIG_DM_UNSTRIPED)	+= dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO)		+= dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON)	+= dm-bio-prison.o
+obj-$(CONFIG_DM_USER)		+= dm-user.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
diff --git a/drivers/md/dm-user.c b/drivers/md/dm-user.c
new file mode 100644
index 000000000000..0aaa8f39f18a
--- /dev/null
+++ b/drivers/md/dm-user.c
@@ -0,0 +1,1227 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2020 Palmer Dabbelt <palmerdabbelt@google.com>
+ */
+
+#include <linux/device-mapper.h>
+#include <uapi/linux/dm-user.h>
+
+#include <linux/bio.h>
+#include <linux/init.h>
+#include <linux/mempool.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/uio.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "user"
+
+#define MAX_OUTSTANDING_MESSAGES 128
+
+/*
+ * dm-user uses four structures:
+ *
+ *  - "struct target", the outermost structure, corresponds to a single device
+ *    mapper target.  This contains the set of outstanding BIOs that have been
+ *    provided by DM and are not actively being processed by the user, along
+ *    with a misc device that userspace can open to communicate with the
+ *    kernel.  Each time userspace opens the misc device a new channel is
+ *    created.
+ *  - "struct channel", which represents a single active communication channel
+ *    with userspace.  Userspace may choose arbitrary read/write sizes to use
+ *    when processing messages, channels form these into logical accesses.
+ *    When userspace responds to a full message the channel completes the BIO
+ *    and obtains a new message to process from the target.
+ *  - "struct message", which wraps a BIO with the additional information
+ *    required by the kernel to sort out what to do with BIOs when they return
+ *    from userspace.
+ *  - "struct dm_user_message", which is the exact message format that
+ *    userspace sees.
+ *
+ * The hot path contains three distinct operations:
+ *
+ *  - user_map(), which is provided a BIO from device mapper that is queued
+ *    into the target.  This allocates and enqueues a new message.
+ *  - dev_read(), which dequeues a message and copies it to userspace.
+ *  - dev_write(), which looks up a message (keyed by sequence number) and
+ *    completes the corresponding BIO.
+ *
+ * Lock ordering (outer to inner)
+ *
+ * 1) miscdevice's global lock.  This is held around dev_open, so it has to be
+ *    the outermost lock.
+ * 2) target->lock
+ * 3) channel->lock
+ */
+
+struct message {
+	/*
+	 * Messages themselves do not need a lock, they're protected by either
+	 * the target or channel's lock, depending on which can reference them
+	 * directly.
+	 */
+	struct dm_user_message msg;
+	struct bio *bio;
+	size_t posn_to_user;
+	size_t total_to_user;
+	size_t posn_from_user;
+	size_t total_from_user;
+
+	struct list_head from_user;
+	struct list_head to_user;
+
+	/*
+	 * These are written back from the user.  They live in the same spot in
+	 * the message, but we need to either keep the old values around or
+	 * call a bunch more BIO helpers.  These are only valid after write has
+	 * adopted the message.
+	 */
+	u64 return_type;
+	u64 return_flags;
+};
+
+struct target {
+	/*
+	 * A target has a single lock, which protects everything in the target
+	 * (but does not protect the channels associated with a target).
+	 */
+	struct mutex lock;
+
+	/*
+	 * There is only one point at which anything blocks: userspace blocks
+	 * reading a new message, which is woken up by device mapper providing
+	 * a new BIO to process (or tearing down the target).  The
+	 * corresponding write side doesn't block; instead we treat userspace's
+	 * response containing a message that has yet to be mapped as an
+	 * invalid operation.
+	 */
+	struct wait_queue_head wq;
+
+	/*
+	 * Messages are delivered to userspace in order, but may be returned
+	 * out of order.  This allows userspace to schedule IO if it wants to.
+	 */
+	mempool_t message_pool;
+	u64 next_seq_to_map;
+	u64 next_seq_to_user;
+	struct list_head to_user;
+
+	/*
+	 * There is a misc device per target.  The name is selected by
+	 * userspace (via a DM create ioctl argument), and each ends up in
+	 * /dev/dm-user/.  It looks like a better way to do this may be to have
+	 * a filesystem to manage these, but this was more expedient.  The
+	 * current mechanism is functional, but does result in an arbitrary
+	 * number of dynamically created misc devices.
+	 */
+	struct miscdevice miscdev;
+
+	/*
+	 * Device mapper's target destructor triggers tearing this all down,
+	 * but we can't actually free until every channel associated with this
+	 * target has been destroyed.  Channels each have a reference to their
+	 * target, and there is an additional single reference that corresponds
+	 * to both DM and the misc device (both of which are destroyed by DM).
+	 *
+	 * In the common case userspace will be asleep waiting for a new
+	 * message when device mapper decides to destroy the target, which
+	 * means no new messages will appear.  The destroyed flag triggers a
+	 * wakeup, which will end up removing the reference.
+	 */
+	struct kref references;
+	int dm_destroyed;
+};
+
+struct channel {
+	struct target *target;
+
+	/*
+	 * A channel has a single lock, which prevents multiple reads (or
+	 * multiple writes) from conflicting with each other.
+	 */
+	struct mutex lock;
+
+	struct message *cur_to_user;
+	struct message *cur_from_user;
+	ssize_t to_user_error;
+	ssize_t from_user_error;
+
+	/*
+	 * Once a message has been forwarded to userspace on a channel it must
+	 * be responded to on the same channel.  This allows us to error out
+	 * the messages that have not yet been responded to by a channel when
+	 * that channel closes, which makes handling errors more reasonable for
+	 * fault-tolerant userspace daemons.  It also happens to make avoiding
+	 * shared locks between user_map() and dev_read() a lot easier.
+	 *
+	 * This does preclude a multi-threaded work stealing userspace
+	 * implementation (or at least, force a degree of head-of-line blocking
+	 * on the response path).
+	 */
+	struct list_head from_user;
+
+	/*
+	 * Responses from userspace can arrive in arbitrarily small chunks.
+	 * We need some place to buffer one up until we can find the
+	 * corresponding kernel-side message to continue processing, so instead
+	 * of allocating them we just keep one off to the side here.  This can
+	 * only ever be pointed to by cur_from_user, and will never have a BIO.
+	 */
+	struct message scratch_message_from_user;
+};
+
+static inline struct target *target_from_target(struct dm_target *target)
+{
+	WARN_ON(target->private == NULL);
+	return target->private;
+}
+
+static inline struct target *target_from_miscdev(struct miscdevice *miscdev)
+{
+	return container_of(miscdev, struct target, miscdev);
+}
+
+static inline struct channel *channel_from_file(struct file *file)
+{
+	WARN_ON(file->private_data == NULL);
+	return file->private_data;
+}
+
+static inline struct target *target_from_channel(struct channel *c)
+{
+	WARN_ON(c->target == NULL);
+	return c->target;
+}
+
+static inline size_t bio_size(struct bio *bio)
+{
+	struct bio_vec bvec;
+	struct bvec_iter iter;
+	size_t out = 0;
+
+	bio_for_each_segment(bvec, bio, iter)
+		out += bio_iter_len(bio, iter);
+	return out;
+}
+
+static inline size_t bio_bytes_needed_to_user(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_WRITE:
+		return sizeof(struct dm_user_message) + bio_size(bio);
+	case REQ_OP_READ:
+	case REQ_OP_FLUSH:
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_APPEND:
+	case REQ_OP_ZONE_RESET:
+		return sizeof(struct dm_user_message);
+
+	/*
+	 * These ops are not passed to userspace under the assumption that
+	 * they're not going to be particularly useful in that context.
+	 */
+	case REQ_OP_SCSI_IN:
+	case REQ_OP_SCSI_OUT:
+	case REQ_OP_DRV_IN:
+	case REQ_OP_DRV_OUT:
+	/* Anything new isn't supported, at least not yet. */
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static inline size_t bio_bytes_needed_from_user(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_READ:
+		return sizeof(struct dm_user_message) + bio_size(bio);
+	case REQ_OP_WRITE:
+	case REQ_OP_FLUSH:
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ZONE_OPEN:
+	case REQ_OP_ZONE_CLOSE:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_APPEND:
+	case REQ_OP_ZONE_RESET:
+		return sizeof(struct dm_user_message);
+
+	/*
+	 * These ops are not passed to userspace under the assumption that
+	 * they're not going to be particularly useful in that context.
+	 */
+	case REQ_OP_SCSI_IN:
+	case REQ_OP_SCSI_OUT:
+	case REQ_OP_DRV_IN:
+	case REQ_OP_DRV_OUT:
+	/* Anything new isn't supported, at least not yet. */
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static inline long bio_type_to_user_type(struct bio *bio)
+{
+	switch (bio_op(bio)) {
+	case REQ_OP_READ:
+		return DM_USER_REQ_MAP_READ;
+	case REQ_OP_WRITE:
+		return DM_USER_REQ_MAP_WRITE;
+	case REQ_OP_FLUSH:
+		return DM_USER_REQ_MAP_FLUSH;
+	case REQ_OP_DISCARD:
+		return DM_USER_REQ_MAP_DISCARD;
+	case REQ_OP_SECURE_ERASE:
+		return DM_USER_REQ_MAP_SECURE_ERASE;
+	case REQ_OP_WRITE_SAME:
+		return DM_USER_REQ_MAP_WRITE_SAME;
+	case REQ_OP_WRITE_ZEROES:
+		return DM_USER_REQ_MAP_WRITE_ZEROES;
+	case REQ_OP_ZONE_OPEN:
+		return DM_USER_REQ_MAP_ZONE_OPEN;
+	case REQ_OP_ZONE_CLOSE:
+		return DM_USER_REQ_MAP_ZONE_CLOSE;
+	case REQ_OP_ZONE_FINISH:
+		return DM_USER_REQ_MAP_ZONE_FINISH;
+	case REQ_OP_ZONE_APPEND:
+		return DM_USER_REQ_MAP_ZONE_APPEND;
+	case REQ_OP_ZONE_RESET:
+		return DM_USER_REQ_MAP_ZONE_RESET;
+
+	/*
+	 * These ops are not passed to userspace under the assumption that
+	 * they're not going to be particularly useful in that context.
+	 */
+	case REQ_OP_SCSI_IN:
+	case REQ_OP_SCSI_OUT:
+	case REQ_OP_DRV_IN:
+	case REQ_OP_DRV_OUT:
+	/* Anything new isn't supported, at least not yet. */
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static inline long bio_flags_to_user_flags(struct bio *bio)
+{
+	u64 out = 0;
+	typeof(bio->bi_opf) opf = bio->bi_opf & ~REQ_OP_MASK;
+
+	if (opf & REQ_FAILFAST_DEV) {
+		opf &= ~REQ_FAILFAST_DEV;
+		out |= DM_USER_REQ_MAP_FLAG_FAILFAST_DEV;
+	}
+
+	if (opf & REQ_FAILFAST_TRANSPORT) {
+		opf &= ~REQ_FAILFAST_TRANSPORT;
+		out |= DM_USER_REQ_MAP_FLAG_FAILFAST_TRANSPORT;
+	}
+
+	if (opf & REQ_FAILFAST_DRIVER) {
+		opf &= ~REQ_FAILFAST_DRIVER;
+		out |= DM_USER_REQ_MAP_FLAG_FAILFAST_DRIVER;
+	}
+
+	if (opf & REQ_SYNC) {
+		opf &= ~REQ_SYNC;
+		out |= DM_USER_REQ_MAP_FLAG_SYNC;
+	}
+
+	if (opf & REQ_META) {
+		opf &= ~REQ_META;
+		out |= DM_USER_REQ_MAP_FLAG_META;
+	}
+
+	if (opf & REQ_PRIO) {
+		opf &= ~REQ_PRIO;
+		out |= DM_USER_REQ_MAP_FLAG_PRIO;
+	}
+
+	if (opf & REQ_NOMERGE) {
+		opf &= ~REQ_NOMERGE;
+		out |= DM_USER_REQ_MAP_FLAG_NOMERGE;
+	}
+
+	if (opf & REQ_IDLE) {
+		opf &= ~REQ_IDLE;
+		out |= DM_USER_REQ_MAP_FLAG_IDLE;
+	}
+
+	if (opf & REQ_INTEGRITY) {
+		opf &= ~REQ_INTEGRITY;
+		out |= DM_USER_REQ_MAP_FLAG_INTEGRITY;
+	}
+
+	if (opf & REQ_FUA) {
+		opf &= ~REQ_FUA;
+		out |= DM_USER_REQ_MAP_FLAG_FUA;
+	}
+
+	if (opf & REQ_PREFLUSH) {
+		opf &= ~REQ_PREFLUSH;
+		out |= DM_USER_REQ_MAP_FLAG_PREFLUSH;
+	}
+
+	if (opf & REQ_RAHEAD) {
+		opf &= ~REQ_RAHEAD;
+		out |= DM_USER_REQ_MAP_FLAG_RAHEAD;
+	}
+
+	if (opf & REQ_BACKGROUND) {
+		opf &= ~REQ_BACKGROUND;
+		out |= DM_USER_REQ_MAP_FLAG_BACKGROUND;
+	}
+
+	if (opf & REQ_NOWAIT) {
+		opf &= ~REQ_NOWAIT;
+		out |= DM_USER_REQ_MAP_FLAG_NOWAIT;
+	}
+
+	if (opf & REQ_CGROUP_PUNT) {
+		opf &= ~REQ_CGROUP_PUNT;
+		out |= DM_USER_REQ_MAP_FLAG_CGROUP_PUNT;
+	}
+
+	if (opf & REQ_NOUNMAP) {
+		opf &= ~REQ_NOUNMAP;
+		out |= DM_USER_REQ_MAP_FLAG_NOUNMAP;
+	}
+
+	if (opf & REQ_HIPRI) {
+		opf &= ~REQ_HIPRI;
+		out |= DM_USER_REQ_MAP_FLAG_HIPRI;
+	}
+
+	if (unlikely(opf)) {
+		pr_warn("unsupported BIO flags %x\n", opf);
+		return -EOPNOTSUPP;
+	}
+	return out;
+}
+
+/*
+ * Not quite what's in blk-map.c, but instead what I thought the functions in
+ * blk-map did.  This one seems more generally useful and I think we could
+ * write the blk-map version in terms of this one.  The differences are that
+ * this has a return value that counts, and blk-map uses the BIO _all iters.
+ * Neither version advances the BIO iter, though both advance the IOV iter,
+ * which is a bit odd here.
+ */
+static ssize_t bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
+{
+	struct bio_vec bvec;
+	struct bvec_iter biter;
+	ssize_t out = 0;
+
+	bio_for_each_segment(bvec, bio, biter) {
+		ssize_t ret;
+
+		ret = copy_page_from_iter(bvec.bv_page, bvec.bv_offset,
+					  bvec.bv_len, iter);
+
+		/*
+		 * FIXME: I thought that IOV copies had a mechanism for
+		 * terminating early, if for example a signal came in while
+		 * sleeping waiting for a page to be mapped, but I don't see
+		 * where that would happen.
+		 */
+		WARN_ON(ret < 0);
+		out += ret;
+
+		if (!iov_iter_count(iter))
+			break;
+
+		if (ret < bvec.bv_len)
+			return ret;
+	}
+
+	return out;
+}
+
+static ssize_t bio_copy_to_iter(struct bio *bio, struct iov_iter *iter)
+{
+	struct bio_vec bvec;
+	struct bvec_iter biter;
+	ssize_t out = 0;
+
+	bio_for_each_segment(bvec, bio, biter) {
+		ssize_t ret;
+
+		ret = copy_page_to_iter(bvec.bv_page, bvec.bv_offset,
+					bvec.bv_len, iter);
+
+		/* as above */
+		WARN_ON(ret < 0);
+		out += ret;
+
+		if (!iov_iter_count(iter))
+			break;
+
+		if (ret < bvec.bv_len)
+			return ret;
+	}
+
+	return out;
+}
+
+static ssize_t msg_copy_to_iov(struct message *msg, struct iov_iter *to)
+{
+	ssize_t copied = 0;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	if (msg->posn_to_user < sizeof(msg->msg)) {
+		copied = copy_to_iter((char *)(&msg->msg) + msg->posn_to_user,
+				      sizeof(msg->msg) - msg->posn_to_user, to);
+	} else {
+		copied = bio_copy_to_iter(msg->bio, to);
+		if (copied > 0)
+			bio_advance(msg->bio, copied);
+	}
+
+	if (copied < 0)
+		return copied;
+
+	msg->posn_to_user += copied;
+	return copied;
+}
+
+static ssize_t msg_copy_from_iov(struct message *msg, struct iov_iter *from)
+{
+	ssize_t copied = 0;
+
+	if (!iov_iter_count(from))
+		return 0;
+
+	if (msg->posn_from_user < sizeof(msg->msg)) {
+		copied = copy_from_iter(
+			(char *)(&msg->msg) + msg->posn_from_user,
+			sizeof(msg->msg) - msg->posn_from_user, from);
+	} else {
+		copied = bio_copy_from_iter(msg->bio, from);
+		if (copied > 0)
+			bio_advance(msg->bio, copied);
+	}
+
+	if (copied < 0)
+		return copied;
+
+	msg->posn_from_user += copied;
+	return copied;
+}
+
+static struct message *msg_get_map(struct target *t)
+{
+	struct message *m;
+
+	lockdep_assert_held(&t->lock);
+
+	m = mempool_alloc(&t->message_pool, GFP_NOIO);
+	m->msg.seq = t->next_seq_to_map++;
+	INIT_LIST_HEAD(&m->to_user);
+	INIT_LIST_HEAD(&m->from_user);
+	return m;
+}
+
+static struct message *msg_get_to_user(struct target *t)
+{
+	struct message *m;
+
+	lockdep_assert_held(&t->lock);
+
+	if (list_empty(&t->to_user))
+		return NULL;
+
+	m = list_first_entry(&t->to_user, struct message, to_user);
+	list_del(&m->to_user);
+	return m;
+}
+
+static struct message *msg_get_from_user(struct channel *c, u64 seq)
+{
+	struct message *m;
+	struct list_head *cur;
+
+	lockdep_assert_held(&c->lock);
+
+	list_for_each(cur, &c->from_user) {
+		m = list_entry(cur, struct message, from_user);
+		if (m->msg.seq == seq) {
+			list_del(&m->from_user);
+			return m;
+		}
+	}
+
+	return NULL;
+}
+
+static void message_kill(struct message *m, mempool_t *pool)
+{
+	m->bio->bi_status = BLK_STS_IOERR;
+	bio_endio(m->bio);
+	bio_put(m->bio);
+	mempool_free(m, pool);
+}
+
+/*
+ * Returns 0 when there is no work left to do.  This must be callable without
+ * holding the target lock, as it is part of the waitqueue's check expression.
+ * When called without the lock it may spuriously indicate there is remaining
+ * work, but when called with the lock it must be accurate.
+ */
+static int target_poll(struct target *t)
+{
+	return !list_empty(&t->to_user) || t->dm_destroyed;
+}
+
+static void target_release(struct kref *ref)
+{
+	struct target *t = container_of(ref, struct target, references);
+	struct list_head *cur, *tmp;
+
+	/*
+	 * There may be outstanding BIOs that have not yet been given to
+	 * userspace.  At this point there's nothing we can do about them, as
+	 * there are and will never be any channels.  message_kill() frees each
+	 * entry, so walk the list with the _safe iterator.
+	 */
+	list_for_each_safe(cur, tmp, &t->to_user) {
+		message_kill(list_entry(cur, struct message, to_user),
+			     &t->message_pool);
+	}
+
+	mempool_exit(&t->message_pool);
+	mutex_unlock(&t->lock);
+	mutex_destroy(&t->lock);
+	kfree(t);
+}
+
+static void target_put(struct target *t)
+{
+	/*
+	 * This both releases a reference to the target and the lock.  We leave
+	 * it up to the caller to hold the lock, as they probably needed it for
+	 * something else.
+	 */
+	lockdep_assert_held(&t->lock);
+
+	if (!kref_put(&t->references, target_release))
+		mutex_unlock(&t->lock);
+}
+
+static struct channel *channel_alloc(struct target *t)
+{
+	struct channel *c;
+
+	lockdep_assert_held(&t->lock);
+
+	c = kzalloc(sizeof(*c), GFP_KERNEL);
+	if (c == NULL)
+		return NULL;
+
+	kref_get(&t->references);
+	c->target = t;
+	c->cur_from_user = &c->scratch_message_from_user;
+	mutex_init(&c->lock);
+	INIT_LIST_HEAD(&c->from_user);
+	return c;
+}
+
+static void channel_free(struct channel *c)
+{
+	struct list_head *cur, *tmp;
+
+	lockdep_assert_held(&c->lock);
+
+	/*
+	 * There may be outstanding BIOs that have been given to userspace but
+	 * have not yet been completed.  The channel has been shut down so
+	 * there's no way to process the rest of those messages, so we just go
+	 * ahead and error out the BIOs.  Hopefully whatever's on the other end
+	 * can handle the errors.  One could imagine splitting the BIOs and
+	 * completing as much as we got, but that seems like overkill here.
+	 *
+	 * Our only other options would be to let the BIO hang around (which
+	 * seems way worse) or to resubmit it to userspace in the hope there's
+	 * another channel.  I don't really like the idea of submitting a
+	 * message twice.
+	 */
+	if (c->cur_to_user != NULL)
+		message_kill(c->cur_to_user, &c->target->message_pool);
+	if (c->cur_from_user != &c->scratch_message_from_user)
+		message_kill(c->cur_from_user, &c->target->message_pool);
+	list_for_each_safe(cur, tmp, &c->from_user)
+		message_kill(list_entry(cur, struct message, from_user),
+			     &c->target->message_pool);
+
+	mutex_lock(&c->target->lock);
+	target_put(c->target);
+	mutex_unlock(&c->lock);
+	mutex_destroy(&c->lock);
+	kfree(c);
+}
+
+static int dev_open(struct inode *inode, struct file *file)
+{
+	struct channel *c;
+	struct target *t;
+
+	/*
+	 * This is called by miscdev, which sets private_data to point to the
+	 * struct miscdevice that was opened.  The rest of our file operations
+	 * want to refer to the channel that's been opened, so we swap that
+	 * pointer out with a fresh channel.
+	 *
+	 * This is called with the miscdev lock held, which is also held while
+	 * registering/unregistering the miscdev.  The miscdev must be
+	 * registered for this to get called, which means there must be an
+	 * outstanding reference to the target, which means it cannot be freed
+	 * out from under us despite us not holding a reference yet.
+	 */
+	t = container_of(file->private_data, struct target, miscdev);
+	mutex_lock(&t->lock);
+	file->private_data = c = channel_alloc(t);
+
+	if (c == NULL) {
+		mutex_unlock(&t->lock);
+		return -ENOMEM;
+	}
+
+	mutex_unlock(&t->lock);
+	return 0;
+}
+
+static ssize_t dev_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct channel *c = channel_from_file(iocb->ki_filp);
+	ssize_t total_processed = 0;
+	ssize_t processed;
+
+	mutex_lock(&c->lock);
+
+	if (unlikely(c->to_user_error)) {
+		total_processed = c->to_user_error;
+		goto cleanup_unlock;
+	}
+
+	if (c->cur_to_user == NULL) {
+		struct target *t = target_from_channel(c);
+
+		mutex_lock(&t->lock);
+
+		while (!target_poll(t)) {
+			int e;
+
+			mutex_unlock(&t->lock);
+			mutex_unlock(&c->lock);
+			e = wait_event_interruptible(t->wq, target_poll(t));
+			mutex_lock(&c->lock);
+			mutex_lock(&t->lock);
+
+			if (unlikely(e != 0)) {
+				/*
+				 * We haven't processed any bytes in either the
+				 * BIO or the IOV, so we can just terminate
+				 * right now.  Code elsewhere in the kernel
+				 * handles restarting the syscall when
+				 * appropriate.
+				 */
+				total_processed = e;
+				mutex_unlock(&t->lock);
+				goto cleanup_unlock;
+			}
+		}
+
+		if (unlikely(t->dm_destroyed)) {
+			/*
+			 * DM has destroyed this target, so just lock
+			 * the user out.  There's really nothing else
+			 * we can do here.  Note that we don't actually
+			 * tear anything down until userspace has
+			 * closed the FD, as there may still be
+			 * outstanding BIOs.
+			 *
+			 * This is kind of a wacky error code to
+			 * return.  My goal was really just to try and
+			 * find something that wasn't likely to be
+			 * returned by anything else in the miscdev
+			 * path.  The message "block device required"
+			 * seems like a somewhat reasonable thing to
+			 * say when the target has disappeared out from
+			 * under us, but "not block" isn't sensible.
+			 */
+			c->to_user_error = total_processed = -ENOTBLK;
+			mutex_unlock(&t->lock);
+			goto cleanup_unlock;
+		}
+
+		/*
+		 * Ensures that accesses to the message data are not ordered
+		 * before the remote accesses that produce that message data.
+		 *
+		 * This pairs with the barrier in user_map(), via the
+		 * conditional within the while loop above. Also see the lack
+		 * of barrier in user_dtr(), which is why this can be after the
+		 * destroyed check.
+		 */
+		smp_rmb();
+
+		c->cur_to_user = msg_get_to_user(t);
+		WARN_ON(c->cur_to_user == NULL);
+		mutex_unlock(&t->lock);
+	}
+
+	processed = msg_copy_to_iov(c->cur_to_user, to);
+	total_processed += processed;
+
+	WARN_ON(c->cur_to_user->posn_to_user > c->cur_to_user->total_to_user);
+	if (c->cur_to_user->posn_to_user == c->cur_to_user->total_to_user) {
+		struct message *m = c->cur_to_user;
+
+		c->cur_to_user = NULL;
+		list_add_tail(&m->from_user, &c->from_user);
+	}
+
+cleanup_unlock:
+	mutex_unlock(&c->lock);
+	return total_processed;
+}
+
+static ssize_t dev_splice_read(struct file *in, loff_t *ppos,
+			       struct pipe_inode_info *pipe, size_t len,
+			       unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static ssize_t dev_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct channel *c = channel_from_file(iocb->ki_filp);
+	ssize_t total_processed = 0;
+	ssize_t processed;
+
+	mutex_lock(&c->lock);
+
+	if (unlikely(c->from_user_error)) {
+		total_processed = c->from_user_error;
+		goto cleanup_unlock;
+	}
+
+	/*
+	 * cur_from_user can never be NULL.  If there's no real message it must
+	 * point to the scratch space.
+	 */
+	WARN_ON(c->cur_from_user == NULL);
+	if (c->cur_from_user->posn_from_user < sizeof(struct dm_user_message)) {
+		struct message *msg, *old;
+
+		processed = msg_copy_from_iov(c->cur_from_user, from);
+		if (processed <= 0) {
+			pr_warn("msg_copy_from_iov() returned %zd\n",
+				processed);
+			c->from_user_error = -EINVAL;
+			goto cleanup_unlock;
+		}
+		total_processed += processed;
+
+		/*
+		 * In the unlikely event the user has provided us a very short
+		 * write, not even big enough to fill a message, just succeed.
+		 * We'll eventually build up enough bytes to do something.
+		 */
+		if (unlikely(c->cur_from_user->posn_from_user <
+			     sizeof(struct dm_user_message)))
+			goto cleanup_unlock;
+
+		old = c->cur_from_user;
+		mutex_lock(&c->target->lock);
+		msg = msg_get_from_user(c, c->cur_from_user->msg.seq);
+		if (msg == NULL) {
+			pr_info("user provided an invalid message seq of %llx\n",
+				old->msg.seq);
+			mutex_unlock(&c->target->lock);
+			c->from_user_error = -EINVAL;
+			goto cleanup_unlock;
+		}
+		mutex_unlock(&c->target->lock);
+
+		WARN_ON(old->posn_from_user != sizeof(struct dm_user_message));
+		msg->posn_from_user = sizeof(struct dm_user_message);
+		msg->return_type = old->msg.type;
+		msg->return_flags = old->msg.flags;
+		WARN_ON(msg->posn_from_user > msg->total_from_user);
+		c->cur_from_user = msg;
+		WARN_ON(old != &c->scratch_message_from_user);
+	}
+
+	/*
+	 * Userspace can signal an error for single requests by overwriting the
+	 * seq field.
+	 */
+	switch (c->cur_from_user->return_type) {
+	case DM_USER_RESP_SUCCESS:
+		c->cur_from_user->bio->bi_status = BLK_STS_OK;
+		break;
+	case DM_USER_RESP_ERROR:
+	case DM_USER_RESP_UNSUPPORTED:
+	default:
+		c->cur_from_user->bio->bi_status = BLK_STS_IOERR;
+		goto finish_bio;
+	}
+
+	/*
+	 * The op was a success as far as userspace is concerned, so process
+	 * whatever data may come along with it.  The user may provide the BIO
+	 * data in multiple chunks, in which case we don't need to finish the
+	 * BIO.
+	 */
+	processed = msg_copy_from_iov(c->cur_from_user, from);
+	total_processed += processed;
+
+	if (c->cur_from_user->posn_from_user <
+	    c->cur_from_user->total_from_user)
+		goto cleanup_unlock;
+
+finish_bio:
+	/*
+	 * When we set up this message the BIO's size matched the
+	 * message size, if that's not still the case then something
+	 * has gone off the rails.
+	 */
+	WARN_ON(bio_size(c->cur_from_user->bio) != 0);
+	bio_endio(c->cur_from_user->bio);
+	bio_put(c->cur_from_user->bio);
+
+	/*
+	 * We don't actually need to take the target lock here, as all
+	 * we're doing is freeing the message and mempools have their
+	 * own lock.  Each channel has its own scratch message.
+	 */
+	WARN_ON(c->cur_from_user == &c->scratch_message_from_user);
+	mempool_free(c->cur_from_user, &c->target->message_pool);
+	c->scratch_message_from_user.posn_from_user = 0;
+	c->cur_from_user = &c->scratch_message_from_user;
+
+cleanup_unlock:
+	mutex_unlock(&c->lock);
+	return total_processed;
+}
+
+static ssize_t dev_splice_write(struct pipe_inode_info *pipe, struct file *out,
+				loff_t *ppos, size_t len, unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static __poll_t dev_poll(struct file *file, poll_table *wait)
+{
+	return -EOPNOTSUPP;
+}
+
+static int dev_release(struct inode *inode, struct file *file)
+{
+	struct channel *c;
+
+	c = channel_from_file(file);
+	mutex_lock(&c->lock);
+	channel_free(c);
+
+	return 0;
+}
+
+static int dev_fasync(int fd, struct file *file, int on)
+{
+	return -EOPNOTSUPP;
+}
+
+static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct file_operations file_operations = {
+	.owner = THIS_MODULE,
+	.open = dev_open,
+	.llseek = no_llseek,
+	.read_iter = dev_read,
+	.splice_read = dev_splice_read,
+	.write_iter = dev_write,
+	.splice_write = dev_splice_write,
+	.poll = dev_poll,
+	.release = dev_release,
+	.fasync = dev_fasync,
+	.unlocked_ioctl = dev_ioctl,
+};
+
+static int user_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct target *t;
+	int r;
+
+	if (argc != 3) {
+		ti->error = "Invalid argument count";
+		r = -EINVAL;
+		goto cleanup_none;
+	}
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL);
+	if (t == NULL) {
+		r = -ENOMEM;
+		goto cleanup_none;
+	}
+	ti->private = t;
+
+	/*
+	 * We begin with a single reference to the target, which is miscdev's
+	 * reference.  This ensures that the target won't be freed
+	 * until after the miscdev has been unregistered and all extant
+	 * channels have been closed.
+	 */
+	kref_init(&t->references);
+	kref_get(&t->references);
+
+	mutex_init(&t->lock);
+	init_waitqueue_head(&t->wq);
+	INIT_LIST_HEAD(&t->to_user);
+	mempool_init_kmalloc_pool(&t->message_pool, MAX_OUTSTANDING_MESSAGES,
+				  sizeof(struct message));
+
+	t->miscdev.minor = MISC_DYNAMIC_MINOR;
+	t->miscdev.fops = &file_operations;
+	t->miscdev.name = kasprintf(GFP_KERNEL, "dm-user/%s", argv[2]);
+	if (t->miscdev.name == NULL) {
+		r = -ENOMEM;
+		goto cleanup_message_pool;
+	}
+
+	/*
+	 * Once the miscdev is registered it can be opened and therefore
+	 * concurrent references to the channel can happen.  Holding the target
+	 * lock during misc_register() could deadlock.  If registration
+	 * succeeds then we will not access the target again so we just stick a
+	 * barrier here, which pairs with taking the target lock everywhere
+	 * else the target is accessed.
+	 *
+	 * I forgot where we ended up on the RCpc/RCsc locks.  IIUC, RCsc locks
+	 * would mean that we could take the target lock earlier and release it
+	 * here instead of the memory barrier.  I'm not sure that's any better,
+	 * though, and this isn't on a hot path so it probably doesn't matter
+	 * either way.
+	 */
+	smp_mb();
+
+	r = misc_register(&t->miscdev);
+	if (r) {
+		DMERR("Unable to register miscdev %s for dm-user",
+		      t->miscdev.name);
+		goto cleanup_misc_name;
+	}
+
+	return 0;
+
+cleanup_misc_name:
+	kfree(t->miscdev.name);
+cleanup_message_pool:
+	mempool_exit(&t->message_pool);
+	kfree(t);
+cleanup_none:
+	return r;
+}
+
+static void user_dtr(struct dm_target *ti)
+{
+	struct target *t = target_from_target(ti);
+
+	/*
+	 * Removes the miscdev.  This must be called without the target lock
+	 * held to avoid a possible deadlock because our open implementation is
+	 * called holding the miscdev lock and must later take the target lock.
+	 *
+	 * There is no race here because only DM can register/unregister the
+	 * miscdev, and DM ensures that doesn't happen twice.  The internal
+	 * miscdev lock is sufficient to ensure there are no races between
+	 * deregistering the miscdev and open.
+	 */
+	misc_deregister(&t->miscdev);
+
+	/*
+	 * We are now free to take the target's lock and drop our reference to
+	 * the target.  There are almost certainly tasks sleeping in read on at
+	 * least one of the channels associated with this target, this
+	 * explicitly wakes them up and terminates the read.
+	 */
+	mutex_lock(&t->lock);
+	/*
+	 * No barrier here, as wait/wake ensures that the flag visibility is
+	 * correct WRT the wake/sleep state of the target tasks.
+	 */
+	t->dm_destroyed = true;
+	wake_up_all(&t->wq);
+	target_put(t);
+}
+
+/*
+ * Consumes a BIO from device mapper, queueing it up for userspace.
+ */
+static int user_map(struct dm_target *ti, struct bio *bio)
+{
+	struct target *t;
+	struct message *entry;
+
+	t = target_from_target(ti);
+	/*
+	 * FIXME
+	 *
+	 * This seems like a bad idea.  Specifically, we take the target lock
+	 * directly on the IO path, and that lock may also be taken from a user
+	 * context.  The user context doesn't actively trigger anything that
+	 * may sleep while holding the lock, but this still seems like a bad
+	 * idea.
+	 *
+	 * The obvious way to fix this would be to use a proper queue, which
+	 * would result in no shared locks between the direct IO path and user
+	 * tasks.  I had a version that did this, but the head-of-line blocking
+	 * from the circular buffer resulted in us needing a fairly large
+	 * allocation in order to avoid situations in which the queue fills up
+	 * and everything goes off the rails.
+	 *
+	 * I could jump through some hoops to avoid a shared lock while still
+	 * allowing for a large queue, but I'm not actually sure that allowing
+	 * for very large queues is the right thing to do here.  Intuitively it
+	 * seems better to keep the queues small in here (essentially sized to
+	 * the user latency for performance reasons only) and signal up the
+	 * stack to start throttling IOs.  I don't see a way to do that
+	 * (returning DM_MAPIO_REQUEUE seems like it'd work, but doesn't do
+	 * that).
+	 *
+	 * The best way I could come up with to fix this would be to use a
+	 * two-lock concurrent queue that's of infinite size (ie, linked list
+	 * based), which would get rid of the explicit shared lock.  The
+	 * mempool spinlock would still be shared, but I could just defer the
+	 * free from dev_write to user_map (and probably a worker).
+	 */
+	mutex_lock(&t->lock);
+
+	/*
+	 * FIXME
+	 *
+	 * The assumption here is that there's no benefit to returning
+	 * DM_MAPIO_KILL as opposed to just erroring out the BIO, but I'm not
+	 * sure that's actually true -- for example, I could imagine users
+	 * expecting that submitted BIOs are unlikely to fail and therefore
+	 * relying on submission failure to indicate an unsupported type.
+	 *
+	 * There are two ways I can think of to fix this:
+	 *   - Add DM arguments that are parsed during the constructor that
+	 *     allow various dm_target flags to be set that indicate the op
+	 *     types supported by this target.  This may make sense for things
+	 *     like discard, where DM can already transform the BIOs to a form
+	 *     that's likely to be supported.
+	 *   - Some sort of pre-filter that allows userspace to hook in here
+	 *     and kill BIOs before marking them as submitted.  My guess would
+	 *     be that a userspace round trip is a bad idea here, but a BPF
+	 *     call seems reasonable.
+	 *
+	 * My guess is that we'd likely want to do both.  The first one is easy
+	 * and gives DM the proper info, so it seems better.  The BPF call
+	 * seems overly complex for just this, but one could imagine wanting to
+	 * sometimes return _MAPPED and a BPF filter would be the way to do
+	 * that.
+	 *
+	 * For example, in Android we have an in-kernel DM device called
+	 * "dm-bow" that takes advantage of some portion of the space that has
+	 * been discarded on a device to provide opportunistic block-level
+	 * backups.  While one could imagine just implementing this entirely in
+	 * userspace, that would come with an appreciable performance penalty.
+	 * Instead one could keep a BPF program that forwards most accesses
+	 * directly to the backing block device while informing a userspace
+	 * daemon of any discarded space and on writes to blocks that are to be
+	 * backed up.
+	 */
+	if (unlikely((bio_type_to_user_type(bio) < 0)
+		     || (bio_flags_to_user_flags(bio) < 0))) {
+		mutex_unlock(&t->lock);
+		pr_warn("dm-user: unsupported bio_op() %d\n", bio_op(bio));
+		return DM_MAPIO_KILL;
+	}
+
+	entry = msg_get_map(t);
+	if (unlikely(entry == NULL)) {
+		mutex_unlock(&t->lock);
+		pr_warn("dm-user: unable to allocate message\n");
+		return DM_MAPIO_KILL;
+	}
+
+	bio_get(bio);
+	entry->msg.type = bio_type_to_user_type(bio);
+	entry->msg.flags = bio_flags_to_user_flags(bio);
+	entry->msg.sector = bio->bi_iter.bi_sector;
+	entry->msg.len = bio_size(bio);
+	entry->bio = bio;
+	entry->posn_to_user = 0;
+	entry->total_to_user = bio_bytes_needed_to_user(bio);
+	entry->posn_from_user = 0;
+	entry->total_from_user = bio_bytes_needed_from_user(bio);
+	/* Pairs with the barrier in dev_read() */
+	smp_wmb();
+	list_add_tail(&entry->to_user, &t->to_user);
+	wake_up_interruptible(&t->wq);
+	mutex_unlock(&t->lock);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static struct target_type user_target = {
+	.name = "user",
+	.version = { 1, 0, 0 },
+	.module = THIS_MODULE,
+	.ctr = user_ctr,
+	.dtr = user_dtr,
+	.map = user_map,
+};
+
+static int __init dm_user_init(void)
+{
+	int r;
+
+	r = dm_register_target(&user_target);
+	if (r) {
+		DMERR("register failed %d", r);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	return r;
+}
+
+static void __exit dm_user_exit(void)
+{
+	dm_unregister_target(&user_target);
+}
+
+module_init(dm_user_init);
+module_exit(dm_user_exit);
+MODULE_AUTHOR("Palmer Dabbelt <palmerdabbelt@google.com>");
+MODULE_DESCRIPTION(DM_NAME " target returning blocks from userspace");
+MODULE_LICENSE("GPL");
-- 
2.29.2.454.gaff20da3a2-goog

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v1 4/5] selftests/dm-user: A handful of tests for dm-user
  2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-03 21:58   ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, song, Palmer Dabbelt, shuah, linux-doc, linux-kernel,
	linux-raid, linux-kselftest, kernel-team

From: Palmer Dabbelt <palmerdabbelt@google.com>

These tests rely on fsstress and fio to generate accesses to a block
device backed by a dm-user daemon.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

---

I've lumped these all together rather than splitting them up.  The tests
themselves are independent, but the associated build/run infrastructure
is pretty simple and I got tired of handling all the merge conflicts
that came from juggling each test as its own patch.

The tests themselves should be portable, but the harness will only run
in my environment (ie, QEMU).  That's kind of ugly, but I'm not really
sure how to do this in a more reasonable way.  I run the tests as
follows:

    $ rm -f scratch
    $ truncate scratch --size=10G
    $ qemu-system-x86_64 \
            -m 8G -smp 32 -cpu host -accel kvm \
            -kernel "${TREE}"/arch/x86_64/boot/bzImage \
            -initrd buildroot/output/images/rootfs.cpio \
            -append "console=ttyS0" \
            -drive file="${TREE}"/tools/testing/selftests/kselftest_install/kselftest-packages/kselftest.squashfs,if=virtio \
            -drive file=scratch,if=virtio \
            -nographic |& tee log
---
 tools/testing/selftests/.gitignore            |   3 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/dm-user/.gitignore    |   3 +
 tools/testing/selftests/dm-user/Makefile      |  23 ++
 tools/testing/selftests/dm-user/README        |  20 ++
 .../selftests/dm-user/daemon-example.c        | 186 ++++++++++++++
 .../selftests/dm-user/daemon-parallel.c       | 240 ++++++++++++++++++
 .../testing/selftests/dm-user/daemon-short.c  | 196 ++++++++++++++
 .../selftests/dm-user/fio-rand-read-1G.fio    |  16 ++
 .../selftests/dm-user/fio-verify-1G.fio       |  10 +
 .../testing/selftests/dm-user/harness-fio.sh  |  45 ++++
 .../selftests/dm-user/harness-fsstress.sh     |  44 ++++
 .../selftests/dm-user/include/logging.h       | 148 +++++++++++
 tools/testing/selftests/dm-user/run.sh        |  74 ++++++
 14 files changed, 1009 insertions(+)
 create mode 100644 tools/testing/selftests/dm-user/.gitignore
 create mode 100644 tools/testing/selftests/dm-user/Makefile
 create mode 100644 tools/testing/selftests/dm-user/README
 create mode 100644 tools/testing/selftests/dm-user/daemon-example.c
 create mode 100644 tools/testing/selftests/dm-user/daemon-parallel.c
 create mode 100644 tools/testing/selftests/dm-user/daemon-short.c
 create mode 100644 tools/testing/selftests/dm-user/fio-rand-read-1G.fio
 create mode 100644 tools/testing/selftests/dm-user/fio-verify-1G.fio
 create mode 100755 tools/testing/selftests/dm-user/harness-fio.sh
 create mode 100755 tools/testing/selftests/dm-user/harness-fsstress.sh
 create mode 100644 tools/testing/selftests/dm-user/include/logging.h
 create mode 100755 tools/testing/selftests/dm-user/run.sh

diff --git a/tools/testing/selftests/.gitignore b/tools/testing/selftests/.gitignore
index 055a5019b13c..88b1938ea5e6 100644
--- a/tools/testing/selftests/.gitignore
+++ b/tools/testing/selftests/.gitignore
@@ -8,3 +8,6 @@ tpm2/SpaceTest.log
 # Python bytecode and cache
 __pycache__/
 *.py[cod]
+
+# selftest install dir
+/kselftest_install/
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d9c283503159..f5e0f61c4384 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -9,6 +9,7 @@ TARGETS += clone3
 TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
+TARGETS += dm-user
 TARGETS += drivers/dma-buf
 TARGETS += efivarfs
 TARGETS += exec
diff --git a/tools/testing/selftests/dm-user/.gitignore b/tools/testing/selftests/dm-user/.gitignore
new file mode 100644
index 000000000000..7b0aa3e4a738
--- /dev/null
+++ b/tools/testing/selftests/dm-user/.gitignore
@@ -0,0 +1,3 @@
+/daemon-example
+/daemon-parallel
+/daemon-short
diff --git a/tools/testing/selftests/dm-user/Makefile b/tools/testing/selftests/dm-user/Makefile
new file mode 100644
index 000000000000..98ff4f5d0fad
--- /dev/null
+++ b/tools/testing/selftests/dm-user/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: GPL-2.0
+.PHONY: all clean
+
+top_srcdir = ../../../..
+INCLUDES := -I../ -Iinclude/ -I$(top_srcdir)/usr/include
+CFLAGS := $(CFLAGS) -g -O2 -Wall -static -D_GNU_SOURCE -pthread $(INCLUDES)
+KSFT_KHDR_INSTALL := 1
+
+TEST_GEN_FILES := \
+	daemon-example \
+	daemon-parallel \
+	daemon-short
+
+TEST_PROGS := \
+	fio-rand-read-1G.fio \
+	fio-verify-1G.fio \
+	harness-fio.sh \
+	harness-fsstress.sh \
+	run.sh
+
+$(TEST_GEN_FILES): khdr
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dm-user/README b/tools/testing/selftests/dm-user/README
new file mode 100644
index 000000000000..213de27db35d
--- /dev/null
+++ b/tools/testing/selftests/dm-user/README
@@ -0,0 +1,20 @@
+dm-user Tests
+=============
+Tests for dm-user.
+
+Quick Start
+-----------
+It's probably a bad idea to just run this blindly, but all you need to do is:
+
+# make
+# ./run.sh
+
+Slow Start
+----------
+These tests use `dmsetup` to manage device mapper nodes, which is part of lvm2.
+Some use `fio`, and some use the `fsstress` from xfstests.  Some of the tests
+also expect "/dev/vdb" to exist and to be at least 10G.
+
+I use a simple buildroot-based initramfs to run the tests.  I've added an
+xfstests package to get fsstress, but I haven't sent out the patches yet.  I
+run everything in QEMU.
diff --git a/tools/testing/selftests/dm-user/daemon-example.c b/tools/testing/selftests/dm-user/daemon-example.c
new file mode 100644
index 000000000000..b245fad192bf
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-example.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+int write_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = write(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = read(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int simple_daemon(char *control_dev,
+		  size_t block_bytes,
+		  char *store)
+
+{
+	int control_fd = open(control_dev, O_RDWR);
+
+	if (control_fd < 0) {
+		ksft_print_msg("Unable to open control device %s\n", control_dev);
+		return RET_FAIL;
+	}
+
+	while (1) {
+		struct dm_user_message msg;
+		__u64 type;
+		char *base;
+
+		if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to read msg");
+			return RET_FAIL;
+		}
+
+		base = store + msg.sector * SECTOR_SIZE;
+		if (base + msg.len > store + block_bytes) {
+			fprintf(stderr, "access out of bounds\n");
+			return RET_FAIL;
+		}
+
+		type = msg.type;
+		switch (type) {
+		case DM_USER_REQ_MAP_WRITE:
+			msg.type = DM_USER_RESP_SUCCESS;
+			if (read_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to read buf");
+				return RET_FAIL;
+			}
+			break;
+		case DM_USER_REQ_MAP_FLUSH:
+			/* Nothing extra to do on flush, we're in memory. */
+		case DM_USER_REQ_MAP_READ:
+			msg.type = DM_USER_RESP_SUCCESS;
+			break;
+		default:
+			msg.type = DM_USER_RESP_UNSUPPORTED;
+			break;
+		}
+
+		if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to write msg");
+			return RET_FAIL;
+		}
+
+		if (type == DM_USER_REQ_MAP_READ) {
+			if (write_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to write buf");
+				return RET_FAIL;
+			}
+		}
+	}
+
+	/* The daemon doesn't actually terminate for this test. */
+	perror("Unable to read from control device");
+	return RET_FAIL;
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h			Display this help message\n");
+	printf("  -v L			Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+	printf("  -c <control dev>	Control device to use for the test\n");
+	printf("  -s <sectors>		The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = RET_PASS;
+	int c;
+	char *control_dev = NULL;
+	long block_bytes = 1024;
+	char *store;
+
+	prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+	while ((c = getopt(argc, argv, "hv:c:s:")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		case 'c':
+			control_dev = strdup(optarg);
+			break;
+		case 's':
+			block_bytes = atoi(optarg) * SECTOR_SIZE;
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+	ksft_print_msg("%s: block_bytes=%zu\n",
+		       basename(argv[0]),
+		       block_bytes);
+
+	store = malloc(block_bytes);
+	for (size_t i = 0; i < block_bytes/sizeof(size_t); ++i)
+		((size_t *)(store))[i] = i;
+
+	ret = simple_daemon(control_dev, block_bytes, store);
+
+	print_result(basename(argv[0]), ret);
+	exit(ret);
+	return ret;
+}
diff --git a/tools/testing/selftests/dm-user/daemon-parallel.c b/tools/testing/selftests/dm-user/daemon-parallel.c
new file mode 100644
index 000000000000..9e5303f02241
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-parallel.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include <sys/mman.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+#define MAX_WORKER_COUNT 256
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+struct test_context {
+	char *control_dev;
+	size_t block_bytes;
+	char *store;
+	long worker_count;
+	char *backing_path;
+};
+
+int write_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = write(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = read(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+void *simple_daemon(void *context_uc)
+{
+	struct test_context *context = context_uc;
+	char *store = context->store;
+	int control_fd = open(context->control_dev, O_RDWR);
+
+	if (control_fd < 0) {
+		ksft_print_msg("Unable to open control device %s\n", context->control_dev);
+		return (void *)(RET_FAIL);
+	}
+
+	while (1) {
+		struct dm_user_message msg;
+		__u64 type;
+		char *base;
+
+		if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return (void *)(RET_PASS);
+
+			perror("unable to read msg");
+			return (void *)(RET_FAIL);
+		}
+
+		base = store + msg.sector * SECTOR_SIZE;
+		if (base + msg.len > store + context->block_bytes) {
+			fprintf(stderr, "access out of bounds\n");
+			return (void *)(RET_FAIL);
+		}
+
+		type = msg.type;
+		switch (type) {
+		case DM_USER_REQ_MAP_READ:
+			msg.type = DM_USER_RESP_SUCCESS;
+			break;
+		case DM_USER_REQ_MAP_WRITE:
+			msg.type = DM_USER_RESP_SUCCESS;
+			if (read_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return (void *)(RET_PASS);
+
+				perror("unable to read buf");
+				return (void *)(RET_FAIL);
+			}
+			break;
+		case DM_USER_REQ_MAP_FLUSH:
+			msg.type = DM_USER_RESP_SUCCESS;
+			sync();
+			break;
+		default:
+			msg.type = DM_USER_RESP_UNSUPPORTED;
+			break;
+		}
+
+		if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return (void *)(RET_PASS);
+
+			perror("unable to write msg");
+			return (void *)(RET_FAIL);
+		}
+
+		if (type == DM_USER_REQ_MAP_READ) {
+			if (write_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return (void *)(RET_PASS);
+
+				perror("unable to write buf");
+				return (void *)(RET_FAIL);
+			}
+		}
+	}
+
+	/* The daemon doesn't actually terminate for this test. */
+	perror("Unable to read from control device");
+	return (void *)(RET_FAIL);
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h			Display this help message\n");
+	printf("  -v L			Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+	printf("  -c <control dev>	Control device to use for the test\n");
+	printf("  -s <sectors>		The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = RET_PASS;
+	int done = 0;
+	int c;
+	struct test_context context = {
+		.control_dev	= NULL,
+		.block_bytes	= 0,
+		.worker_count   = 1,
+		.backing_path   = NULL,
+	};
+	pthread_t daemon[MAX_WORKER_COUNT];
+	void *pthread_ret;
+
+	prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+	while ((c = getopt(argc, argv, "hv:c:s:w:b:")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		case 'c':
+			context.control_dev = strdup(optarg);
+			break;
+		case 's':
+			context.block_bytes = atoi(optarg) * SECTOR_SIZE;
+			break;
+		case 'w':
+			context.worker_count = atoi(optarg);
+			break;
+		case 'b':
+			context.backing_path = strdup(optarg);
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+	ksft_print_msg("%s: block_bytes=%zu\n",
+		       basename(argv[0]),
+		       context.block_bytes);
+
+	ret = RET_PASS;
+
+	if (context.backing_path == NULL) {
+		ksft_print_msg("Using an in-memory backing store\n");
+		context.store = malloc(context.block_bytes);
+		for (size_t i = 0; i < context.block_bytes/sizeof(size_t); ++i)
+			((size_t *)(context.store))[i] = i;
+	} else {
+		int backing_fd = open(context.backing_path, O_RDWR);
+
+		ksft_print_msg("Using %s as a backing store\n", context.backing_path);
+		if (backing_fd < 0) {
+			perror("Unable to open backing store");
+			ksft_print_msg("Unable to open backing store %s\n", context.backing_path);
+			return RET_FAIL;
+		}
+
+		context.store = mmap(NULL, context.block_bytes,
+				     PROT_READ | PROT_WRITE, MAP_SHARED,
+				     backing_fd, 0);
+	}
+
+	for (size_t i = 0; i < context.worker_count; ++i)
+		if (pthread_create(&daemon[i], NULL, &simple_daemon, &context) < 0)
+			ret = RET_ERROR;
+
+	while (!done)  {
+		for (size_t i = 0; i < context.worker_count; ++i) {
+			if (pthread_tryjoin_np(daemon[i], &pthread_ret) == 0) {
+				if (pthread_ret != RET_PASS)
+					ret = RET_ERROR;
+				done = 1;
+			}
+		}
+
+		sleep(1);
+	}
+
+	print_result(basename(argv[0]), ret);
+	exit(ret);
+}
diff --git a/tools/testing/selftests/dm-user/daemon-short.c b/tools/testing/selftests/dm-user/daemon-short.c
new file mode 100644
index 000000000000..40fd114cb390
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-short.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+int write_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		size_t max = len - total;
+
+		if (max > 3)
+			max = max / 3;
+
+		once = write(fd, buf_c + total, max);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		size_t max = len - total;
+
+		if (max > 3)
+			max = max / 3;
+
+		once = read(fd, buf_c + total, max);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int simple_daemon(char *control_dev,
+		  size_t block_bytes,
+		  char *store)
+
+{
+	int control_fd = open(control_dev, O_RDWR);
+
+	if (control_fd < 0) {
+		ksft_print_msg("Unable to open control device %s\n", control_dev);
+		return RET_FAIL;
+	}
+
+	while (1) {
+		struct dm_user_message msg;
+		__u64 type;
+		char *base;
+
+		if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to read msg");
+			return RET_FAIL;
+		}
+
+		base = store + msg.sector * SECTOR_SIZE;
+		if (base + msg.len > store + block_bytes) {
+			fprintf(stderr, "access out of bounds\n");
+			return RET_FAIL;
+		}
+
+		type = msg.type;
+		switch (type) {
+		case DM_USER_REQ_MAP_WRITE:
+			msg.type = DM_USER_RESP_SUCCESS;
+			if (read_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to read buf");
+				return RET_FAIL;
+			}
+			break;
+		case DM_USER_REQ_MAP_FLUSH:
+			/* Nothing extra to do on flush, we're in memory. */
+		case DM_USER_REQ_MAP_READ:
+			msg.type = DM_USER_RESP_SUCCESS;
+			break;
+		default:
+			msg.type = DM_USER_RESP_UNSUPPORTED;
+			break;
+		}
+
+		if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to write msg");
+			return RET_FAIL;
+		}
+
+		if (type == DM_USER_REQ_MAP_READ) {
+			if (write_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to write buf");
+				return RET_FAIL;
+			}
+		}
+	}
+
+	/* The daemon doesn't actually terminate for this test. */
+	perror("Unable to read from control device");
+	return RET_FAIL;
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h			Display this help message\n");
+	printf("  -v L			Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+	printf("  -c <control dev>	Control device to use for the test\n");
+	printf("  -s <sectors>		The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = RET_PASS;
+	int c;
+	char *control_dev = NULL;
+	long block_bytes = 1024;
+	char *store;
+
+	prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+	while ((c = getopt(argc, argv, "hv:c:s:")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		case 'c':
+			control_dev = strdup(optarg);
+			break;
+		case 's':
+			block_bytes = atoi(optarg) * SECTOR_SIZE;
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+	ksft_print_msg("%s: block_bytes=%zu\n",
+		       basename(argv[0]),
+		       block_bytes);
+
+	store = malloc(block_bytes);
+	for (size_t i = 0; i < block_bytes/sizeof(size_t); ++i)
+		((size_t *)(store))[i] = i;
+
+	ret = simple_daemon(control_dev, block_bytes, store);
+
+	print_result(basename(argv[0]), ret);
+	exit(ret);
+	return ret;
+}
diff --git a/tools/testing/selftests/dm-user/fio-rand-read-1G.fio b/tools/testing/selftests/dm-user/fio-rand-read-1G.fio
new file mode 100644
index 000000000000..f971483e0e27
--- /dev/null
+++ b/tools/testing/selftests/dm-user/fio-rand-read-1G.fio
@@ -0,0 +1,16 @@
+; fio-rand-read.job for fiotest
+
+[global]
+name=fio-rand-read-1G
+filename=fio-rand-read-1G
+rw=randread
+bs=4K
+direct=0
+numjobs=1
+time_based=1
+runtime=30
+
+[file1]
+size=1G
+ioengine=io_uring
+iodepth=16
diff --git a/tools/testing/selftests/dm-user/fio-verify-1G.fio b/tools/testing/selftests/dm-user/fio-verify-1G.fio
new file mode 100644
index 000000000000..4b626271ce7c
--- /dev/null
+++ b/tools/testing/selftests/dm-user/fio-verify-1G.fio
@@ -0,0 +1,10 @@
+# The most basic form of data verification. Write the device randomly
+# in 4K chunks, then read it back and verify the contents.
+[write-and-verify]
+rw=randwrite
+bs=4k
+ioengine=libaio
+iodepth=16
+direct=1
+verify=crc32c
+size=1G
diff --git a/tools/testing/selftests/dm-user/harness-fio.sh b/tools/testing/selftests/dm-user/harness-fio.sh
new file mode 100755
index 000000000000..4b95c9f5efd8
--- /dev/null
+++ b/tools/testing/selftests/dm-user/harness-fio.sh
@@ -0,0 +1,45 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright 2020 Google, Inc
+
+# Just a fixed size for now, but it's passed to the tests and they're supposed
+# to respect it.
+SIZE=1024
+BLOCK=kselftest-dm-user-block
+CONTROL=kselftest-dm-user-control
+unset FIO
+
+while [ x"$1" != x"--" ]
+do
+    case "$1" in
+    "-s")    SIZE="$2";                             shift 2;;
+    "-f")    FIO="$2";                              shift 2;;
+    *)       echo "$0: unknown argument $1" >&2;    exit  1;;
+    esac
+done
+shift
+
+# Run the benchmark again via dm-user, to see what the overhead is.
+dmsetup create $BLOCK << EOF
+0 $SIZE user 0 $SIZE $CONTROL
+EOF
+
+dmsetup resume $BLOCK
+
+"$@" -s $SIZE -c /dev/dm-user/$CONTROL &
+
+yes | mkfs.ext2 /dev/mapper/$BLOCK
+mount /dev/mapper/$BLOCK /mnt
+cp "$FIO" /mnt/benchmark.fio
+(cd /mnt; fio benchmark.fio)
+umount /mnt
+
+# Mount again and read the whole thing, just to see if there's any corruption.
+mount /dev/mapper/$BLOCK /mnt
+find /mnt -type f | xargs cat > /dev/null
+umount /mnt
+
+dmsetup remove $BLOCK
+
+# Make sure the daemon actually responds to DM closing it.
+wait
diff --git a/tools/testing/selftests/dm-user/harness-fsstress.sh b/tools/testing/selftests/dm-user/harness-fsstress.sh
new file mode 100755
index 000000000000..265c0ff636ca
--- /dev/null
+++ b/tools/testing/selftests/dm-user/harness-fsstress.sh
@@ -0,0 +1,44 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright 2020 Google, Inc
+
+BLOCK=kselftest-dm-user-block
+CONTROL=kselftest-dm-user-control
+unset SIZE
+unset NPROC
+unset NOP
+
+while [ x"$1" != x"--" ]
+do
+    case "$1" in
+    "-s")    SIZE="$2";                             shift 2;;
+    "-n")    NOP="$2";                              shift 2;;
+    "-p")    NPROC="$2";                            shift 2;;
+    *)       echo "$0: unknown argument $1" >&2;    exit  1;;
+    esac
+done
+shift
+
+# Runs the fs stress tests
+dmsetup create $BLOCK << EOF
+0 $SIZE user 0 $SIZE $CONTROL
+EOF
+
+dmsetup resume $BLOCK
+
+"$@" -s $SIZE -c /dev/dm-user/$CONTROL &
+
+yes | mkfs.ext2 /dev/mapper/$BLOCK
+mount /dev/mapper/$BLOCK /mnt
+/usr/xfstests/ltp/fsstress -d /mnt/ -n "$NOP" -p "$NPROC"
+umount /mnt
+
+# Mount again and read the whole thing, just to see if there's any corruption.
+mount /dev/mapper/$BLOCK /mnt
+find /mnt -type f | xargs cat > /dev/null
+umount /mnt
+
+dmsetup remove $BLOCK
+
+# Make sure the daemon actually responds to DM closing it.
+wait
diff --git a/tools/testing/selftests/dm-user/include/logging.h b/tools/testing/selftests/dm-user/include/logging.h
new file mode 100644
index 000000000000..874c69ce5cce
--- /dev/null
+++ b/tools/testing/selftests/dm-user/include/logging.h
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/******************************************************************************
+ *
+ *   Copyright © International Business Machines  Corp., 2009
+ *
+ * DESCRIPTION
+ *      Glibc independent futex library for testing kernel functionality.
+ *
+ * AUTHOR
+ *      Darren Hart <dvhart@linux.intel.com>
+ *
+ * HISTORY
+ *      2009-Nov-6: Initial version by Darren Hart <dvhart@linux.intel.com>
+ *
+ *****************************************************************************/
+
+#ifndef _LOGGING_H
+#define _LOGGING_H
+
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/futex.h>
+#include "kselftest.h"
+
+/*
+ * Define PASS, ERROR, and FAIL strings with and without color escape
+ * sequences, default to no color.
+ */
+#define ESC 0x1B, '['
+#define BRIGHT '1'
+#define GREEN '3', '2'
+#define YELLOW '3', '3'
+#define RED '3', '1'
+#define ESCEND 'm'
+#define BRIGHT_GREEN ESC, BRIGHT, ';', GREEN, ESCEND
+#define BRIGHT_YELLOW ESC, BRIGHT, ';', YELLOW, ESCEND
+#define BRIGHT_RED ESC, BRIGHT, ';', RED, ESCEND
+#define RESET_COLOR ESC, '0', 'm'
+static const char PASS_COLOR[] = {BRIGHT_GREEN, ' ', 'P', 'A', 'S', 'S',
+				  RESET_COLOR, 0};
+static const char ERROR_COLOR[] = {BRIGHT_YELLOW, 'E', 'R', 'R', 'O', 'R',
+				   RESET_COLOR, 0};
+static const char FAIL_COLOR[] = {BRIGHT_RED, ' ', 'F', 'A', 'I', 'L',
+				  RESET_COLOR, 0};
+static const char INFO_NORMAL[] = " INFO";
+static const char PASS_NORMAL[] = " PASS";
+static const char ERROR_NORMAL[] = "ERROR";
+static const char FAIL_NORMAL[] = " FAIL";
+const char *INFO = INFO_NORMAL;
+const char *PASS = PASS_NORMAL;
+const char *ERROR = ERROR_NORMAL;
+const char *FAIL = FAIL_NORMAL;
+
+/* Verbosity setting for INFO messages */
+#define VQUIET    0
+#define VCRITICAL 1
+#define VINFO     2
+#define VMAX      VINFO
+int _verbose = VCRITICAL;
+
+/* Functional test return codes */
+#define RET_PASS   0
+#define RET_ERROR -1
+#define RET_FAIL  -2
+
+/**
+ * log_color() - Use colored output for PASS, ERROR, and FAIL strings
+ * @use_color:	use color (1) or not (0)
+ */
+void log_color(int use_color)
+{
+	if (use_color) {
+		PASS = PASS_COLOR;
+		ERROR = ERROR_COLOR;
+		FAIL = FAIL_COLOR;
+	} else {
+		PASS = PASS_NORMAL;
+		ERROR = ERROR_NORMAL;
+		FAIL = FAIL_NORMAL;
+	}
+}
+
+/**
+ * log_verbosity() - Set verbosity of test output
+ * @level:	Verbosity level: VQUIET, VCRITICAL, or VINFO
+ *
+ * INFO messages are printed at VINFO and above; FAIL and ERROR messages
+ * are printed at VCRITICAL and above.
+ */
+void log_verbosity(int level)
+{
+	if (level > VMAX)
+		level = VMAX;
+	else if (level < 0)
+		level = 0;
+	_verbose = level;
+}
+
+/**
+ * print_result() - Print standard PASS | ERROR | FAIL results
+ * @test_name:	the name of the test
+ * @ret:	the return value to be considered: RET_PASS | RET_ERROR | RET_FAIL
+ * print_result() is primarily intended for functional tests.
+ */
+void print_result(const char *test_name, int ret)
+{
+	switch (ret) {
+	case RET_PASS:
+		ksft_test_result_pass("%s\n", test_name);
+		ksft_print_cnts();
+		return;
+	case RET_ERROR:
+		ksft_test_result_error("%s\n", test_name);
+		ksft_print_cnts();
+		return;
+	case RET_FAIL:
+		ksft_test_result_fail("%s\n", test_name);
+		ksft_print_cnts();
+		return;
+	}
+}
+
+/* log level macros */
+#define info(message, vargs...) \
+do { \
+	if (_verbose >= VINFO) \
+		fprintf(stderr, "\t%s: "message, INFO, ##vargs); \
+} while (0)
+
+#define error(message, err, args...) \
+do { \
+	if (_verbose >= VCRITICAL) {\
+		if (err) \
+			fprintf(stderr, "\t%s: %s: "message, \
+				ERROR, strerror(err), ##args); \
+		else \
+			fprintf(stderr, "\t%s: "message, ERROR, ##args); \
+	} \
+} while (0)
+
+#define fail(message, args...) \
+do { \
+	if (_verbose >= VCRITICAL) \
+		fprintf(stderr, "\t%s: "message, FAIL, ##args); \
+} while (0)
+
+#endif
diff --git a/tools/testing/selftests/dm-user/run.sh b/tools/testing/selftests/dm-user/run.sh
new file mode 100755
index 000000000000..2ed2581e4a57
--- /dev/null
+++ b/tools/testing/selftests/dm-user/run.sh
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright 2020 Palmer Dabbelt <palmerdabbelt@google.com>
+
+# Top-level run script for dm-user kernel self tests.  This just runs a bunch
+# of different tests back to back, relying on the kernel selftest infrastructure
+# to tease out the success/failure of each.  The tests all use the same global
+# directories and such, so there isn't much isolation between the runs.
+#
+# The actual test code should be fairly portable, but the scripts that run it
+# aren't.  See the README for more information.
+
+# Runs various FIO scripts against an ext2-based filesystem backed by dm-user.
+if test -e /usr/bin/fio
+then
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-example
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-short
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   1
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   4
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  16
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  64
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 256
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   1 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   4 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  16 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  64 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 256 -b /dev/vdb
+
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-example
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-short
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   1
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   4
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  16
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  64
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w 256
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   1 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   4 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  16 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  64 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w 256 -b /dev/vdb
+else
+    echo "Unable to find /usr/bin/fio"
+fi
+
+# Runs fsstress from xfstests against an ext2-based filesystem backed by
+# dm-user.
+if test -e /usr/xfstests/ltp/fsstress
+then
+    ./harness-fsstress.sh -s 3000000 -p   1 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p   4 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p  16 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p  64 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p 256 -n 10000 -- ./daemon-example
+
+    ./harness-fsstress.sh -s 3000000 -p   1 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p   4 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p  16 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p  64 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p 256 -n 10000 -- ./daemon-short
+
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    1
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    4
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   16
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   64
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w  256
+
+    ./harness-fsstress.sh -s 3000000 -p  1 -n 10000 -- ./daemon-parallel -w    1 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    1 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    4 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   16 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   64 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w  256 -b /dev/vdb
+else
+    echo "Unable to find /usr/xfstests/ltp/fsstress"
+fi
-- 
2.29.2.454.gaff20da3a2-goog



* [dm-devel] [PATCH v1 4/5] selftests/dm-user: A handful of tests for dm-user
@ 2020-12-03 21:58   ` Palmer Dabbelt
  0 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, linux-doc, kernel-team, Palmer Dabbelt, linux-kernel,
	linux-raid, song, linux-kselftest, shuah

From: Palmer Dabbelt <palmerdabbelt@google.com>

These tests rely on fsstress and fio to generate accesses to a block
device backed by a dm-user daemon.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

---

I've lumped these all together rather than splitting them up.  The tests
themselves are independent, but the associated build/run infrastructure
is pretty simple and I got tired of handling all the merge conflicts
that came from juggling each test as its own patch.

The tests themselves should be portable, but the harness will only run
in my environment (ie, QEMU).  That's kind of ugly, but I'm not really
sure how to do this in a more reasonable way.  I run the tests as
follows:

    $ rm -f scratch
    $ truncate scratch --size=10G
    $ qemu-system-x86_64 \
            -m 8G -smp 32 -cpu host -accel kvm \
            -kernel "${TREE}"/arch/x86_64/boot/bzImage \
            -initrd buildroot/output/images/rootfs.cpio \
            -append "console=ttyS0" \
            -drive file="${TREE}"/tools/testing/selftests/kselftest_install/kselftest-packages/kselftest.squashfs,if=virtio \
            -drive file=scratch,if=virtio \
            -nographic |& tee log
---
 tools/testing/selftests/.gitignore            |   3 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/dm-user/.gitignore    |   3 +
 tools/testing/selftests/dm-user/Makefile      |  23 ++
 tools/testing/selftests/dm-user/README        |  20 ++
 .../selftests/dm-user/daemon-example.c        | 186 ++++++++++++++
 .../selftests/dm-user/daemon-parallel.c       | 240 ++++++++++++++++++
 .../testing/selftests/dm-user/daemon-short.c  | 196 ++++++++++++++
 .../selftests/dm-user/fio-rand-read-1G.fio    |  16 ++
 .../selftests/dm-user/fio-verify-1G.fio       |  10 +
 .../testing/selftests/dm-user/harness-fio.sh  |  45 ++++
 .../selftests/dm-user/harness-fsstress.sh     |  44 ++++
 .../selftests/dm-user/include/logging.h       | 148 +++++++++++
 tools/testing/selftests/dm-user/run.sh        |  74 ++++++
 14 files changed, 1009 insertions(+)
 create mode 100644 tools/testing/selftests/dm-user/.gitignore
 create mode 100644 tools/testing/selftests/dm-user/Makefile
 create mode 100644 tools/testing/selftests/dm-user/README
 create mode 100644 tools/testing/selftests/dm-user/daemon-example.c
 create mode 100644 tools/testing/selftests/dm-user/daemon-parallel.c
 create mode 100644 tools/testing/selftests/dm-user/daemon-short.c
 create mode 100644 tools/testing/selftests/dm-user/fio-rand-read-1G.fio
 create mode 100644 tools/testing/selftests/dm-user/fio-verify-1G.fio
 create mode 100755 tools/testing/selftests/dm-user/harness-fio.sh
 create mode 100755 tools/testing/selftests/dm-user/harness-fsstress.sh
 create mode 100644 tools/testing/selftests/dm-user/include/logging.h
 create mode 100755 tools/testing/selftests/dm-user/run.sh

diff --git a/tools/testing/selftests/.gitignore b/tools/testing/selftests/.gitignore
index 055a5019b13c..88b1938ea5e6 100644
--- a/tools/testing/selftests/.gitignore
+++ b/tools/testing/selftests/.gitignore
@@ -8,3 +8,6 @@ tpm2/SpaceTest.log
 # Python bytecode and cache
 __pycache__/
 *.py[cod]
+
+# selftest install dir
+/kselftest_install/
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d9c283503159..f5e0f61c4384 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -9,6 +9,7 @@ TARGETS += clone3
 TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
+TARGETS += dm-user
 TARGETS += drivers/dma-buf
 TARGETS += efivarfs
 TARGETS += exec
diff --git a/tools/testing/selftests/dm-user/.gitignore b/tools/testing/selftests/dm-user/.gitignore
new file mode 100644
index 000000000000..7b0aa3e4a738
--- /dev/null
+++ b/tools/testing/selftests/dm-user/.gitignore
@@ -0,0 +1,3 @@
+/daemon-example
+/daemon-parallel
+/daemon-short
diff --git a/tools/testing/selftests/dm-user/Makefile b/tools/testing/selftests/dm-user/Makefile
new file mode 100644
index 000000000000..98ff4f5d0fad
--- /dev/null
+++ b/tools/testing/selftests/dm-user/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: GPL-2.0
+.PHONY: all clean
+
+top_srcdir = ../../../..
+INCLUDES := -I../ -Iinclude/ -I$(top_srcdir)/usr/include
+CFLAGS := $(CFLAGS) -g -O2 -Wall -static -D_GNU_SOURCE -pthread $(INCLUDES)
+KSFT_KHDR_INSTALL := 1
+
+TEST_GEN_FILES := \
+	daemon-example \
+	daemon-parallel \
+	daemon-short
+
+TEST_PROGS := \
+	fio-rand-read-1G.fio \
+	fio-verify-1G.fio \
+	harness-fio.sh \
+	harness-fsstress.sh \
+	run.sh
+
+$(TEST_GEN_FILES): khdr
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dm-user/README b/tools/testing/selftests/dm-user/README
new file mode 100644
index 000000000000..213de27db35d
--- /dev/null
+++ b/tools/testing/selftests/dm-user/README
@@ -0,0 +1,20 @@
+dm-user Tests
+=============
+Tests for dm-user.
+
+Quick Start
+-----------
+It's probably a bad idea to just run this blindly, but all you need to do is:
+
+# make
+# ./run.sh
+
+Slow Start
+----------
+These tests use `dmsetup` to manage device mapper nodes, which is part of lvm2.
+Some use `fio`, and some use the `fsstress` from xfstests.  Some of the tests
+also expect "/dev/vdb" to exist and to be at least 10G.
+
+I use a simple buildroot-based initramfs to run the tests.  I've added an
+xfstests package to get fsstress, but I haven't sent out the patches yet.  I
+run everything in QEMU.
diff --git a/tools/testing/selftests/dm-user/daemon-example.c b/tools/testing/selftests/dm-user/daemon-example.c
new file mode 100644
index 000000000000..b245fad192bf
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-example.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+int write_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = write(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = read(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int simple_daemon(char *control_dev,
+		  size_t block_bytes,
+		  char *store)
+
+{
+	int control_fd = open(control_dev, O_RDWR);
+
+	if (control_fd < 0) {
+		ksft_print_msg("Unable to open control device %s\n", control_dev);
+		return RET_FAIL;
+	}
+
+	while (1) {
+		struct dm_user_message msg;
+		__u64 type;
+		char *base;
+
+		if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to read msg");
+			return RET_FAIL;
+		}
+
+		base = store + msg.sector * SECTOR_SIZE;
+		if (base + msg.len > store + block_bytes) {
+			fprintf(stderr, "access out of bounds\n");
+			return RET_FAIL;
+		}
+
+		type = msg.type;
+		switch (type) {
+		case DM_USER_REQ_MAP_WRITE:
+			msg.type = DM_USER_RESP_SUCCESS;
+			if (read_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to read buf");
+				return RET_FAIL;
+			}
+			break;
+		case DM_USER_REQ_MAP_FLUSH:
+			/* Nothing extra to do on flush, we're in memory. */
+		case DM_USER_REQ_MAP_READ:
+			msg.type = DM_USER_RESP_SUCCESS;
+			break;
+		default:
+			msg.type = DM_USER_RESP_UNSUPPORTED;
+			break;
+		}
+
+		if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to write msg");
+			return RET_FAIL;
+		}
+
+		if (type == DM_USER_REQ_MAP_READ) {
+			if (write_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to write buf");
+				return RET_FAIL;
+			}
+		}
+	}
+
+	/* The daemon doesn't actually terminate for this test. */
+	perror("Unable to read from control device");
+	return RET_FAIL;
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h			Display this help message\n");
+	printf("  -v L			Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+	printf("  -c <control dev>	Control device to use for the test\n");
+	printf("  -s <sectors>		The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = RET_PASS;
+	int c;
+	char *control_dev = NULL;
+	long block_bytes = 1024;
+	char *store;
+
+	prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+	while ((c = getopt(argc, argv, "hv:c:s:")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		case 'c':
+			control_dev = strdup(optarg);
+			break;
+		case 's':
+			block_bytes = atoi(optarg) * SECTOR_SIZE;
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+	ksft_print_msg("%s: block_bytes=%ld\n",
+		       basename(argv[0]),
+		       block_bytes);
+
+	store = malloc(block_bytes);
+	for (size_t i = 0; i < block_bytes/sizeof(size_t); ++i)
+		((size_t *)(store))[i] = i;
+
+	ret = simple_daemon(control_dev, block_bytes, store);
+
+	print_result(basename(argv[0]), ret);
+	exit(ret);
+	return ret;
+}
diff --git a/tools/testing/selftests/dm-user/daemon-parallel.c b/tools/testing/selftests/dm-user/daemon-parallel.c
new file mode 100644
index 000000000000..9e5303f02241
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-parallel.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include <sys/mman.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+#define MAX_WORKER_COUNT 256
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+struct test_context {
+	char *control_dev;
+	size_t block_bytes;
+	char *store;
+	long worker_count;
+	char *backing_path;
+};
+
+int write_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = write(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		once = read(fd, buf_c + total, len - total);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+void *simple_daemon(void *context_uc)
+{
+	struct test_context *context = context_uc;
+	char *store = context->store;
+	int control_fd = open(context->control_dev, O_RDWR);
+
+	if (control_fd < 0) {
+		ksft_print_msg("Unable to open control device %s\n", context->control_dev);
+		return (void *)(RET_FAIL);
+	}
+
+	while (1) {
+		struct dm_user_message msg;
+		__u64 type;
+		char *base;
+
+		if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return (void *)(RET_PASS);
+
+			perror("unable to read msg");
+			return (void *)(RET_FAIL);
+		}
+
+		base = store + msg.sector * SECTOR_SIZE;
+		if (base + msg.len > store + context->block_bytes) {
+			fprintf(stderr, "access out of bounds\n");
+			return (void *)(RET_FAIL);
+		}
+
+		type = msg.type;
+		switch (type) {
+		case DM_USER_REQ_MAP_READ:
+			msg.type = DM_USER_RESP_SUCCESS;
+			break;
+		case DM_USER_REQ_MAP_WRITE:
+			msg.type = DM_USER_RESP_SUCCESS;
+			if (read_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return (void *)(RET_PASS);
+
+				perror("unable to read buf");
+				return (void *)(RET_FAIL);
+			}
+			break;
+		case DM_USER_REQ_MAP_FLUSH:
+			msg.type = DM_USER_RESP_SUCCESS;
+			sync();
+			break;
+		default:
+			msg.type = DM_USER_RESP_UNSUPPORTED;
+			break;
+		}
+
+		if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return (void *)(RET_PASS);
+
+			perror("unable to write msg");
+			return (void *)(RET_FAIL);
+		}
+
+		if (type == DM_USER_REQ_MAP_READ) {
+			if (write_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return (void *)(RET_PASS);
+
+				perror("unable to write buf");
+				return (void *)(RET_FAIL);
+			}
+		}
+	}
+
+	/* The daemon doesn't actually terminate for this test. */
+	perror("Unable to read from control device");
+	return (void *)(RET_FAIL);
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h			Display this help message\n");
+	printf("  -v L			Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+	printf("  -c <control dev>	Control device to use for the test\n");
+	printf("  -s <sectors>		The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = RET_PASS;
+	int done = 0;
+	int c;
+	struct test_context context = {
+		.control_dev	= NULL,
+		.block_bytes	= 0,
+		.worker_count   = 1,
+		.backing_path   = NULL,
+	};
+	pthread_t daemon[MAX_WORKER_COUNT];
+	void *pthread_ret;
+
+	prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+	while ((c = getopt(argc, argv, "hv:c:s:w:b:")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		case 'c':
+			context.control_dev = strdup(optarg);
+			break;
+		case 's':
+			context.block_bytes = atoi(optarg) * SECTOR_SIZE;
+			break;
+		case 'w':
+			context.worker_count = atoi(optarg);
+			break;
+		case 'b':
+			context.backing_path = strdup(optarg);
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+	ksft_print_msg("%s: block_bytes=%zu\n",
+		       basename(argv[0]),
+		       context.block_bytes);
+
+	ret = RET_PASS;
+
+	if (context.backing_path == NULL) {
+		ksft_print_msg("Using an in-memory backing store\n");
+		context.store = malloc(context.block_bytes);
+		for (size_t i = 0; i < context.block_bytes/sizeof(size_t); ++i)
+			((size_t *)(context.store))[i] = i;
+	} else {
+		int backing_fd = open(context.backing_path, O_RDWR);
+
+		ksft_print_msg("Using %s as a backing store\n", context.backing_path);
+		if (backing_fd < 0) {
+			perror("Unable to open backing store");
+			ksft_print_msg("Unable to open backing store %s\n", context.backing_path);
+			return RET_FAIL;
+		}
+
+		context.store = mmap(NULL, context.block_bytes,
+				     PROT_READ | PROT_WRITE, MAP_SHARED,
+				     backing_fd, 0);
+	}
+
+	for (size_t i = 0; i < context.worker_count; ++i)
+		if (pthread_create(&daemon[i], NULL, &simple_daemon, &context) < 0)
+			ret = RET_ERROR;
+
+	while (!done)  {
+		for (size_t i = 0; i < context.worker_count; ++i) {
+			if (pthread_tryjoin_np(daemon[i], &pthread_ret) == 0) {
+				if (pthread_ret != RET_PASS)
+					ret = RET_ERROR;
+				done = 1;
+			}
+		}
+
+		sleep(1);
+	}
+
+	print_result(basename(argv[0]), ret);
+	exit(ret);
+}
diff --git a/tools/testing/selftests/dm-user/daemon-short.c b/tools/testing/selftests/dm-user/daemon-short.c
new file mode 100644
index 000000000000..40fd114cb390
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-short.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+int write_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		size_t max = len - total;
+
+		if (max > 3)
+			max = max / 3;
+
+		once = write(fd, buf_c + total, max);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+	char *buf_c = buf;
+	ssize_t total = 0;
+	ssize_t once;
+
+	while (total < len) {
+		size_t max = len - total;
+
+		if (max > 3)
+			max = max / 3;
+
+		once = read(fd, buf_c + total, max);
+		if (once <= 0)
+			return once;
+		total += once;
+	}
+
+	return total;
+}
+
+int simple_daemon(char *control_dev,
+		  size_t block_bytes,
+		  char *store)
+
+{
+	int control_fd = open(control_dev, O_RDWR);
+
+	if (control_fd < 0) {
+		ksft_print_msg("Unable to open control device %s\n", control_dev);
+		return RET_FAIL;
+	}
+
+	while (1) {
+		struct dm_user_message msg;
+		__u64 type;
+		char *base;
+
+		if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to read msg");
+			return RET_FAIL;
+		}
+
+		base = store + msg.sector * SECTOR_SIZE;
+		if (base + msg.len > store + block_bytes) {
+			fprintf(stderr, "access out of bounds\n");
+			return RET_FAIL;
+		}
+
+		type = msg.type;
+		switch (type) {
+		case DM_USER_REQ_MAP_WRITE:
+			msg.type = DM_USER_RESP_SUCCESS;
+			if (read_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to read buf");
+				return RET_FAIL;
+			}
+			break;
+		case DM_USER_REQ_MAP_FLUSH:
+			/* Nothing extra to do on flush, we're in memory. */
+		case DM_USER_REQ_MAP_READ:
+			msg.type = DM_USER_RESP_SUCCESS;
+			break;
+		default:
+			msg.type = DM_USER_RESP_UNSUPPORTED;
+			break;
+		}
+
+		if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+			if (errno == ENOTBLK)
+				return RET_PASS;
+
+			perror("unable to write msg");
+			return RET_FAIL;
+		}
+
+		if (type == DM_USER_REQ_MAP_READ) {
+			if (write_all(control_fd, base, msg.len) < 0) {
+				if (errno == ENOTBLK)
+					return RET_PASS;
+
+				perror("unable to write buf");
+				return RET_FAIL;
+			}
+		}
+	}
+
+	/* The daemon doesn't actually terminate for this test. */
+	perror("Unable to read from control device");
+	return RET_FAIL;
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h			Display this help message\n");
+	printf("  -v L			Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+	printf("  -c <control dev>	Control device to use for the test\n");
+	printf("  -s <sectors>		The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = RET_PASS;
+	int c;
+	char *control_dev = NULL;
+	long block_bytes = 1024;
+	char *store;
+
+	prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+	while ((c = getopt(argc, argv, "hv:c:s:")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		case 'c':
+			control_dev = strdup(optarg);
+			break;
+		case 's':
+			block_bytes = atoi(optarg) * SECTOR_SIZE;
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+	ksft_print_msg("%s: block_bytes=%ld\n",
+		       basename(argv[0]),
+		       block_bytes);
+
+	store = malloc(block_bytes);
+	for (size_t i = 0; i < block_bytes/sizeof(size_t); ++i)
+		((size_t *)(store))[i] = i;
+
+	ret = simple_daemon(control_dev, block_bytes, store);
+
+	print_result(basename(argv[0]), ret);
+	exit(ret);
+	return ret;
+}
diff --git a/tools/testing/selftests/dm-user/fio-rand-read-1G.fio b/tools/testing/selftests/dm-user/fio-rand-read-1G.fio
new file mode 100644
index 000000000000..f971483e0e27
--- /dev/null
+++ b/tools/testing/selftests/dm-user/fio-rand-read-1G.fio
@@ -0,0 +1,16 @@
+; fio-rand-read.job for fiotest
+
+[global]
+name=fio-rand-read-1G
+filename=fio-rand-read-1G
+rw=randread
+bs=4K
+direct=0
+numjobs=1
+time_based=1
+runtime=30
+
+[file1]
+size=1G
+ioengine=io_uring
+iodepth=16
diff --git a/tools/testing/selftests/dm-user/fio-verify-1G.fio b/tools/testing/selftests/dm-user/fio-verify-1G.fio
new file mode 100644
index 000000000000..4b626271ce7c
--- /dev/null
+++ b/tools/testing/selftests/dm-user/fio-verify-1G.fio
@@ -0,0 +1,10 @@
+# The most basic form of data verification. Write the device randomly
+# in 4K chunks, then read it back and verify the contents.
+[write-and-verify]
+rw=randwrite
+bs=4k
+ioengine=libaio
+iodepth=16
+direct=1
+verify=crc32c
+size=1G
diff --git a/tools/testing/selftests/dm-user/harness-fio.sh b/tools/testing/selftests/dm-user/harness-fio.sh
new file mode 100755
index 000000000000..4b95c9f5efd8
--- /dev/null
+++ b/tools/testing/selftests/dm-user/harness-fio.sh
@@ -0,0 +1,45 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright 2020 Google, Inc
+
+# Just a fixed size for now, but it's passed to the tests and they're supposed
+# to respect it.
+SIZE=1024
+BLOCK=kselftest-dm-user-block
+CONTROL=kselftest-dm-user-control
+unset FIO
+
+while [ x"$1" != x"--" ]
+do
+    case "$1" in
+    "-s")    SIZE="$2";                             shift 2;;
+    "-f")    FIO="$2";                              shift 2;;
+    *)       echo "$0: unknown argument $1" >&2;    exit  1;;
+    esac
+done
+shift
+
+# Run the benchmark via dm-user, using an ext2 filesystem on top of the target.
+dmsetup create $BLOCK << EOF
+0 $SIZE user 0 $SIZE $CONTROL
+EOF
+
+dmsetup resume $BLOCK
+
+"$@" -s $SIZE -c /dev/dm-user/$CONTROL &
+
+yes | mkfs.ext2 /dev/mapper/$BLOCK
+mount /dev/mapper/$BLOCK /mnt
+cp "$FIO" /mnt/benchmark.fio
+(cd /mnt; fio benchmark.fio)
+umount /mnt
+
+# Mount again and read the whole thing, just to see if there's any corruption.
+mount /dev/mapper/$BLOCK /mnt
+find /mnt -type f | xargs cat > /dev/null
+umount /mnt
+
+dmsetup remove $BLOCK
+
+# Make sure the daemon actually responds to DM closing it.
+wait
diff --git a/tools/testing/selftests/dm-user/harness-fsstress.sh b/tools/testing/selftests/dm-user/harness-fsstress.sh
new file mode 100755
index 000000000000..265c0ff636ca
--- /dev/null
+++ b/tools/testing/selftests/dm-user/harness-fsstress.sh
@@ -0,0 +1,44 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright 2020 Google, Inc
+
+BLOCK=kselftest-dm-user-block
+CONTROL=kselftest-dm-user-control
+unset SIZE
+unset NPROC
+unset NOP
+
+while [ x"$1" != x"--" ]
+do
+    case "$1" in
+    "-s")    SIZE="$2";                             shift 2;;
+    "-n")    NOP="$2";                              shift 2;;
+    "-p")    NPROC="$2";                            shift 2;;
+    *)       echo "$0: unknown argument $1" >&2;    exit  1;;
+    esac
+done
+shift
+
+# Runs the fs stress tests
+dmsetup create $BLOCK << EOF
+0 $SIZE user 0 $SIZE $CONTROL
+EOF
+
+dmsetup resume $BLOCK
+
+"$@" -s $SIZE -c /dev/dm-user/$CONTROL &
+
+yes | mkfs.ext2 /dev/mapper/$BLOCK
+mount /dev/mapper/$BLOCK /mnt
+/usr/xfstests/ltp/fsstress -d /mnt/ -n "$NOP" -p "$NPROC"
+umount /mnt
+
+# Mount again and read the whole thing, just to see if there's any corruption.
+mount /dev/mapper/$BLOCK /mnt
+find /mnt -type f | xargs cat > /dev/null
+umount /mnt
+
+dmsetup remove $BLOCK
+
+# Make sure the daemon actually responds to DM closing it.
+wait
diff --git a/tools/testing/selftests/dm-user/include/logging.h b/tools/testing/selftests/dm-user/include/logging.h
new file mode 100644
index 000000000000..874c69ce5cce
--- /dev/null
+++ b/tools/testing/selftests/dm-user/include/logging.h
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/******************************************************************************
+ *
+ *   Copyright © International Business Machines  Corp., 2009
+ *
+ * DESCRIPTION
+ *      Kselftest logging helpers (adapted from the futex selftests).
+ *
+ * AUTHOR
+ *      Darren Hart <dvhart@linux.intel.com>
+ *
+ * HISTORY
+ *      2009-Nov-6: Initial version by Darren Hart <dvhart@linux.intel.com>
+ *
+ *****************************************************************************/
+
+#ifndef _LOGGING_H
+#define _LOGGING_H
+
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/futex.h>
+#include "kselftest.h"
+
+/*
+ * Define PASS, ERROR, and FAIL strings with and without color escape
+ * sequences, default to no color.
+ */
+#define ESC 0x1B, '['
+#define BRIGHT '1'
+#define GREEN '3', '2'
+#define YELLOW '3', '3'
+#define RED '3', '1'
+#define ESCEND 'm'
+#define BRIGHT_GREEN ESC, BRIGHT, ';', GREEN, ESCEND
+#define BRIGHT_YELLOW ESC, BRIGHT, ';', YELLOW, ESCEND
+#define BRIGHT_RED ESC, BRIGHT, ';', RED, ESCEND
+#define RESET_COLOR ESC, '0', 'm'
+static const char PASS_COLOR[] = {BRIGHT_GREEN, ' ', 'P', 'A', 'S', 'S',
+				  RESET_COLOR, 0};
+static const char ERROR_COLOR[] = {BRIGHT_YELLOW, 'E', 'R', 'R', 'O', 'R',
+				   RESET_COLOR, 0};
+static const char FAIL_COLOR[] = {BRIGHT_RED, ' ', 'F', 'A', 'I', 'L',
+				  RESET_COLOR, 0};
+static const char INFO_NORMAL[] = " INFO";
+static const char PASS_NORMAL[] = " PASS";
+static const char ERROR_NORMAL[] = "ERROR";
+static const char FAIL_NORMAL[] = " FAIL";
+const char *INFO = INFO_NORMAL;
+const char *PASS = PASS_NORMAL;
+const char *ERROR = ERROR_NORMAL;
+const char *FAIL = FAIL_NORMAL;
+
+/* Verbosity setting for INFO messages */
+#define VQUIET    0
+#define VCRITICAL 1
+#define VINFO     2
+#define VMAX      VINFO
+int _verbose = VCRITICAL;
+
+/* Functional test return codes */
+#define RET_PASS   0
+#define RET_ERROR -1
+#define RET_FAIL  -2
+
+/**
+ * log_color() - Use colored output for PASS, ERROR, and FAIL strings
+ * @use_color:	use color (1) or not (0)
+ */
+void log_color(int use_color)
+{
+	if (use_color) {
+		PASS = PASS_COLOR;
+		ERROR = ERROR_COLOR;
+		FAIL = FAIL_COLOR;
+	} else {
+		PASS = PASS_NORMAL;
+		ERROR = ERROR_NORMAL;
+		FAIL = FAIL_NORMAL;
+	}
+}
+
+/**
+ * log_verbosity() - Set verbosity of test output
+ * @level:	Verbosity level, clamped to the range [0, VMAX]
+ *
+ * Setting level to VINFO or above enables INFO messages; lower levels
+ * disable them.  ERROR and FAIL messages require at least VCRITICAL.
+ */
+void log_verbosity(int level)
+{
+	if (level > VMAX)
+		level = VMAX;
+	else if (level < 0)
+		level = 0;
+	_verbose = level;
+}
+
+/**
+ * print_result() - Print standard PASS | ERROR | FAIL results
+ * @ret:	the result to report: RET_PASS | RET_ERROR | RET_FAIL
+ *
+ * print_result() is primarily intended for functional tests.
+ */
+void print_result(const char *test_name, int ret)
+{
+	switch (ret) {
+	case RET_PASS:
+		ksft_test_result_pass("%s\n", test_name);
+		ksft_print_cnts();
+		return;
+	case RET_ERROR:
+		ksft_test_result_error("%s\n", test_name);
+		ksft_print_cnts();
+		return;
+	case RET_FAIL:
+		ksft_test_result_fail("%s\n", test_name);
+		ksft_print_cnts();
+		return;
+	}
+}
+
+/* log level macros */
+#define info(message, vargs...) \
+do { \
+	if (_verbose >= VINFO) \
+		fprintf(stderr, "\t%s: "message, INFO, ##vargs); \
+} while (0)
+
+#define error(message, err, args...) \
+do { \
+	if (_verbose >= VCRITICAL) {\
+		if (err) \
+			fprintf(stderr, "\t%s: %s: "message, \
+				ERROR, strerror(err), ##args); \
+		else \
+			fprintf(stderr, "\t%s: "message, ERROR, ##args); \
+	} \
+} while (0)
+
+#define fail(message, args...) \
+do { \
+	if (_verbose >= VCRITICAL) \
+		fprintf(stderr, "\t%s: "message, FAIL, ##args); \
+} while (0)
+
+#endif
diff --git a/tools/testing/selftests/dm-user/run.sh b/tools/testing/selftests/dm-user/run.sh
new file mode 100755
index 000000000000..2ed2581e4a57
--- /dev/null
+++ b/tools/testing/selftests/dm-user/run.sh
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright 2020 Palmer Dabbelt <palmerdabbelt@google.com>
+
+# Top-level run script for dm-user kernel self tests.  This just runs a bunch
+# of different tests back to back, relying on the kernel selftest infrastructure
+# to tease out the success/failure of each.  The tests all use the same global
+# directories and such, so it's not like there's a whole lot of isolation.
+#
+# The actual test code should be fairly portable, but the scripts that run it
+# aren't.  See the README for more information.
+
+# Runs various FIO scripts against an ext2-based filesystem backed by dm-user.
+if test -e /usr/bin/fio
+then
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-example
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-short
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   1
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   4
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  16
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  64
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 256
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   1 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w   4 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  16 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w  64 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 256 -b /dev/vdb
+
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-example
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-short
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   1
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   4
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  16
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  64
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w 256
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   1 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w   4 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  16 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w  64 -b /dev/vdb
+    ./harness-fio.sh      -s 3000000 -f fio-verify-1G.fio    -- ./daemon-parallel -w 256 -b /dev/vdb
+else
+    echo "Unable to find /usr/bin/fio"
+fi
+
+# Runs fsstress from xfstests against an ext2-based filesystem backed by
+# dm-user.
+if test -e /usr/xfstests/ltp/fsstress
+then
+    ./harness-fsstress.sh -s 3000000 -p   1 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p   4 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p  16 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p  64 -n 10000 -- ./daemon-example
+    ./harness-fsstress.sh -s 3000000 -p 256 -n 10000 -- ./daemon-example
+
+    ./harness-fsstress.sh -s 3000000 -p   1 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p   4 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p  16 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p  64 -n 10000 -- ./daemon-short
+    ./harness-fsstress.sh -s 3000000 -p 256 -n 10000 -- ./daemon-short
+
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    1
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    4
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   16
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   64
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w  256
+
+    ./harness-fsstress.sh -s 3000000 -p  1 -n 10000 -- ./daemon-parallel -w    1 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    1 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w    4 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   16 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w   64 -b /dev/vdb
+    ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w  256 -b /dev/vdb
+else
+    echo "Unable to find /usr/xfstests/ltp/fsstress"
+fi
-- 
2.29.2.454.gaff20da3a2-goog

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v1 5/5] MAINTAINERS: Add myself as a dm-user maintainer
  2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-03 21:58   ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-03 21:58 UTC (permalink / raw)
  To: dm-devel, agk, snitzer
  Cc: corbet, song, Palmer Dabbelt, shuah, linux-doc, linux-kernel,
	linux-raid, linux-kselftest, kernel-team

From: Palmer Dabbelt <palmerdabbelt@google.com>

I'm not sure this is big enough to warrant an entry in the MAINTAINERS file,
but I know it can be quite a bit of work to maintain something like this so I'm
happy to sign up if that helps.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2daa6ee673f7..ab9d7746cfb4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5198,6 +5198,13 @@ W:	http://sources.redhat.com/cluster/
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git
 F:	fs/dlm/
 
+DM USER (Block Device in Userspace)
+M:	Palmer Dabbelt <palmerdabbelt@google.com>
+S:	Maintained
+F:	include/linux/dm-user.h
+F:	drivers/md/dm-user.c
+F:	tools/testing/selftests/dm-user/
+
 DMA BUFFER SHARING FRAMEWORK
 M:	Sumit Semwal <sumit.semwal@linaro.org>
 M:	Christian König <christian.koenig@amd.com>
-- 
2.29.2.454.gaff20da3a2-goog



* Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-04 10:33   ` Christoph Hellwig
  -1 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2020-12-04 10:33 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: dm-devel, agk, snitzer, corbet, song, Palmer Dabbelt, shuah,
	linux-doc, linux-kernel, linux-raid, linux-kselftest,
	kernel-team

What is the advantage over simply using nbd?


* Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-04 10:33   ` [dm-devel] " Christoph Hellwig
@ 2020-12-07 18:55     ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-07 18:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dm-devel, agk, snitzer, corbet, song, shuah, linux-doc,
	linux-kernel, linux-raid, linux-kselftest, kernel-team

On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?

There's a short bit about that in the cover letter (and in some talks), but
I'll expand on it here -- I suppose my most important question is "is this
interesting enough to take upstream?", so there should be at least a bit of a
description of what it actually enables:

I don't think there's any deep fundamental advantages to doing this as opposed
to nbd/iscsi over localhost/unix (or by just writing a kernel implementation,
for that matter), at least in terms of anything that was previously impossible
now becoming possible.  There are a handful of things that are easier and/or
faster, though.

dm-user looks a lot like NBD without the networking.  The major difference is
which side initiates messages: in NBD the kernel initiates messages, while in
dm-user userspace initiates messages (via a read that will block if there is no
message, but presumably we'd want to add support for a non-blocking userspace
implementations eventually).  The NBD approach certainly makes sense for a
networked system, as one generally wants to have a single storage server
handling multiple clients, but inverting that makes some things simpler in
dm-user.  

One specific advantage of this change is that a dm-user target can be
transitioned from one daemon to another without any IO errors: just spin up the
second daemon, signal the first to stop requesting new messages, and let it
exit.  We're using that mechanism to replace the daemon launched by early init
(which runs before the security subsystem is up, as in our use case dm-user
provides the root filesystem) with one that's properly sandboxed (which can
only be launched after the root filesystem has come up).  There are ways around
this (replacing the DM table, for example), but they don't fit as cleanly.

Unless I'm missing something, NBD servers aren't capable of that style of
transition: soft disconnects can only be initiated by the client (the kernel,
in this case), which leaves no way for the server to transition while
guaranteeing that no IOs error out.  It's usually possible to shoehorn this
sort of direction reversing concept into network protocols, but it's also
usually ugly (I'm thinking of IDLE, for example).  I didn't try to actually do
it, but my guess would be that adding a way for the server to ask the client to
stop sending messages until a new server shows up would be at least as much
work as doing this.

There are also a handful of possible performance advantages, but I haven't gone
through the work to prove any of them out yet as performance isn't all that
important for our first use case.  For example:

* Cutting out the network stack is unlikely to hurt performance.  I'm not sure
  if it will help performance, though.  I think if we really had workload where
  the extra copy was likely to be an issue we'd want an explicit ring buffer,
  but I have a theory that it would be possible to get very good performance out
  of a stream-style API by using multiple channels and relying on io_uring to
  plumb through multiple ops per channel.
* There's a comment in the implementation about allowing userspace to insert
  itself into user_map(), likely by uploading a BPF fragment.  There's a whole
  class of interesting block devices that could be written in this fashion:
  essentially you keep a cache on a regular block device that handles the common
  cases by remapping BIOs and passing them along, relegating the more complicated
  logic to fetch cache misses and watching some subset of the access stream where
  necessary.

  We have a use case like this in Android, where we opportunistically store
  backups in a portion of the TRIM'd space on devices.  It's currently
  implemented entirely in kernel by the dm-bow target, but IIUC that was deemed
  too Android-specific to merge.  Assuming we could get good enough performance
  we could move that logic to userspace, which lets us shrink our diff with
  upstream.  It feels like some other interesting block devices could be
  written in a similar fashion.

All in all, I've found it a bit hard to figure out what sort of interest people
have in dm-user: when I bring this up I seem to run into people who've done
similar things before and are vaguely interested, but certainly nobody is
chomping at the bit.  I'm sending it out in this early state to try and figure
out if it's interesting enough to keep going.


* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-07 18:55     ` [dm-devel] " Palmer Dabbelt
@ 2020-12-10  3:38       ` Bart Van Assche
  -1 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2020-12-10  3:38 UTC (permalink / raw)
  To: Palmer Dabbelt, Christoph Hellwig
  Cc: snitzer, corbet, kernel-team, linux-doc, linux-kernel,
	linux-raid, song, dm-devel, linux-kselftest, shuah, agk,
	Josef Bacik, Mike Christie

On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> All in all, I've found it a bit hard to figure out what sort of interest
> people
> have in dm-user: when I bring this up I seem to run into people who've done
> similar things before and are vaguely interested, but certainly nobody is
> chomping at the bit.  I'm sending it out in this early state to try and
> figure
> out if it's interesting enough to keep going.

Cc-ing Josef and Mike since their nbd contributions make me wonder
whether this new driver could be useful to their use cases?

Thanks,

Bart.


* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-10  3:38       ` Bart Van Assche
@ 2020-12-10 17:03         ` Josef Bacik
  -1 siblings, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2020-12-10 17:03 UTC (permalink / raw)
  To: Bart Van Assche, Palmer Dabbelt, Christoph Hellwig
  Cc: snitzer, corbet, kernel-team, linux-doc, linux-kernel,
	linux-raid, song, dm-devel, linux-kselftest, shuah, agk,
	Mike Christie

On 12/9/20 10:38 PM, Bart Van Assche wrote:
> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>> All in all, I've found it a bit hard to figure out what sort of interest
>> people
>> have in dm-user: when I bring this up I seem to run into people who've done
>> similar things before and are vaguely interested, but certainly nobody is
>> chomping at the bit.  I'm sending it out in this early state to try and
>> figure
>> out if it's interesting enough to keep going.
> 
> Cc-ing Josef and Mike since their nbd contributions make me wonder
> whether this new driver could be useful to their use cases?
> 

Sorry gmail+imap sucks and I can't get my email client to get at the original 
thread.  However here is my take.

1) The advantages of using dm-user over NBD that you listed aren't actually 
problems for NBD.  We have NBD working in production where you can hand off the 
sockets for the server without ending in timeouts, it was actually the main 
reason we wrote our own server so we could use the FD transfer stuff to restart 
the server without impacting any clients that had the device in use.

2) The extra copy is a big deal, in fact we already have too many copies in our 
existing NBD setup and are actively looking for ways to avoid those.

Don't take this as I don't think dm-user is a good idea, but I think at the very 
least it should start with the very best we have to offer, starting with as few 
copies as possible.

If you are using it currently in production then cool, there's clearly a use case 
for it.  Personally as I get older and grouchier I want less things in the 
kernel, so if this enables us to eventually do everything NBD related in 
userspace with no performance drop then I'd be down.  I don't think you need to 
make that your primary goal, but at least polishing this up so it could 
potentially be abused in the future would make it more compelling for merging. 
Thanks,

Josef


* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-10 17:03         ` Josef Bacik
@ 2020-12-15  3:00           ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-15  3:00 UTC (permalink / raw)
  To: josef
  Cc: bvanassche, Christoph Hellwig, snitzer, corbet, kernel-team,
	linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie

On Thu, 10 Dec 2020 09:03:21 PST (-0800), josef@toxicpanda.com wrote:
> On 12/9/20 10:38 PM, Bart Van Assche wrote:
>> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>>> All in all, I've found it a bit hard to figure out what sort of interest
>>> people
>>> have in dm-user: when I bring this up I seem to run into people who've done
>>> similar things before and are vaguely interested, but certainly nobody is
>>> chomping at the bit.  I'm sending it out in this early state to try and
>>> figure
>>> out if it's interesting enough to keep going.
>>
>> Cc-ing Josef and Mike since their nbd contributions make me wonder
>> whether this new driver could be useful to their use cases?
>>
>
> Sorry gmail+imap sucks and I can't get my email client to get at the original
> thread.  However here is my take.

and I guess I then have to apologize for missing your email ;).  Hopefully that
was the problem, but who knows.

> 1) The advantages of using dm-user of NBD that you listed aren't actually
> problems for NBD.  We have NBD working in production where you can hand off the
> sockets for the server without ending in timeouts, it was actually the main
> reason we wrote our own server so we could use the FD transfer stuff to restart
> the server without impacting any clients that had the device in use.

OK.  So you just send the FD around using one of the standard mechanisms to
orchestrate the handoff?  I guess that might work for our use case, assuming
whatever the security side of things was doing was OK with the old FD.  TBH I'm
not sure how all that works and while we thought about doing that sort of
transfer scheme we decided to just open it again -- not sure how far we were
down the dm-user rabbit hole at that point, though, as this sort of arose out
of some other ideas.

> 2) The extra copy is a big deal, in fact we already have too many copies in our
> existing NBD setup and are actively looking for ways to avoid those.
>
> Don't take this as I don't think dm-user is a good idea, but I think at the very
> least it should start with the very best we have to offer, starting with as few
> copies as possible.

I was really expecting someone to say that.  It does seem kind of silly to build
out the new interface, but not go all the way to a ring buffer.  We just didn't
really have any way to justify the extra complexity as our use cases aren't
that high performance.   I kind of like to have benchmarks for this sort of
thing, though, and I didn't have anyone who had bothered avoiding the last copy
to compare against.
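[The ring-buffer design being alluded to is roughly a single-producer/single-consumer queue over a shared region, where requests are consumed in place instead of copied through a read()/write() stream.  A toy sketch of the index discipline only -- details invented, and the memory barriers a real shared-memory ring needs are deliberately omitted:

```python
# Toy single-producer/single-consumer ring: free-running head/tail indices
# masked by a power-of-two size.  A real implementation over shared memory
# would add publish barriers; this shows only the index discipline.
class SpscRing:
    def __init__(self, size=16):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.buf = [None] * size
        self.mask = size - 1
        self.head = 0   # next slot the producer fills
        self.tail = 0   # next slot the consumer drains

    def push(self, item):
        if self.head - self.tail == len(self.buf):
            return False                      # full: producer backs off
        self.buf[self.head & self.mask] = item
        self.head += 1                        # real code: barrier, then publish
        return True

    def pop(self):
        if self.tail == self.head:
            return None                       # empty
        item = self.buf[self.tail & self.mask]
        self.tail += 1
        return item

ring = SpscRing(4)
assert all(ring.push(i) for i in range(4))
assert not ring.push(99)                      # full
assert [ring.pop() for _ in range(4)] == [0, 1, 2, 3]
assert ring.pop() is None                     # empty again
```
]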

> If you are using it currently in production then cool, there's clearly a usecase
> for it.  Personally as I get older and grouchier I want less things in the
> kernel, so if this enables us to eventually do everything NBD related in
> userspace with no performance drop then I'd be down.  I don't think you need to
> make that your primary goal, but at least polishing this up so it could
> potentially be abused in the future would make it more compelling for merging.
> Thanks,

Ya, it's in Android already and we'll be shipping it as part of the new OTA
flow for the next release.  The rules on deprecation are a bit different over
there, though, so it's not like we're wed to it.  The whole point of bringing
this up here was to try and get something usable by everyone, and while I'd
eventually like to get whatever's in Android into the kernel proper we'd really
planned on supporting an extra Android-only ABI for a cycle at least.  

I'm kind of inclined to take a crack at the extra copy, to at least see if
building something that eliminates it is viable.  I'm not really sure if it is
(or at least, if it'll net us a meaningful amount of performance), but it'd at
least be interesting to try.

It'd be nice to have some benchmark target, though, as otherwise this stuff
hangs on forever.  My workloads are in selftests later on in the patch set, but
I'm essentially using tmpfs as a baseline to compare against ext4+dm-user with
some FIO examples as workloads.  Our early benchmark numbers indicated this was
way faster than we needed, so I didn't even bother putting together a proper
system to run on so I don't really have any meaningful numbers there.  Is there
an NBD server that's fast that I should be comparing against?
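[As a concrete sketch of the comparison described, the same fio job could be pointed at a tmpfs-backed baseline and at the ext4-on-dm-user stack.  All mount points, sizes, and job parameters below are invented for illustration, not taken from the actual selftests:

```shell
# Hypothetical comparison: identical fio jobs against a tmpfs baseline
# and against ext4 on top of dm-user.  Paths and sizes are illustrative.
fio --name=randread --rw=randread --bs=4k --size=256m \
    --runtime=30 --time_based --directory=/mnt/baseline-tmpfs

fio --name=randread --rw=randread --bs=4k --size=256m \
    --runtime=30 --time_based --directory=/mnt/ext4-dm-user
```
]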

I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
reasonable performance numbers before bothering with a v2.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-15  3:00           ` Palmer Dabbelt
  (?)
@ 2020-12-16  6:17           ` Akira Hayakawa
  2020-12-16 20:35               ` Palmer Dabbelt
  -1 siblings, 1 reply; 43+ messages in thread
From: Akira Hayakawa @ 2020-12-16  6:17 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Christoph Hellwig, bvanassche, Mike Snitzer, linux-doc,
	kernel-team, corbet, josef, linux-kernel, linux-raid, song,
	dm-devel, Alasdair G Kergon, linux-kselftest, shuah,
	michael.christie


Hi my name is Akira Hayakawa. I am maintaining an out-of-tree DM target
named dm-writeboost.

Sorry to step in. But this is a very interesting topic at least to me.

I have been looking for something like dm-user because I believe we should
be able to implement virtual block devices in Rust language.

I know proxying IO requests to userland always causes some overhead but for
some type of device that performance doesn't matter or some research
prototyping or pseudo device for testing, this way should be developed. Of
course, implementation in Rust will give us opportunities to develop more
complicated software in high quality.

I noticed this thread a few days ago and then started to prototype this library:
https://github.com/akiradeveloper/userland-io

It is what I want, but the transport is still NBD, which I don't like so
much. If dm-user becomes available, I will implement a transport using dm-user.

- Akira

On Tue, Dec 15, 2020 at 7:00 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:

> On Thu, 10 Dec 2020 09:03:21 PST (-0800), josef@toxicpanda.com wrote:
> > On 12/9/20 10:38 PM, Bart Van Assche wrote:
> >> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> >>> All in all, I've found it a bit hard to figure out what sort of
> interest
> >>> people
> >>> have in dm-user: when I bring this up I seem to run into people who've
> done
> >>> similar things before and are vaguely interested, but certainly nobody
> is
> >>> chomping at the bit.  I'm sending it out in this early state to try and
> >>> figure
> >>> out if it's interesting enough to keep going.
> >>
> >> Cc-ing Josef and Mike since their nbd contributions make me wonder
> >> whether this new driver could be useful to their use cases?
> >>
> >
> > Sorry gmail+imap sucks and I can't get my email client to get at the
> original
> > thread.  However here is my take.
>
> and I guess I then have to apoligize for missing your email ;).  Hopefully
> that
> was the problem, but who knows.
>
> > 1) The advantages of using dm-user of NBD that you listed aren't actually
> > problems for NBD.  We have NBD working in production where you can hand
> off the
> > sockets for the server without ending in timeouts, it was actually the
> main
> > reason we wrote our own server so we could use the FD transfer stuff to
> restart
> > the server without impacting any clients that had the device in use.
>
> OK.  So you just send the FD around using one of the standard mechanisms to
> orchestrate the handoff?  I guess that might work for our use case,
> assuming
> whatever the security side of things was doing was OK with the old FD.
> TBH I'm
> not sure how all that works and while we thought about doing that sort of
> transfer scheme we decided to just open it again -- not sure how far we
> were
> down the dm-user rabbit hole at that point, though, as this sort of arose
> out
> of some other ideas.
>
> > 2) The extra copy is a big deal, in fact we already have too many copies
> in our
> > existing NBD setup and are actively looking for ways to avoid those.
> >
> > Don't take this as I don't think dm-user is a good idea, but I think at
> the very
> > least it should start with the very best we have to offer, starting with
> as few
> > copies as possible.
>
> I was really experting someone to say that.  It does seem kind of silly to
> build
> out the new interface, but not go all the way to a ring buffer.  We just
> didn't
> really have any way to justify the extra complexity as our use cases aren't
> that high performance.   I kind of like to have benchmarks for this sort of
> thing, though, and I didn't have anyone who had bothered avoiding the last
> copy
> to compare against.
>
> > If you are using it currently in production then cool, there's clearly a
> usecase
> > for it.  Personally as I get older and grouchier I want less things in
> the
> > kernel, so if this enables us to eventually do everything NBD related in
> > userspace with no performance drop then I'd be down.  I don't think you
> need to
> > make that your primary goal, but at least polishing this up so it could
> > potentially be abused in the future would make it more compelling for
> merging.
> > Thanks,
>
> Ya, it's in Android already and we'll be shipping it as part of the new OTA
> flow for the next release.  The rules on deprecation are a bit different
> over
> there, though, so it's not like we're wed to it.  The whole point of
> bringing
> this up here was to try and get something usable by everyone, and while I'd
> eventually like to get whatever's in Android into the kernel proper we'd
> really
> planned on supporting an extra Android-only ABI for a cycle at least.
>
> I'm kind of inclined to take a crack at the extra copy, to at least see if
> building something that eliminates it is viable.  I'm not really sure if
> it is
> (or at least, if it'll net us a meaningful amount of performance), but
> it'd at
> least be interesting to try.
>
> It'd be nice to have some benchmark target, though, as otherwise this stuff
> hangs on forever.  My workloads are in selftests later on in the patch
> set, but
> I'm essentially using tmpfs as a baseline to compare against ext4+dm-user
> with
> some FIO examples as workloads.  Our early benchmark numbers indicated
> this was
> way faster than we needed, so I didn't even bother putting together a
> proper
> system to run on so I don't really have any meaningful numbers there.  Is
> there
> an NBD server that's fast that I should be comparing against?
>
> I haven't gotten a whole lot of feedback, so I'm inclined to at least have
> some
> reasonable performance numbers before bothering with a v2.
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel



-- 
Akira Hayakawa

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-15  3:00           ` Palmer Dabbelt
@ 2020-12-16 18:24             ` Vitaly Mayatskih
  -1 siblings, 0 replies; 43+ messages in thread
From: Vitaly Mayatskih @ 2020-12-16 18:24 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: josef, bvanassche, Christoph Hellwig, Mike Snitzer, corbet,
	kernel-team, linux-doc, linux-kernel, linux-raid, Song Liu,
	dm-devel, linux-kselftest, shuah, agk, michael.christie

On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:

> I was really experting someone to say that.  It does seem kind of silly to build
> out the new interface, but not go all the way to a ring buffer.  We just didn't
> really have any way to justify the extra complexity as our use cases aren't
> that high performance.   I kind of like to have benchmarks for this sort of
> thing, though, and I didn't have anyone who had bothered avoiding the last copy
> to compare against.

I worked on something very similar, though performance was one of the
goals. The implementation was built around lockless ring buffers, shared
memory for zerocopy, multiqueue, and error handling. It could be that every
disk storage vendor has to implement something like that in order to bridge
the Linux kernel to their own proprietary datapath running in userspace.

-- 
wbr, Vitaly

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-16  6:17           ` Akira Hayakawa
@ 2020-12-16 20:35               ` Palmer Dabbelt
  0 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-16 20:35 UTC (permalink / raw)
  To: ruby.wktk
  Cc: josef, linux-raid, bvanassche, snitzer, linux-doc, shuah, corbet,
	linux-kernel, Christoph Hellwig, song, dm-devel,
	michael.christie, linux-kselftest, kernel-team, agk

On Tue, 15 Dec 2020 22:17:06 PST (-0800), ruby.wktk@gmail.com wrote:
> Hi my name is Akira Hayakawa. I am maintaining an out-of-tree DM target
> named dm-writeboost.
>
> Sorry to step in. But this is a very interesting topic at least to me.
>
> I have been looking for something like dm-user because I believe we should
> be able to implement virtual block devices in Rust language.
>
> I know proxying IO requests to userland always causes some overhead but for
> some type of device that performance doesn't matter or some research
> prototyping or pseudo device for testing, this way should be developed. Of
> course, implementation in Rust will give us opportunities to develop more
> complicated software in high quality.
>
> I noticed this thread few days ago then I started to prototype this library
> https://github.com/akiradeveloper/userland-io
>
> It is what I want but the transport is still NBD which I don't like so
> much. If dm-user is available, I will implement a transport using dm-user.

Great, I'm glad to hear that.  Obviously this is still in the early days and
we're talking about high-level ABI design here, so things are almost certainly
going to change, but it's always good to have people pushing on stuff.

Just be warned: we've only had two people write userspaces for this (one of
which was me, and all that is test code) so I'd be shocked if you manage to
avoid running into bugs.

>
> - Akira
>
> On Tue, Dec 15, 2020 at 7:00 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
>> On Thu, 10 Dec 2020 09:03:21 PST (-0800), josef@toxicpanda.com wrote:
>> > On 12/9/20 10:38 PM, Bart Van Assche wrote:
>> >> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>> >>> All in all, I've found it a bit hard to figure out what sort of
>> interest
>> >>> people
>> >>> have in dm-user: when I bring this up I seem to run into people who've
>> done
>> >>> similar things before and are vaguely interested, but certainly nobody
>> is
>> >>> chomping at the bit.  I'm sending it out in this early state to try and
>> >>> figure
>> >>> out if it's interesting enough to keep going.
>> >>
>> >> Cc-ing Josef and Mike since their nbd contributions make me wonder
>> >> whether this new driver could be useful to their use cases?
>> >>
>> >
>> > Sorry gmail+imap sucks and I can't get my email client to get at the
>> original
>> > thread.  However here is my take.
>>
>> and I guess I then have to apoligize for missing your email ;).  Hopefully
>> that
>> was the problem, but who knows.
>>
>> > 1) The advantages of using dm-user of NBD that you listed aren't actually
>> > problems for NBD.  We have NBD working in production where you can hand
>> off the
>> > sockets for the server without ending in timeouts, it was actually the
>> main
>> > reason we wrote our own server so we could use the FD transfer stuff to
>> restart
>> > the server without impacting any clients that had the device in use.
>>
>> OK.  So you just send the FD around using one of the standard mechanisms to
>> orchestrate the handoff?  I guess that might work for our use case,
>> assuming
>> whatever the security side of things was doing was OK with the old FD.
>> TBH I'm
>> not sure how all that works and while we thought about doing that sort of
>> transfer scheme we decided to just open it again -- not sure how far we
>> were
>> down the dm-user rabbit hole at that point, though, as this sort of arose
>> out
>> of some other ideas.
>>
>> > 2) The extra copy is a big deal, in fact we already have too many copies
>> in our
>> > existing NBD setup and are actively looking for ways to avoid those.
>> >
>> > Don't take this as I don't think dm-user is a good idea, but I think at
>> the very
>> > least it should start with the very best we have to offer, starting with
>> as few
>> > copies as possible.
>>
>> I was really experting someone to say that.  It does seem kind of silly to
>> build
>> out the new interface, but not go all the way to a ring buffer.  We just
>> didn't
>> really have any way to justify the extra complexity as our use cases aren't
>> that high performance.   I kind of like to have benchmarks for this sort of
>> thing, though, and I didn't have anyone who had bothered avoiding the last
>> copy
>> to compare against.
>>
>> > If you are using it currently in production then cool, there's clearly a
>> usecase
>> > for it.  Personally as I get older and grouchier I want less things in
>> the
>> > kernel, so if this enables us to eventually do everything NBD related in
>> > userspace with no performance drop then I'd be down.  I don't think you
>> need to
>> > make that your primary goal, but at least polishing this up so it could
>> > potentially be abused in the future would make it more compelling for
>> merging.
>> > Thanks,
>>
>> Ya, it's in Android already and we'll be shipping it as part of the new OTA
>> flow for the next release.  The rules on deprecation are a bit different
>> over
>> there, though, so it's not like we're wed to it.  The whole point of
>> bringing
>> this up here was to try and get something usable by everyone, and while I'd
>> eventually like to get whatever's in Android into the kernel proper we'd
>> really
>> planned on supporting an extra Android-only ABI for a cycle at least.
>>
>> I'm kind of inclined to take a crack at the extra copy, to at least see if
>> building something that eliminates it is viable.  I'm not really sure if
>> it is
>> (or at least, if it'll net us a meaningful amount of performance), but
>> it'd at
>> least be interesting to try.
>>
>> It'd be nice to have some benchmark target, though, as otherwise this stuff
>> hangs on forever.  My workloads are in selftests later on in the patch
>> set, but
>> I'm essentially using tmpfs as a baseline to compare against ext4+dm-user
>> with
>> some FIO examples as workloads.  Our early benchmark numbers indicated
>> this was
>> way faster than we needed, so I didn't even bother putting together a
>> proper
>> system to run on so I don't really have any meaningful numbers there.  Is
>> there
>> an NBD server that's fast that I should be comparing against?
>>
>> I haven't gotten a whole lot of feedback, so I'm inclined to at least have
>> some
>> reasonable performance numbers before bothering with a v2.
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-16 18:24             ` Vitaly Mayatskih
@ 2020-12-17  6:55               ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-17  6:55 UTC (permalink / raw)
  To: v.mayatskih
  Cc: josef, bvanassche, Christoph Hellwig, snitzer, corbet,
	kernel-team, linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie

On Wed, 16 Dec 2020 10:24:59 PST (-0800), v.mayatskih@gmail.com wrote:
> On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
>> I was really experting someone to say that.  It does seem kind of silly to build
>> out the new interface, but not go all the way to a ring buffer.  We just didn't
>> really have any way to justify the extra complexity as our use cases aren't
>> that high performance.   I kind of like to have benchmarks for this sort of
>> thing, though, and I didn't have anyone who had bothered avoiding the last copy
>> to compare against.
>
> I worked on something very similar, though performance was one of the
> goals. The implementation was floating around lockless ring buffers,
> shared memory for zerocopy, multiqueue and error handling. It could be
> that every disk storage vendor has to implement something like that in
> order to bridge Linux kernel to their own proprietary datapath running
> in userspace.

OK, good to know.  That's kind of the feeling I'd gotten from having chatted to
a handful of people about this, but I don't remember people having actually
gotten all the way to zero-copy.  That's how we managed to end up at this
middle-ground ABI style: I thought people were, in practice, punting on
zero copy because the complexity just wasn't worth the performance benefit.
Maybe I'd just been colored by how my projects ended up going, but I've ended
up designing complicated interfaces in the past that allow for zero-copy only
to never get around to actually making that work.  I don't know if that's just
because I've had the good fortune to avoid working on anything that ended up
with users, though :).

For our use case I think we actually get better performance out of the
copy-based (and probably more importantly kmalloc-based, but that's an
implementation thing not an ABI thing) approach: essentially we're very
sensitive to memory pressure and expect this first dm-user daemon to mostly be
idle, so we're really worried about avoiding excess memory usage while idle and
less worried about throughput when active.  This stream-based interface means
that userspace doesn't need much memory allocated to service a request, which
helps with sleep/wake latencies and/or idle memory usage.  That's also why we
have the simple locking scheme: no sense splitting locks if there's no
contention, and we only need a single thread to saturate the storage bandwidth
on these phones.

That said, it does sound like people really do care about the sort of
performance levels where zero copy is relevant in this space.  I'll take a shot
at something along those lines, and while it will add a degree of userspace
complexity I'm not sure it'll add much in the way of kernel complexity -- at
least compared to a fast version of this, where we'd need most of that stuff
anyway (obviously the malloc+single lock design is simple, but probably
wouldn't stick around for long).  At a bare minimum it'll be interesting to
play around with, but if people are doing it in practice then I'm more
confident that I can put something together that at least serves as a starting
point for further discussion.

I haven't gotten around to writing any code yet, but I had spent a bit of time
thinking about how to put this zero-copy version together and am leaning
towards it being a standalone block device (as opposed to a DM target).  I'd
avoided that before as I didn't want to mess around with my own device control
scheme so I'll still try to do the DM thing, but I'm not sure it'll be viable.
That's all speculation now, but it does bring up one interesting question:

IIUC, this version of dm-user handles BIOs before they reach the block
scheduler while a standalone driver would likely handle them after blk-mq.  I
don't have direct experience with this, but the last time I ran into people who
had these sorts of performance requirements for userspace drivers they weren't
actually trying to write userspace drivers but were instead trying to write a
userspace scheduler, with the userspace drivers just being the mechanism to
implement that scheduler.  This was a decade ago and I'm not sure that's what
people are trying to do in the new blk-mq world, but if it is then it's going
to be a major design consideration.  I'm also not entirely sure that we're
really solving the same problem at that point.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-15  3:00           ` Palmer Dabbelt
@ 2020-12-22 13:32             ` Christoph Hellwig
  -1 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2020-12-22 13:32 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: josef, bvanassche, Christoph Hellwig, snitzer, corbet,
	kernel-team, linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie

On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> reasonable performance numbers before bothering with a v2.

FYI, my other main worry besides duplicating nbd is that device mapper
really is a stacked interface that sits on top of other block devices.
Turning this into something else that just pipes data to userspace
seems very strange.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-22 13:32             ` Christoph Hellwig
@ 2020-12-22 14:36               ` Mike Snitzer
  -1 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2020-12-22 14:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Palmer Dabbelt, josef, bvanassche, corbet, kernel-team,
	linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie

On Tue, Dec 22 2020 at  8:32am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> > reasonable performance numbers before bothering with a v2.
> 
> FYI, my other main worry besides duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block devices.
> Turning this into something else that just pipes data to userspace
> seems very strange.

I agree.  The only way I'd be interested is if it somehow tackled enabling
much more efficient IO.  Earlier discussion in this thread mentioned
that zero-copy and low overhead weren't a priority (because they are hard,
etc.).  But the hard work has already been done with io_uring.  If
dm-user had a prereq of leaning heavily on io_uring and also enabled IO
polling for bio-based devices then there may be a win to supporting it.

But unless lower latency (or some other more significant win) is made
possible I just don't care to prop up an unnatural DM bolt-on.

Mike


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-22 13:32             ` Christoph Hellwig
@ 2020-12-22 20:31               ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-22 20:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: josef, bvanassche, Christoph Hellwig, snitzer, corbet,
	kernel-team, linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie

On Tue, 22 Dec 2020 05:32:46 PST (-0800), Christoph Hellwig wrote:
> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> reasonable performance numbers before bothering with a v2.
>
> FYI, my other main worry besides duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block devices.
> Turning this into something else that just pipes data to userspace
> seems very strange.

Agreed.  It certainly doesn't fit the DM model.  We'd considered doing a non-DM
version of this (maybe "ubd"), but decided to stick with dm-user because we
didn't want to duplicate all the device creation stuff that DM provides.  A
simple version of that wouldn't be that hard to do, but the DM version has a
lot of features and we get that all for free.  We essentially decided to run
with DM until it gets in the way, and the only sticking point we ended up with
was that REQUEUE stuff (though not sure how that would show up with a bare
block device) and that scheduler question.

I'm going to stick with DM for now, unless it gets in the way, to avoid coming
up with a device creation scheme myself.  In the long term it's probably best
to have this be a standalone thing, but I don't want to dump a bunch of time
into putting that stuff together only to find that this isn't interesting
enough from a performance perspective to stick around.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-22 14:36               ` [dm-devel] " Mike Snitzer
@ 2020-12-22 20:38                 ` Palmer Dabbelt
  -1 siblings, 0 replies; 43+ messages in thread
From: Palmer Dabbelt @ 2020-12-22 20:38 UTC (permalink / raw)
  To: snitzer
  Cc: Christoph Hellwig, josef, bvanassche, corbet, kernel-team,
	linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie

On Tue, 22 Dec 2020 06:36:16 PST (-0800), snitzer@redhat.com wrote:
> On Tue, Dec 22 2020 at  8:32am -0500,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> > reasonable performance numbers before bothering with a v2.
>>
>> FYI, my other main worry besides duplicating nbd is that device mapper
>> really is a stacked interface that sits on top of other block devices.
>> Turning this into something else that just pipes data to userspace
>> seems very strange.
>
> I agree.  Only way I'd be interested is if it somehow tackled enabling
> much more efficient IO.  Earlier discussion in this thread mentioned
> that zero-copy and low overhead wasn't a priority (because it is hard,
> etc).  But the hard work has already been done with io_uring.  If
> dm-user had a prereq of leaning heavily on io_uring and also enabled IO
> polling for bio-based then there may be a win to supporting it.
>
> But unless lower latency (or some other more significant win) is made
> possible I just don't care to prop up an unnatural DM bolt-on.

I don't remember if I mentioned this in the thread, but it was definitely in
the Plumbers talk: I'd had the general idea bouncing around that it would
be possible to write a high-performance version of this using an interface
similar to the one provided here while relying on io_uring for the
high-performance userspace.  That definitely won't work with exactly the
current interface, but my hope was to avoid writing my own high-performance
ring buffer.  My worry was that it'll be too tricky to map this all to
zero-copy, and I guess I forgot about it.

Now that you bring it up, it certainly seems worth taking a shot at.  We'd
essentially have the best of both worlds: userspace implementations that want
to be simple could just use read()/write(), while those that want to be higher
performance could have their implicit ring buffer.

I'm currently trying to put together a benchmarking setup that is of sufficient
fidelity that I would believe the numbers, which is really why I don't have any
performance numbers yet (no sense posting numbers I would shoot down :)).  I'll
try to remember to take a shot at an io_uring based userspace (probably with
some dm-user interface modifications) to see how it feels.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-22 20:38                 ` [dm-devel] " Palmer Dabbelt
@ 2020-12-23  7:48                   ` Christoph Hellwig
  -1 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2020-12-23  7:48 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: snitzer, Christoph Hellwig, josef, bvanassche, corbet,
	kernel-team, linux-doc, linux-kernel, linux-raid, song, dm-devel,
	linux-kselftest, shuah, agk, michael.christie


FYI, a few years ago I spent some time helping a customer to prepare
their block device in userspace using fuse code for upstreaming, but
at some point they abandoned the project.  But if for some reason we
don't want to use nbd I think a driver using the fuse infrastructure
would be the next logical choice.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-23  7:48                   ` [dm-devel] " Christoph Hellwig
@ 2020-12-23 16:59                     ` Bart Van Assche
  -1 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2020-12-23 16:59 UTC (permalink / raw)
  To: Christoph Hellwig, Palmer Dabbelt
  Cc: snitzer, josef, corbet, kernel-team, linux-doc, linux-kernel,
	linux-raid, song, dm-devel, linux-kselftest, shuah, agk,
	michael.christie

On 12/22/20 11:48 PM, Christoph Hellwig wrote:
> FYI, a few years ago I spent some time helping a customer to prepare
> their block device in userspace using fuse code for upstreaming, but
> at some point they abandoned the project.  But if for some reason we
> don't want to use nbd I think a driver using the fuse infrastructure
> would be the next logical choice.

Hi Christoph,

Thanks for having shared this information. Since I'm not familiar with the
FUSE code: does this mean translating block device accesses into FUSE_READ
and FUSE_WRITE messages? Does the FUSE kernel code only support exchanging
such messages between kernel and user space via the read() and write()
system calls? I'm asking this since there is already an interface in the
Linux kernel for implementing block devices in user space that uses another
approach, namely a ring buffer for messages and data that is shared between
kernel and user space (documented in Documentation/target/tcmu-design.rst).
Is one system call per read and per write operation fast enough for all
block-device-in-user-space implementations?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-16 20:35               ` Palmer Dabbelt
  (?)
@ 2020-12-29 12:52               ` Akira Hayakawa
  -1 siblings, 0 replies; 43+ messages in thread
From: Akira Hayakawa @ 2020-12-29 12:52 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Christoph Hellwig, bvanassche, Mike Snitzer, linux-doc,
	kernel-team, corbet, Josef Bacik, linux-kernel, linux-raid, song,
	dm-devel, Alasdair G Kergon, linux-kselftest, shuah,
	michael.christie



Since this is an interesting topic to me, as a holiday project I started
a new kernel module.

There was a similar project named ABUSE, back in 2009. Unfortunately, it
was rejected.
https://lwn.net/Articles/343514/

The kernel module’s mechanism is very simple:
1. When client code sends a request to /dev/abuseX it is pushed to the
internal queue.
2. When some requests are queued, the userland storage engine is woken up
(usually it waits via epoll or some variant) and pulls one request from the
queue via ioctl (ABUSE_GET_REQ).
3. After processing, it pushes back a completion via ioctl (ABUSE_PUT_REQ).

The original code is bio-based but naota modified it to request-based in
2015.
https://github.com/naota/abuse-kmod

This is my starting point.

The problem I see is that the 2015 version copies bvec pages across the
user-kernel space boundary. I think this is no good because we end up
copying 2x the amount of IO data.

The solution I am trying is to mmap the bvec pages into user space.
Instead of copying the bvec pages, the kernel passes only the physical
page address to the userland, and the userland engine then mmaps the page
into its own address space. I know mmap has some overhead because it is a
system call, but the rationale behind this is that mmap is better than
copy. Since the kernel address space cannot be accessed directly from the
userland, unlike the zero-copy Vitaly mentions above, I believe some trick
like mmaping is inevitable. If there is an overhead-free way to access
bvec pages from the userland, I would like to know about it.

Here is the kernel code; the userland side is written in Rust. It has
already passed a badblocks test.
https://github.com/akiradeveloper/userland-io/blob/20201229/abuse-kmod/src/abuse.c

- Akira

On Thu, Dec 17, 2020 at 5:35 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:

> On Tue, 15 Dec 2020 22:17:06 PST (-0800), ruby.wktk@gmail.com wrote:
> > Hi my name is Akira Hayakawa. I am maintaining an out-of-tree DM target
> > named dm-writeboost.
> >
> > Sorry to step in. But this is a very interesting topic at least to me.
> >
> > I have been looking for something like dm-user because I believe we
> should
> > be able to implement virtual block devices in Rust language.
> >
> > I know proxying IO requests to userland always causes some overhead,
> > but for some types of device where performance doesn't matter, or for
> > research prototyping or pseudo devices for testing, this approach
> > should be developed.  Of course, implementation in Rust will give us
> > opportunities to develop more complicated software of high quality.
> >
> > I noticed this thread a few days ago, then I started to prototype this
> > library: https://github.com/akiradeveloper/userland-io
> >
> > It is what I want, but the transport is still NBD, which I don't like
> > so much.  If dm-user is available, I will implement a transport using
> > dm-user.
>
> Great, I'm glad to hear that.  Obviously this is still in the early days
> and we're talking about high-level ABI design here, so things are almost
> certainly going to change, but it's always good to have people pushing on
> stuff.
>
> Just be warned: we've only had two people write userspaces for this (one of
> which was me, and all that is test code) so I'd be shocked if you manage to
> avoid running into bugs.
>
> >
> > - Akira
> >
> > On Tue, Dec 15, 2020 at 7:00 PM Palmer Dabbelt <palmer@dabbelt.com>
> wrote:
> >
> >> On Thu, 10 Dec 2020 09:03:21 PST (-0800), josef@toxicpanda.com wrote:
> >> > On 12/9/20 10:38 PM, Bart Van Assche wrote:
> >> >> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> >> >>> All in all, I've found it a bit hard to figure out what sort of
> >> >>> interest people have in dm-user: when I bring this up I seem to
> >> >>> run into people who've done similar things before and are vaguely
> >> >>> interested, but certainly nobody is chomping at the bit.  I'm
> >> >>> sending it out in this early state to try and figure out if it's
> >> >>> interesting enough to keep going.
> >> >>
> >> >> Cc-ing Josef and Mike since their nbd contributions make me wonder
> >> >> whether this new driver could be useful to their use cases?
> >> >>
> >> >
> >> > Sorry, gmail+imap sucks and I can't get my email client to get at
> >> > the original thread.  However, here is my take.
> >>
> >> and I guess I then have to apologize for missing your email ;).
> >> Hopefully that was the problem, but who knows.
> >>
> >> > 1) The advantages of using dm-user over NBD that you listed aren't
> >> > actually problems for NBD.  We have NBD working in production where
> >> > you can hand off the sockets for the server without ending in
> >> > timeouts; it was actually the main reason we wrote our own server,
> >> > so we could use the FD transfer stuff to restart the server without
> >> > impacting any clients that had the device in use.
> >>
> >> OK.  So you just send the FD around using one of the standard
> >> mechanisms to orchestrate the handoff?  I guess that might work for
> >> our use case, assuming whatever the security side of things was doing
> >> was OK with the old FD.  TBH I'm not sure how all that works, and
> >> while we thought about doing that sort of transfer scheme we decided
> >> to just open it again -- not sure how far we were down the dm-user
> >> rabbit hole at that point, though, as this sort of arose out of some
> >> other ideas.
> >>
> >> > 2) The extra copy is a big deal; in fact we already have too many
> >> > copies in our existing NBD setup and are actively looking for ways
> >> > to avoid those.
> >> >
> >> > Don't take this as I don't think dm-user is a good idea, but I
> >> > think at the very least it should start with the very best we have
> >> > to offer, starting with as few copies as possible.
> >>
> >> I was really expecting someone to say that.  It does seem kind of
> >> silly to build out the new interface but not go all the way to a ring
> >> buffer.  We just didn't really have any way to justify the extra
> >> complexity, as our use cases aren't that high performance.  I kind of
> >> like to have benchmarks for this sort of thing, though, and I didn't
> >> have anyone who had bothered avoiding the last copy to compare
> >> against.
> >>
> >> > If you are using it currently in production then cool, there's
> >> > clearly a use case for it.  Personally, as I get older and
> >> > grouchier I want fewer things in the kernel, so if this enables us
> >> > to eventually do everything NBD-related in userspace with no
> >> > performance drop then I'd be down.  I don't think you need to make
> >> > that your primary goal, but at least polishing this up so it could
> >> > potentially be abused in the future would make it more compelling
> >> > for merging.
> >> > Thanks,
> >>
> >> Ya, it's in Android already and we'll be shipping it as part of the
> >> new OTA flow for the next release.  The rules on deprecation are a
> >> bit different over there, though, so it's not like we're wed to it.
> >> The whole point of bringing this up here was to try and get something
> >> usable by everyone, and while I'd eventually like to get whatever's
> >> in Android into the kernel proper, we'd really planned on supporting
> >> an extra Android-only ABI for a cycle at least.
> >>
> >> I'm kind of inclined to take a crack at the extra copy, to at least
> >> see if building something that eliminates it is viable.  I'm not
> >> really sure if it is (or at least, if it'll net us a meaningful
> >> amount of performance), but it'd at least be interesting to try.
> >>
> >> It'd be nice to have some benchmark target, though, as otherwise
> >> this stuff hangs on forever.  My workloads are in selftests later on
> >> in the patch set, but I'm essentially using tmpfs as a baseline to
> >> compare against ext4+dm-user with some FIO examples as workloads.
> >> Our early benchmark numbers indicated this was way faster than we
> >> needed, so I didn't even bother putting together a proper system to
> >> run on, and so I don't really have any meaningful numbers there.  Is
> >> there an NBD server that's fast that I should be comparing against?
> >>
> >> I haven't gotten a whole lot of feedback, so I'm inclined to at
> >> least have some reasonable performance numbers before bothering with
> >> a v2.
> >>
> >> --
> >> dm-devel mailing list
> >> dm-devel@redhat.com
> >> https://www.redhat.com/mailman/listinfo/dm-devel
>


-- 
Akira Hayakawa



* Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
  2020-12-10  3:38       ` Bart Van Assche
@ 2021-01-05 16:16       ` Desnoyers, Peter
  -1 siblings, 0 replies; 43+ messages in thread
From: Desnoyers, Peter @ 2021-01-05 16:16 UTC (permalink / raw)
  To: Bart Van Assche, Palmer Dabbelt, Christoph Hellwig
  Cc: snitzer, corbet, kernel-team, linux-doc, Josef Bacik,
	linux-kernel, linux-raid, song, dm-devel, agk, linux-kselftest,
	shuah, Mike Christie


I seem to have missed this during the OSDI and end-of-term rush. I've implemented a couple of user-space block device components for research projects, so I thought I'd weigh in.

In particular, does this offer significant advantages over TCMU? It uses the UIO framework for a ring of SCSI requests to user space, and seems to be quite efficient. The differences seem to be:

  *   dm-user sits at the bio level, while TCMU is a SCSI device, so with TCMU (as with NBD) you get request merging. Depending on your purposes, that either does no harm or is really useful.
  *   TCMU requires either accepting the entire tcmu-runner architecture and philosophy, which may be a no-go, or writing your own SCSI implementation.

(I'm actually really bummed that I didn't discover TCMU until recently, as I think we would have used it on a few research projects in the past.)

I have no idea whether you can hand things off to another process - I don't think the UIO framework would have any problems with that, so it probably comes down to whether the target_core_user module would allow its memory region to be mapped by multiple processes.


I've got the world's quickest and dirtiest SCSI implementation that I've hacked up to use with TCMU, although the only real experimentation I've done is by hacking tcmu-runner. I'd definitely be interested in talking to any people who are interested in this, to develop a proper SCSI library that would allow you to use TCMU outside of tcmu-runner.

________________________________
From: Bart Van Assche <bvanassche@acm.org>
Sent: Wednesday, December 9, 2020 10:38 PM
To: Palmer Dabbelt <palmer@dabbelt.com>; Christoph Hellwig <hch@infradead.org>
Cc: snitzer@redhat.com <snitzer@redhat.com>; linux-doc@vger.kernel.org <linux-doc@vger.kernel.org>; shuah@kernel.org <shuah@kernel.org>; corbet@lwn.net <corbet@lwn.net>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; Josef Bacik <josef@toxicpanda.com>; linux-raid@vger.kernel.org <linux-raid@vger.kernel.org>; song@kernel.org <song@kernel.org>; dm-devel@redhat.com <dm-devel@redhat.com>; Mike Christie <michael.christie@oracle.com>; linux-kselftest@vger.kernel.org <linux-kselftest@vger.kernel.org>; kernel-team@android.com <kernel-team@android.com>; agk@redhat.com <agk@redhat.com>
Subject: Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> All in all, I've found it a bit hard to figure out what sort of interest
> people
> have in dm-user: when I bring this up I seem to run into people who've done
> similar things before and are vaguely interested, but certainly nobody is
> chomping at the bit.  I'm sending it out in this early state to try and
> figure
> out if it's interesting enough to keep going.

Cc-ing Josef and Mike since their nbd contributions make me wonder
whether this new driver could be useful to their use cases?

Thanks,

Bart.





end of thread, other threads:[~2021-01-05 18:56 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-03 21:58 [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace Palmer Dabbelt
2020-12-03 21:58 ` [dm-devel] " Palmer Dabbelt
2020-12-03 21:58 ` [PATCH v1 1/5] Documentation: Describe dm-user Palmer Dabbelt
2020-12-03 21:58   ` [dm-devel] " Palmer Dabbelt
2020-12-03 21:58 ` [PATCH v1 2/5] uapi: Add dm-user structure definition Palmer Dabbelt
2020-12-03 21:58   ` [dm-devel] " Palmer Dabbelt
2020-12-03 21:58 ` [PATCH v1 3/5] dm: dm-user: New target that proxies BIOs to userspace Palmer Dabbelt
2020-12-03 21:58   ` [dm-devel] " Palmer Dabbelt
2020-12-03 21:58 ` [PATCH v1 4/5] selftests/dm-user: A handful of tests for dm-user Palmer Dabbelt
2020-12-03 21:58   ` [dm-devel] " Palmer Dabbelt
2020-12-03 21:58 ` [PATCH v1 5/5] MAINTAINERS: Add myself as a dm-user maintainer Palmer Dabbelt
2020-12-03 21:58   ` [dm-devel] " Palmer Dabbelt
2020-12-04 10:33 ` [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace Christoph Hellwig
2020-12-04 10:33   ` [dm-devel] " Christoph Hellwig
2020-12-07 18:55   ` Palmer Dabbelt
2020-12-07 18:55     ` [dm-devel] " Palmer Dabbelt
2020-12-10  3:38     ` Bart Van Assche
2020-12-10  3:38       ` Bart Van Assche
2020-12-10 17:03       ` Josef Bacik
2020-12-10 17:03         ` Josef Bacik
2020-12-15  3:00         ` Palmer Dabbelt
2020-12-15  3:00           ` Palmer Dabbelt
2020-12-16  6:17           ` Akira Hayakawa
2020-12-16 20:35             ` Palmer Dabbelt
2020-12-16 20:35               ` Palmer Dabbelt
2020-12-29 12:52               ` Akira Hayakawa
2020-12-16 18:24           ` Vitaly Mayatskih
2020-12-16 18:24             ` Vitaly Mayatskih
2020-12-17  6:55             ` Palmer Dabbelt
2020-12-17  6:55               ` Palmer Dabbelt
2020-12-22 13:32           ` Christoph Hellwig
2020-12-22 13:32             ` Christoph Hellwig
2020-12-22 14:36             ` Mike Snitzer
2020-12-22 14:36               ` [dm-devel] " Mike Snitzer
2020-12-22 20:38               ` Palmer Dabbelt
2020-12-22 20:38                 ` [dm-devel] " Palmer Dabbelt
2020-12-23  7:48                 ` Christoph Hellwig
2020-12-23  7:48                   ` [dm-devel] " Christoph Hellwig
2020-12-23 16:59                   ` Bart Van Assche
2020-12-23 16:59                     ` [dm-devel] " Bart Van Assche
2020-12-22 20:31             ` Palmer Dabbelt
2020-12-22 20:31               ` Palmer Dabbelt
2021-01-05 16:16       ` Desnoyers, Peter
