linux-kernel.vger.kernel.org archive mirror
* [PATCH v11 0/8] Use copy_process in vhost layer
@ 2023-02-02 23:25 Mike Christie
  2023-02-02 23:25 ` [PATCH v11 1/8] fork: Make IO worker options flag based Mike Christie
                   ` (8 more replies)
  0 siblings, 9 replies; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

The following patches were made over Linus's tree. They allow the vhost
layer to use copy_process instead of using workqueue_structs to create
worker threads for VM's devices.

Eric, the vhost maintainer, Michael Tsirkin, has ACKed the patches, so
we are just waiting on you. I haven't gotten any more comments after several
postings, and the last reply from you was a year ago, on Jan 8th *2022*.
Are you ok with these patches, and can we merge them?

Details:
Qemu will create vhost devices in the kernel which perform network or SCSI
IO and management operations from worker threads created with the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The patches allow the vhost layer to do a copy_process from the thread that
does the VHOST_SET_OWNER ioctl, similar to how io_uring does a copy_process
against its userspace thread. This allows the vhost layer's worker threads to
inherit cgroups, namespaces, address space, etc. The worker thread is also
accounted against the owner/parent process's RLIMIT_NPROC limit, which
prevents malicious users from creating VMs with almost unlimited threads
when these patches are used:

https://lore.kernel.org/all/20211207025117.23551-1-michael.christie@oracle.com/

which allows us to create a worker thread per N virtqueues.
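
For reference, this is roughly the pattern the vhost layer is stuck with today
when using the kthread API (a simplified sketch, not the actual driver code;
in the real driver the cgroup attach is done from a work item run on the
worker thread, and error handling is omitted here):

    /* owner context, i.e. the thread doing the VHOST_SET_OWNER ioctl */
    worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
    wake_up_process(worker);

    /* worker context; 'owner' is the task_struct of the Qemu thread */
    kthread_use_mm(dev->mm);                  /* borrow the Qemu thread's mm */
    cgroup_attach_task_all(owner, current);   /* join the Qemu thread's cgroups */
    /* ... run queued work ... */
    kthread_unuse_mm(dev->mm);

With copy_process run from the owner thread, the worker starts out with the
owner's mm, cgroups and namespaces, so none of the above is needed, and the
thread is charged against the owner's RLIMIT_NPROC.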

V11:
- Rebase.
V10:
- Eric's cleanup patches and my vhost flush cleanup patches are merged
upstream, so rebase against Linus's tree which has everything.
V9:
- Rebase against Eric's kthread-cleanups-for-v5.19 branch. Drop patches
no longer needed due to kernel clone arg and pf io worker patches in that
branch.
V8:
- Fix kzalloc GFP use.
- Fix email subject version number.
V7:
- Drop generic user_worker_* helpers and replace with vhost_task specific
  ones.
- Drop autoreap patch. Use kernel_wait4 instead.
- Fix issue where vhost.ko could be removed while the worker function is
  still running.
V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
  has all deps merged. Also add patch to remove create_io_thread code as
  it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
  case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
  thread creation for the wake up, vm and signal cases.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
  added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
  had that flag set. Also dropped patches that passed worker flags to
  copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
  where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.






* [PATCH v11 1/8] fork: Make IO worker options flag based
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-03  0:14   ` Linus Torvalds
  2023-02-02 23:25 ` [PATCH v11 2/8] fork/vm: Move common PF_IO_WORKER behavior to new flag Mike Christie
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

This patchset adds a couple of new options to kernel_clone_args for the vhost
layer, which is going to work like PF_IO_WORKER but will differ enough that
we will need to add several fields to kernel_clone_args. This patch moves
us to a flags-based approach for these types of users.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Suggested-by: Christian Brauner <brauner@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/sched/task.h | 4 +++-
 kernel/fork.c              | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..a759ce5aa603 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
+#define USER_WORKER_IO		BIT(0)
+
 struct kernel_clone_args {
 	u64 flags;
+	u32 worker_flags;
 	int __user *pidfd;
 	int __user *child_tid;
 	int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
-	int io_thread;
 	int kthread;
 	int idle;
 	int (*fn)(void *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..b030aefba26c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2100,7 +2100,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->io_thread) {
+	if (args->worker_flags & USER_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
@@ -2623,7 +2623,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.fn		= fn,
 		.fn_arg		= arg,
-		.io_thread	= 1,
+		.worker_flags	= USER_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1



* [PATCH v11 2/8] fork/vm: Move common PF_IO_WORKER behavior to new flag
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
  2023-02-02 23:25 ` [PATCH v11 1/8] fork: Make IO worker options flag based Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-02 23:25 ` [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files Mike Christie
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie

This adds a new flag, PF_USER_WORKER, that's used for behavior common to
both PF_IO_WORKER and users like vhost, which will use a new helper
instead of create_io_thread because they require different behavior for
operations like signal handling.

The common behavior PF_USER_WORKER covers is the vm reclaim handling.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/sched.h      | 2 +-
 include/linux/sched/task.h | 3 ++-
 kernel/fork.c              | 4 ++++
 mm/vmscan.c                | 4 ++--
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..2ca9269332c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1723,7 +1723,7 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC		0x00000800	/* Allocating memory */
 #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
-#define PF__HOLE__00004000	0x00004000
+#define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
 #define PF__HOLE__00010000	0x00010000
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a759ce5aa603..dfc585e0373c 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,7 +18,8 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
-#define USER_WORKER_IO		BIT(0)
+#define USER_WORKER		BIT(0)
+#define USER_WORKER_IO		BIT(1)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index b030aefba26c..77d2c527e917 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2100,6 +2100,10 @@ static __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
+
+	if (args->worker_flags & USER_WORKER)
+		p->flags |= PF_USER_WORKER;
+
 	if (args->worker_flags & USER_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd6637fcd8f9..54de4adb91cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1141,12 +1141,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * Do not throttle user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_IO_WORKER|PF_KTHREAD)) {
+	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
 		cond_resched();
 		return;
 	}
-- 
2.25.1



* [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
  2023-02-02 23:25 ` [PATCH v11 1/8] fork: Make IO worker options flag based Mike Christie
  2023-02-02 23:25 ` [PATCH v11 2/8] fork/vm: Move common PF_IO_WORKER behavior to new flag Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-03  0:16   ` Linus Torvalds
  2023-02-02 23:25 ` [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals Mike Christie
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it creates a thread using a helper based on
copy_process we can't dup or clone the parent's files/FDs because that
would take an extra reference on ourselves: the worker's files would pin
the device's fd open while the device waits for the worker to exit.

Later, when we do:

Qemu process exits:
        do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or files_struct.

This patch adds a no_files option so these worker threads can avoid
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index dfc585e0373c..18e614591c24 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;
 
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
+#define USER_WORKER_NO_FILES	BIT(2)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 77d2c527e917..bb98b48bc35c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1624,7 +1624,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+		      int no_files)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1636,6 +1637,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldf)
 		goto out;
 
+	if (no_files) {
+		tsk->files = NULL;
+		goto out;
+	}
+
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2255,7 +2261,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p);
+	retval = copy_files(clone_flags, p,
+			    args->worker_flags & USER_WORKER_NO_FILES);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
-- 
2.25.1



* [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
                   ` (2 preceding siblings ...)
  2023-02-02 23:25 ` [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-03  0:19   ` Linus Torvalds
  2023-02-02 23:25 ` [PATCH v11 5/8] fork: allow kernel code to call copy_process Mike Christie
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

From: Christian Brauner <brauner@kernel.org>

Since:

commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")

kthreads have been ignoring signals by default, and the vhost layer has
never had a need to change that. This patch adds an option flag,
USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand()
and copy_signals() so vhost_tasks added in the next patches can continue
to ignore signals.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/sched/task.h | 1 +
 kernel/fork.c              | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 18e614591c24..ce6240a006cf 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
 #define USER_WORKER_NO_FILES	BIT(2)
+#define USER_WORKER_SIG_IGN	BIT(3)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index bb98b48bc35c..55c77de45271 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2287,6 +2287,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
+	if (args->worker_flags & USER_WORKER_SIG_IGN)
+		ignore_signals(p);
+
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-- 
2.25.1



* [PATCH v11 5/8] fork: allow kernel code to call copy_process
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
                   ` (3 preceding siblings ...)
  2023-02-02 23:25 ` [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-02 23:25 ` [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process Mike Christie
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie

The next patch adds helpers like create_io_thread, but for use by the
vhost layer. There are several functions, so they are in their own file
instead of cluttering up fork.c. This patch allows that new file to
call copy_process.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/sched/task.h | 2 ++
 kernel/fork.c              | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ce6240a006cf..b0e43a1fd21d 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,6 +94,8 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct task_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
+struct task_struct *copy_process(struct pid *pid, int trace, int node,
+				 struct kernel_clone_args *args);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 55c77de45271..93e545b08205 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2013,7 +2013,7 @@ static void rv_task_fork(struct task_struct *p)
  * parts of the process environment (as per the clone
  * flags). The actual kick-off is left to the caller.
  */
-static __latent_entropy struct task_struct *copy_process(
+__latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					int node,
-- 
2.25.1



* [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
                   ` (4 preceding siblings ...)
  2023-02-02 23:25 ` [PATCH v11 5/8] fork: allow kernel code to call copy_process Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-03  0:43   ` Linus Torvalds
  2023-02-02 23:25 ` [PATCH v11 7/8] vhost: move worker thread fields to new struct Mike Christie
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result
in VMs creating more threads than the admin expected.

This patch adds a new struct vhost_task which can be used instead of
kthreads. Vhost tasks allow the vhost layer to use copy_process and inherit
the userspace process's mm and cgroups, the task is accounted for
under the userspace process's nproc count and can be seen in its process
tree, and other features like namespaces work and are inherited by default.
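
The expected usage looks roughly like the following (an illustrative sketch
only; the context struct and function names here are made up, and patch 8
converts drivers/vhost/vhost.c to this pattern):

    struct my_ctx {
        struct vhost_task *vtsk;
        /* driver state, work list, etc. */
    };

    static int my_worker(void *data)
    {
        struct my_ctx *ctx = data;

        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (vhost_task_should_stop(ctx->vtsk)) {
                __set_current_state(TASK_RUNNING);
                break;
            }
            /* handle queued work, or schedule() if there is none */
        }
        return 0;
    }

    /* from the thread that did VHOST_SET_OWNER */
    ctx->vtsk = vhost_task_create(my_worker, ctx, NUMA_NO_NODE);
    if (!ctx->vtsk)
        return -ENOMEM;
    vhost_task_start(ctx->vtsk, "vhost-%d", current->pid);

    /* on device release */
    vhost_task_stop(ctx->vtsk);    /* waits for the worker and reaps it */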

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 MAINTAINERS                      |   2 +
 drivers/vhost/Kconfig            |   5 ++
 include/linux/sched/vhost_task.h |  23 ++++++
 kernel/Makefile                  |   1 +
 kernel/vhost_task.c              | 122 +++++++++++++++++++++++++++++++
 5 files changed, 153 insertions(+)
 create mode 100644 include/linux/sched/vhost_task.h
 create mode 100644 kernel/vhost_task.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8a5c25c20d00..5f7a3b3af7aa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22125,7 +22125,9 @@ L:	virtualization@lists.linux-foundation.org
 L:	netdev@vger.kernel.org
 S:	Maintained
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
+F:	kernel/vhost_task.c
 F:	drivers/vhost/
+F:	include/linux/sched/vhost_task.h
 F:	include/linux/vhost_iotlb.h
 F:	include/uapi/linux/vhost.h
 
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..b455d9ab6f3d 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -13,9 +13,14 @@ config VHOST_RING
 	  This option is selected by any driver which needs to access
 	  the host side of a virtio ring.
 
+config VHOST_TASK
+	bool
+	default n
+
 config VHOST
 	tristate
 	select VHOST_IOTLB
+	select VHOST_TASK
 	help
 	  This option is selected by any driver which needs to access
 	  the core of vhost.
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
new file mode 100644
index 000000000000..50d02a25d37b
--- /dev/null
+++ b/include/linux/sched/vhost_task.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VHOST_TASK_H
+#define _LINUX_VHOST_TASK_H
+
+#include <linux/completion.h>
+
+struct task_struct;
+
+struct vhost_task {
+	int (*fn)(void *data);
+	void *data;
+	struct completion exited;
+	unsigned long flags;
+	struct task_struct *task;
+};
+
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, int node);
+__printf(2, 3)
+void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...);
+void vhost_task_stop(struct vhost_task *vtsk);
+bool vhost_task_should_stop(struct vhost_task *vtsk);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 10ef068f598d..6fc72b3afbde 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -15,6 +15,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
 obj-$(CONFIG_MODULES) += kmod.o
 obj-$(CONFIG_MULTIUSER) += groups.o
+obj-$(CONFIG_VHOST_TASK) += vhost_task.o
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace internal ftrace files
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
new file mode 100644
index 000000000000..517dd166bb2b
--- /dev/null
+++ b/kernel/vhost_task.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 Oracle Corporation
+ */
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/sched/task.h>
+#include <linux/sched/vhost_task.h>
+#include <linux/sched/signal.h>
+
+enum vhost_task_flags {
+	VHOST_TASK_FLAGS_STOP,
+};
+
+static int vhost_task_fn(void *data)
+{
+	struct vhost_task *vtsk = data;
+	int ret;
+
+	ret = vtsk->fn(vtsk->data);
+	complete(&vtsk->exited);
+	do_exit(ret);
+}
+
+/**
+ * vhost_task_stop - stop a vhost_task
+ * @vtsk: vhost_task to stop
+ *
+ * Callers must call vhost_task_should_stop and return from their worker
+ * function when it returns true;
+ */
+void vhost_task_stop(struct vhost_task *vtsk)
+{
+	pid_t pid = vtsk->task->pid;
+
+	set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+	wake_up_process(vtsk->task);
+	/*
+	 * Make sure vhost_task_fn is no longer accessing the vhost_task before
+	 * freeing it below. If userspace crashed or exited without closing,
+	 * then the vhost_task->task could already be marked dead so
+	 * kernel_wait will return early.
+	 */
+	wait_for_completion(&vtsk->exited);
+	/*
+	 * If we are just closing/removing a device and the parent process is
+	 * not exiting then reap the task.
+	 */
+	kernel_wait4(pid, NULL, __WCLONE, NULL);
+	kfree(vtsk);
+}
+EXPORT_SYMBOL_GPL(vhost_task_stop);
+
+/**
+ * vhost_task_should_stop - should the vhost task return from the work function
+ */
+bool vhost_task_should_stop(struct vhost_task *vtsk)
+{
+	return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+}
+EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+
+/**
+ * vhost_task_create - create a copy of a process to be used by the kernel
+ * @fn: thread stack
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ *
+ * This returns a specialized task for use by the vhost layer or NULL on
+ * failure. The returned task is inactive, and the caller must fire it up
+ * through vhost_task_start().
+ */
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, int node)
+{
+	struct kernel_clone_args args = {
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.exit_signal	= 0,
+		.worker_flags	= USER_WORKER | USER_WORKER_NO_FILES |
+				  USER_WORKER_SIG_IGN,
+		.fn		= vhost_task_fn,
+	};
+	struct vhost_task *vtsk;
+	struct task_struct *tsk;
+
+	vtsk = kzalloc(sizeof(*vtsk), GFP_KERNEL);
+	if (!vtsk)
+		return ERR_PTR(-ENOMEM);
+	init_completion(&vtsk->exited);
+	vtsk->data = arg;
+	vtsk->fn = fn;
+
+	args.fn_arg = vtsk;
+
+	tsk = copy_process(NULL, 0, node, &args);
+	if (IS_ERR(tsk)) {
+		kfree(vtsk);
+		return NULL;
+	}
+
+	vtsk->task = tsk;
+	return vtsk;
+}
+EXPORT_SYMBOL_GPL(vhost_task_create);
+
+/**
+ * vhost_task_start - start a vhost_task created with vhost_task_create
+ * @vtsk: vhost_task to wake up
+ * @namefmt: printf-style format string for the thread name
+ */
+void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...)
+{
+	char name[TASK_COMM_LEN];
+	va_list args;
+
+	va_start(args, namefmt);
+	vsnprintf(name, sizeof(name), namefmt, args);
+	set_task_comm(vtsk->task, name);
+	va_end(args);
+
+	wake_up_new_task(vtsk->task);
+}
+EXPORT_SYMBOL_GPL(vhost_task_start);
-- 
2.25.1



* [PATCH v11 7/8] vhost: move worker thread fields to new struct
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
                   ` (5 preceding siblings ...)
  2023-02-02 23:25 ` [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-02-02 23:25 ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Mike Christie
  2023-02-07  8:19 ` [PATCH v11 0/8] Use copy_process in vhost layer Christian Brauner
  8 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 11 +++--
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bfd2f1f..74378d241f8d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -335,7 +335,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -350,7 +351,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -360,7 +361,7 @@ static int vhost_worker(void *data)
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
+			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
@@ -479,7 +480,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -571,10 +571,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	worker->kcov_handle = kcov_common_handle();
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -585,36 +635,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	vhost_attach_mm(dev);
 
-	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
 err_mm:
 	return err;
 }
@@ -704,12 +739,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d9109107af08..2f6beab93784 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+struct vhost_worker {
+	struct task_struct	*task;
+	struct llist_head	work_list;
+	struct vhost_dev	*dev;
+	u64			kcov_handle;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -147,8 +154,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -158,7 +164,6 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
-	u64 kcov_handle;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev, u32 asid,
 			   struct vhost_iotlb_msg *msg);
-- 
2.25.1



* [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
                   ` (6 preceding siblings ...)
  2023-02-02 23:25 ` [PATCH v11 7/8] vhost: move worker thread fields to new struct Mike Christie
@ 2023-02-02 23:25 ` Mike Christie
  2023-05-05 13:40   ` Nicolas Dichtel
  2023-07-20 13:06   ` Michael S. Tsirkin
  2023-02-07  8:19 ` [PATCH v11 0/8] Use copy_process in vhost layer Christian Brauner
  8 siblings, 2 replies; 98+ messages in thread
From: Mike Christie @ 2023-02-02 23:25 UTC (permalink / raw)
  To: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel
  Cc: Mike Christie

For vhost workers we use the kthread API, which inherits its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked, so while tools like libvirt try to control the number of
threads based on the nproc rlimit setting, we can end up creating more
threads than the user wanted.

This patch has us use the vhost_task helpers, which inherit their
values/checks from the thread that owns the device, similar to doing
a clone in userspace. The vhost threads will now be counted in the nproc
rlimits. And we get features like cgroups and mm sharing automatically,
so we can remove those calls.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 58 ++++++++-----------------------------------
 drivers/vhost/vhost.h |  4 +--
 2 files changed, 13 insertions(+), 49 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 74378d241f8d..d3c7c37b69a7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,11 +22,11 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/kthread.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/vhost_task.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/nospec.h>
 #include <linux/kcov.h>
@@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * test_and_set_bit() implies a memory barrier.
 		 */
 		llist_add(&work->node, &dev->worker->work_list);
-		wake_up_process(dev->worker->task);
+		wake_up_process(dev->worker->vtsk->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (vhost_task_should_stop(worker->vtsk)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -368,7 +365,7 @@ static int vhost_worker(void *data)
 				schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
+
 	return 0;
 }
 
@@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -580,14 +552,14 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_task_stop(worker->vtsk);
 	kfree(worker);
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
-	struct task_struct *task;
+	struct vhost_task *vtsk;
 	int ret;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
@@ -595,27 +567,19 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
-	if (IS_ERR(task)) {
-		ret = PTR_ERR(task);
+	vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
+	if (!vtsk) {
+		ret = -ENOMEM;
 		goto free_worker;
 	}
 
-	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
-
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
-
+	worker->vtsk = vtsk;
+	vhost_task_start(vtsk, "vhost-%d", current->pid);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2f6beab93784..3af59c65025e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -16,6 +16,7 @@
 #include <linux/irqbypass.h>
 
 struct vhost_work;
+struct vhost_task;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
 #define VHOST_WORK_QUEUED 1
@@ -26,9 +27,8 @@ struct vhost_work {
 };
 
 struct vhost_worker {
-	struct task_struct	*task;
+	struct vhost_task	*vtsk;
 	struct llist_head	work_list;
-	struct vhost_dev	*dev;
 	u64			kcov_handle;
 };
 
-- 
2.25.1



* Re: [PATCH v11 1/8] fork: Make IO worker options flag based
  2023-02-02 23:25 ` [PATCH v11 1/8] fork: Make IO worker options flag based Mike Christie
@ 2023-02-03  0:14   ` Linus Torvalds
  0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-02-03  0:14 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, konrad.wilk, linux-kernel, Christoph Hellwig

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
>  struct kernel_clone_args {
>         u64 flags;
> +       u32 worker_flags;
>         int __user *pidfd;
>         int __user *child_tid;
>         int __user *parent_tid;

Minor nit: please put this next to "exit_signal".

As it is, you've put a new 32-bit field in between two 64-bit fields
and are generating extra pointless padding.

We have that padding by "exit_signal" already, so let's just use it.

Also, I like moving those flags to a "flags" field, but can we please
make it consistent? We have that "args->kthread" field too, which is
100% analogous to args->io_thread.

So don't make a bit field for io_thread, and then not do the same for kthread.

Finally, why isn't this all just a bitfield - every single case would
seem to prefer something like

     if (args->user_worker) ..

instead of

    if (args->worker_flags & USER_WORKER)

which would seem to make everything simpler still?
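
Concretely, the layout being suggested would look something like this (the
field names are illustrative here, not necessarily what was eventually
merged):

    struct kernel_clone_args {
        u64 flags;
        ...
        /* one named bit per special-case user instead of a worker_flags word */
        u32 kthread:1;
        u32 io_thread:1;
        u32 user_worker:1;
        u32 no_files:1;
        u32 ignore_signals:1;
        ...
    };

    /* tests read naturally: */
    if (args->user_worker)
        p->flags |= PF_USER_WORKER;

    /* and initializers stay legible: */
    struct kernel_clone_args args = {
        .flags          = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
        .fn             = vhost_task_fn,
        .user_worker    = 1,
        .no_files       = 1,
        .ignore_signals = 1,
    };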

            Linus


* Re: [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files
  2023-02-02 23:25 ` [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files Mike Christie
@ 2023-02-03  0:16   ` Linus Torvalds
  0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-02-03  0:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, konrad.wilk, linux-kernel, Christoph Hellwig

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> -       retval = copy_files(clone_flags, p);
> +       retval = copy_files(clone_flags, p,
> +                           args->worker_flags & USER_WORKER_NO_FILES);

Just to hit the previous email comment home, adding just another
bitfield case would have made this patch simpler, and this would just
be

       retval = copy_files(clone_flags, p, args->no_files);

which seems more legible too.

             Linus


* Re: [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals
  2023-02-02 23:25 ` [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals Mike Christie
@ 2023-02-03  0:19   ` Linus Torvalds
  2023-02-05 16:06     ` Mike Christie
  0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2023-02-03  0:19 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, konrad.wilk, linux-kernel, Christoph Hellwig

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> +       if (args->worker_flags & USER_WORKER_SIG_IGN)
> +               ignore_signals(p);

Same comment as for the other case.

There are real reasons to avoid bitfields:

 - you can't pass addresses to them around

 - it's easier to read or assign multiple fields in one go

 - they are horrible for ABI issues due to the exact bit ordering and
padding being very subtle

but none of those issues are relevant here, where it's a kernel-internal ABI.

All these use-cases seem to actually be testing one bit at a time, and
the "assignments" are structure initializers for which named bitfields
are actually perfect and just make the initializer more legible.

            Linus


* Re: [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process
  2023-02-02 23:25 ` [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process Mike Christie
@ 2023-02-03  0:43   ` Linus Torvalds
  0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-02-03  0:43 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, konrad.wilk, linux-kernel

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> +/**
> + * vhost_task_start - start a vhost_task created with vhost_task_create
> + * @vtsk: vhost_task to wake up
> + * @namefmt: printf-style format string for the thread name
> + */
> +void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...)
> +{
> +       char name[TASK_COMM_LEN];
> +       va_list args;
> +
> +       va_start(args, namefmt);
> +       vsnprintf(name, sizeof(name), namefmt, args);
> +       set_task_comm(vtsk->task, name);
> +       va_end(args);
> +
> +       wake_up_new_task(vtsk->task);
> +}

Ok, I like this more than what we do for the IO workers - they set
their own names themselves once they start running, rather than have
the creator do it like this.

At the same time, my reaction to this was "why do we need to go
through that temporary 'name[]' buffer at all?"

And I think this patch is very much correct to do so, because
"copy_thread()" has already exposed the new thread to the rest of the
world, even though it hasn't actually started running yet.

So I think this is all doing the right thing, and I like how it does
it better than what io_uring does, BUT...

It does make me think that maybe we should make that task name
handling part of copy_process(), and simply create the task name
before we need this careful set_task_comm() with a temporary buffer.

Because if we just did it in copy_process() before the new task has
been exposed anywhere, we could just do it as

        if (args->name)
            vsnprintf(tsk->comm, TASK_COMM_LEN, "%s-%d", args->name, tsk->pid);

or something like that.

Not a big deal, it was just me reacting to this patch with "do we
really need set_task_comm() when we're creating the task?"
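
One possible shape for that, assuming a new kernel_clone_args 'name' field
(just a sketch, not the exact code that was merged later):

    /* in copy_process(), before the new task is visible to anything else */
    if (args->name)
        strscpy_pad(p->comm, args->name, sizeof(p->comm));

and then vhost_task_create() could pass the name up front and skip the
set_task_comm() step at start time.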

               Linus


* Re: [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals
  2023-02-03  0:19   ` Linus Torvalds
@ 2023-02-05 16:06     ` Mike Christie
  0 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-02-05 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, brauner,
	ebiederm, konrad.wilk, linux-kernel, Christoph Hellwig

On 2/2/23 6:19 PM, Linus Torvalds wrote:
> On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> +       if (args->worker_flags & USER_WORKER_SIG_IGN)
>> +               ignore_signals(p);
> 
> Same comment as for the other case.
> 
> There are real reasons to avoid bitfields:
> 
>  - you can't pass addresses to them around
> 
>  - it's easier to read or assign multiple fields in one go
> 
>  - they are horrible for ABI issues due to the exact bit ordering and
> padding being very subtle
> 
> but none of those issues are relevant here, where it's a kernel-internal ABI.
> 
> All these use-cases seem to actually be testing one bit at a time, and
> the "assignments" are structure initializers for which named bitfields
> are actually perfect and just make the initializer more legible.
> 

Thanks for the comments. I see what you mean and have fixed those instances and
updated kthread as well.



* Re: [PATCH v11 0/8] Use copy_process in vhost layer
  2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
                   ` (7 preceding siblings ...)
  2023-02-02 23:25 ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Mike Christie
@ 2023-02-07  8:19 ` Christian Brauner
  8 siblings, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-02-07  8:19 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, ebiederm,
	torvalds, konrad.wilk, linux-kernel

On Thu, Feb 02, 2023 at 05:25:09PM -0600, Mike Christie wrote:
> The following patches were made over Linus's tree. They allow the vhost
> layer to use copy_process instead of using workqueue_structs to create
> worker threads for VM's devices.

Thanks for keeping at this, Mike.
I can pick this up once you resend with the requested changes.


* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-02-02 23:25 ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Mike Christie
@ 2023-05-05 13:40   ` Nicolas Dichtel
  2023-05-05 18:22     ` Linus Torvalds
  2023-05-16 14:06     ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Linux regression tracking #adding (Thorsten Leemhuis)
  2023-07-20 13:06   ` Michael S. Tsirkin
  1 sibling, 2 replies; 98+ messages in thread
From: Nicolas Dichtel @ 2023-05-05 13:40 UTC (permalink / raw)
  To: Mike Christie, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, brauner, ebiederm, torvalds, konrad.wilk,
	linux-kernel

On 03/02/2023 at 00:25, Mike Christie wrote:
> For vhost workers we use the kthread API, which inherits its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked, so while tools like libvirt try to control the number of
> threads based on the nproc rlimit setting, we can end up creating more
> threads than the user wanted.
> 
> This patch has us use the vhost_task helpers, which inherit their
> values/checks from the thread that owns the device, similar to doing
> a clone in userspace. The vhost threads will now be counted in the nproc
> rlimits. And we get features like cgroups and mm sharing automatically,
> so we can remove those calls.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>

I have a question about (a side effect of?) this patch. The output of the 'ps'
command has changed. Here is an example:

Before:
$ ps
    PID TTY          TIME CMD
    598 ttyS0    00:00:00 login
    640 ttyS0    00:00:00 bash
   8880 ttyS0    00:00:06 example:2
   9389 ttyS0    00:00:00 ps
$ ps a
    PID TTY      STAT   TIME COMMAND
    598 ttyS0    Ss     0:00 /bin/login -p --
    602 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
    640 ttyS0    S      0:00 /bin/bash -li
   8880 ttyS0    SLl    0:10 /usr/bin/example
   9396 ttyS0    R+     0:00 ps a
$ pgrep -f example
8880

After:
$ ps
    PID TTY          TIME CMD
    538 ttyS0    00:00:00 login
    574 ttyS0    00:00:00 bash
   8275 ttyS0    00:03:28 example:2
   8285 ttyS0    00:00:00 vhost-8275
   8295 ttyS0    00:00:00 vhost-8275
   8299 ttyS0    00:00:00 vhost-8275
   9054 ttyS0    00:00:00 ps
$ ps a
    PID TTY      STAT   TIME COMMAND
    538 ttyS0    Ss     0:00 /bin/login -p --
    540 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
    574 ttyS0    S      0:00 /bin/bash -li
   8275 ttyS0    SLl    3:28 /usr/bin/example
   8285 ttyS0    SL     0:00 /usr/bin/example
   8295 ttyS0    SL     0:00 /usr/bin/example
   8299 ttyS0    SL     0:00 /usr/bin/example
   9055 ttyS0    R+     0:00 ps a
$ pgrep -f example
8275
8285
8295
8299

Is this an intended behavior?
This breaks some of our scripts.


Regards,
Nicolas


* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 13:40   ` Nicolas Dichtel
@ 2023-05-05 18:22     ` Linus Torvalds
  2023-05-05 22:37       ` Mike Christie
  2023-05-16 14:06     ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2023-05-05 18:22 UTC (permalink / raw)
  To: nicolas.dichtel, Christian Brauner
  Cc: Mike Christie, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, ebiederm, konrad.wilk, linux-kernel

On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
>
> Is this an intended behavior?
> This breaks some of our scripts.

It doesn't just break your scripts (which counts as a regression), I
think it's really wrong.

The worker threads should show up as threads of the thing that started
them, not as processes.

So they should show up in 'ps' only when one of the "show threads" flag is set.

But I suspect the fix is trivial:  the virtio code should likely use
CLONE_THREAD for the copy_process() it does.

It should look more like "create_io_thread()" than "copy_process()", I think.

For example, do virtio worker threads really want their own signals
and files? That sounds wrong. create_io_thread() uses all of

 CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO

to share much more of the context with the process it is actually run within.

Christian? Mike?

                Linus


* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 18:22     ` Linus Torvalds
@ 2023-05-05 22:37       ` Mike Christie
  2023-05-06  1:53         ` Linus Torvalds
                           ` (3 more replies)
  0 siblings, 4 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-05 22:37 UTC (permalink / raw)
  To: Linus Torvalds, nicolas.dichtel, Christian Brauner
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, ebiederm,
	konrad.wilk, linux-kernel

On 5/5/23 1:22 PM, Linus Torvalds wrote:
> On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
> <nicolas.dichtel@6wind.com> wrote:
>>
>> Is this an intended behavior?
>> This breaks some of our scripts.
> 
> It doesn't just break your scripts (which counts as a regression), I
> think it's really wrong.
> 
> The worker threads should show up as threads of the thing that started
> them, not as processes.
> 
> So they should show up in 'ps' only when one of the "show threads" flag is set.
> 
> But I suspect the fix is trivial:  the virtio code should likely use
> CLONE_THREAD for the copy_process() it does.
> 
> It should look more like "create_io_thread()" than "copy_process()", I think.
> 
> For example, do virtio worker threads really want their own signals
> and files? That sounds wrong. create_io_thread() uses all of
> 
>  CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO
> 
> to share much more of the context with the process it is actually run within.
> 

For the vhost tasks and the CLONE flags:

1. I didn't use CLONE_FILES in the vhost task patches because you are right
and we didn't need our own. We needed it to work like kthreads where there
are no files, so I set the kernel_clone_args.no_files bit to have copy_files
not do a dup or clone (task->files is NULL).

2. vhost tasks didn't use CLONE_SIGHAND, because userspace apps like qemu use
signals for management operations. But, the vhost thread's worker functions
assume signals are ignored like they were with kthreads. So if they were doing
IO and got a signal like a SIGHUP they might return early and fail from whatever
network/block function they were calling. And currently the parent like qemu
handles something like a SIGSTOP by shutting everything down by calling into
the vhost interface to remove the device.

So similar to files, I used the kernel_clone_args.ignore_signals bit so
copy_process gives the vhost thread its own signal handling that just ignores
signals.

3. I didn't use CLONE_THREAD because before my patches you could do
"ps -u root" and see all the vhost threads. If we use CLONE_THREAD, then we
can only see it when we do something like "ps -T -p $parent" like you mentioned
above. I guess I messed up and did the reverse and thought it would be a
regression if "ps -u root" no longer showed the vhost threads.

If it's ok to change the behavior of "ps -u root", then we can do this patch:
(Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
case. If you could test the ps only case or give me info on what /usr/bin/example
was doing I can replicate and test here):


diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..eb9ffc58e211 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
 	/*
 	 * Thread groups must share signals as well, and detached threads
 	 * can only be started up within the thread group.
+	 *
+	 * A userworker's parent thread will normally have a signal handler
+	 * that performs management operations, but the worker will not
+	 * because the parent will handle the signal then use a worker
+	 * specific interface to manage the thread and related resources.
 	 */
-	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
+	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
+	    !args->user_worker && !args->ignore_signals)
 		return ERR_PTR(-EINVAL);
 
 	/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..3700c21ea39d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
+				  CLONE_UNTRACED,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,












* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 22:37       ` Mike Christie
@ 2023-05-06  1:53         ` Linus Torvalds
  2023-05-08 17:13         ` Christian Brauner
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-05-06  1:53 UTC (permalink / raw)
  To: Mike Christie
  Cc: nicolas.dichtel, Christian Brauner, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel

On Fri, May 5, 2023 at 3:38 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> If it's ok to change the behavior of "ps -u root", then we can do this patch:

I think this is the right thing to do.

Making the user worker threads show up as threads with the vhost
process as the parent really seems like a much better model, and more
accurate.

Yes, they used to show up as random kernel threads, and you'd see them
as such (not just for "ps -u root", but simply also with just a normal
"ps ax" kind of thing). But that isn't all that helpful, and it's
really just annoying to see our kernel threads in "ps ax" output, and
I've often wished we didn't do that (think of all the random
"kworker/xyz-kcryptd" etc things that show up).

So I think showing them as the threaded children of the vhost process
is much nicer, and probably the best option.

Because I don't think anything is going to get the *old* behavior of
showing them as the '[vhost-xyz]' system threads (or whatever the old
output ended up being in 'ps ax'), but hopefully nothing wants that
horror anyway.

At a minimum, the parenting is fundamentally going to look different
in the new model.

                   Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 22:37       ` Mike Christie
  2023-05-06  1:53         ` Linus Torvalds
@ 2023-05-08 17:13         ` Christian Brauner
  2023-05-09  8:09         ` Nicolas Dichtel
  2023-05-13 12:39         ` Thorsten Leemhuis
  3 siblings, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-08 17:13 UTC (permalink / raw)
  To: Mike Christie
  Cc: Linus Torvalds, nicolas.dichtel, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel

On Fri, May 05, 2023 at 05:37:40PM -0500, Mike Christie wrote:
> On 5/5/23 1:22 PM, Linus Torvalds wrote:
> > On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
> > <nicolas.dichtel@6wind.com> wrote:
> >>
> >> Is this an intended behavior?
> >> This breaks some of our scripts.
> > 
> > It doesn't just break your scripts (which counts as a regression), I
> > think it's really wrong.
> > 
> > The worker threads should show up as threads of the thing that started
> > them, not as processes.
> > 
> > So they should show up in 'ps' only when one of the "show threads" flag is set.
> > 
> > But I suspect the fix is trivial:  the virtio code should likely use
> > CLONE_THREAD for the copy_process() it does.
> > 
> > It should look more like "create_io_thread()" than "copy_process()", I think.
> > 
> > For example, do virtio worker threads really want their own signals
> > and files? That sounds wrong. create_io_thread() uses all of
> > 
> >  CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO
> > 
> > to share much more of the context with the process it is actually run within.
> > 
> 
> For the vhost tasks and the CLONE flags:
> 
> 1. I didn't use CLONE_FILES in the vhost task patches because you are right
> and we didn't need our own. We needed it to work like kthreads where there
> are no files, so I set the kernel_clone_args.no_files bit to have copy_files
> not do a dup or clone (task->files is NULL).
> 
> 2. vhost tasks didn't use CLONE_SIGHAND, because userspace apps like qemu use
> signals for management operations. But, the vhost thread's worker functions
> assume signals are ignored like they were with kthreads. So if they were doing
> IO and got a signal like a SIGHUP they might return early and fail from whatever
> network/block function they were calling. And currently the parent like qemu
> handles something like a SIGSTOP by shutting everything down by calling into
> the vhost interface to remove the device.
> 
> So similar to files I used the kernel_clone_args.ignore_signals bit so
> copy_process gives the vhost thread its own signal handler that just ignores
> signals.
> 
> 3. I didn't use CLONE_THREAD because before my patches you could do
> "ps -u root" and see all the vhost threads. If we use CLONE_THREAD, then we
> can only see it when we do something like "ps -T -p $parent" like you mentioned
> above. I guess I messed up and did the reverse and thought it would be a
> regression if "ps -u root" no longer showed the vhost threads.
> 
> If it's ok to change the behavior of "ps -u root", then we can do this patch:
> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
> case. If you could test the ps only case or give me info on what /usr/bin/example
> was doing I can replicate and test here):
> 
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..eb9ffc58e211 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
>  	/*
>  	 * Thread groups must share signals as well, and detached threads
>  	 * can only be started up within the thread group.
> +	 *
> +	 * A user worker's parent thread will normally have a signal handler
> +	 * that performs management operations, but the worker will not,
> +	 * because the parent will handle the signal and then use a worker
> +	 * specific interface to manage the thread and related resources.
>  	 */
> -	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> +	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
> +	    !args->user_worker && !args->ignore_signals)
>  		return ERR_PTR(-EINVAL);

I'm currently traveling due to LSFMM so that's why my responses will be
delayed this week. I'm not yet clear if CLONE_THREAD without
CLONE_SIGHAND is safe. If there's code that assumes that
$task_in_threadgroup->sighand->siglock always covers all threads in the
threadgroup then this change would break this assumption?

>  
>  	/*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..3700c21ea39d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
> +				  CLONE_UNTRACED,
>  		.exit_signal	= 0,
>  		.fn		= vhost_task_fn,
>  		.name		= name,
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 22:37       ` Mike Christie
  2023-05-06  1:53         ` Linus Torvalds
  2023-05-08 17:13         ` Christian Brauner
@ 2023-05-09  8:09         ` Nicolas Dichtel
  2023-05-09  8:17           ` Nicolas Dichtel
  2023-05-13 12:39         ` Thorsten Leemhuis
  3 siblings, 1 reply; 98+ messages in thread
From: Nicolas Dichtel @ 2023-05-09  8:09 UTC (permalink / raw)
  To: Mike Christie, Linus Torvalds, Christian Brauner
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, ebiederm,
	konrad.wilk, linux-kernel

On 06/05/2023 at 00:37, Mike Christie wrote:
[snip]
> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
> case. If you could test the ps only case or give me info on what /usr/bin/example
> was doing I can replicate and test here):
With your patch:
$ ps a
  PID TTY      STAT   TIME COMMAND
  191 ttyS0    Ss     0:00 /bin/sh -li
 1255 ttyS0    SLl    0:53 /usr/bin/example
 1742 ttyS0    R+     0:00 ps a
$ ps
  PID TTY          TIME CMD
  191 ttyS0    00:00:00 sh
 1743 ttyS0    00:00:00 ps

This fixes the regression on our side, but now, 'example' is not displayed
anymore with 'ps'.

Thank you,
Nicolas

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-09  8:09         ` Nicolas Dichtel
@ 2023-05-09  8:17           ` Nicolas Dichtel
  0 siblings, 0 replies; 98+ messages in thread
From: Nicolas Dichtel @ 2023-05-09  8:17 UTC (permalink / raw)
  To: Mike Christie, Linus Torvalds, Christian Brauner
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, ebiederm,
	konrad.wilk, linux-kernel

On 09/05/2023 at 10:09, Nicolas Dichtel wrote:
> On 06/05/2023 at 00:37, Mike Christie wrote:
> [snip]
>> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
>> case. If you could test the ps only case or give me info on what /usr/bin/example
>> was doing I can replicate and test here):
> With your patch:
> $ ps a
>   PID TTY      STAT   TIME COMMAND
>   191 ttyS0    Ss     0:00 /bin/sh -li
>  1255 ttyS0    SLl    0:53 /usr/bin/example
>  1742 ttyS0    R+     0:00 ps a
> $ ps
>   PID TTY          TIME CMD
>   191 ttyS0    00:00:00 sh
>  1743 ttyS0    00:00:00 ps
Sorry, this is wrong, here is the right screenshot:
$ ps
    PID TTY          TIME CMD
    538 ttyS0    00:00:00 login
    573 ttyS0    00:00:00 bash
   8282 ttyS0    00:00:04 example:2
   8825 ttyS0    00:00:00 ps
$ ps a
    PID TTY      STAT   TIME COMMAND
    538 ttyS0    Ss     0:00 /bin/login -p --
    540 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
    573 ttyS0    S      0:00 /bin/bash -li
   8282 ttyS0    RLl    0:05 /usr/bin/example
   8829 ttyS0    R+     0:00 ps a

It fixes the issue.


Thank you,
Nicolas

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 22:37       ` Mike Christie
                           ` (2 preceding siblings ...)
  2023-05-09  8:09         ` Nicolas Dichtel
@ 2023-05-13 12:39         ` Thorsten Leemhuis
  2023-05-13 15:08           ` Linus Torvalds
  3 siblings, 1 reply; 98+ messages in thread
From: Thorsten Leemhuis @ 2023-05-13 12:39 UTC (permalink / raw)
  To: Mike Christie, Linus Torvalds, nicolas.dichtel,
	Christian Brauner, Linux kernel regressions list
  Cc: hch, stefanha, jasowang, mst, sgarzare, virtualization, ebiederm,
	konrad.wilk, linux-kernel

[CCing the regression list]

On 06.05.23 00:37, Mike Christie wrote:
> On 5/5/23 1:22 PM, Linus Torvalds wrote:
>> On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
>> <nicolas.dichtel@6wind.com> wrote:
>>>
>>> Is this an intended behavior?
>>> This breaks some of our scripts.

Jumping in here, as I found another problem with that patch: it broke
s2idle on my laptop when a qemu-kvm VM is running, as freezing user
space processes now fails for me:

```
 [  195.442949] PM: suspend entry (s2idle)
 [  195.641271] Filesystems sync: 0.198 seconds
 [  195.833828] Freezing user space processes
 [  215.841084] Freezing user space processes failed after 20.007
seconds (1 tasks refusing to freeze, wq_busy=0):
 [  215.841255] task:vhost-3221      state:R stack:0     pid:3250
ppid:3221   flags:0x00004006
 [  215.841264] Call Trace:
 [  215.841266]  <TASK>
 [  215.841270]  ? update_rq_clock+0x39/0x270
 [  215.841283]  ? _raw_spin_unlock+0x19/0x40
 [  215.841290]  ? __schedule+0x3f/0x1510
 [  215.841296]  ? sysvec_apic_timer_interrupt+0xaf/0xd0
 [  215.841306]  ? schedule+0x61/0xe0
 [  215.841313]  ? vhost_worker+0x87/0xb0 [vhost]
 [  215.841329]  ? vhost_task_fn+0x1a/0x30
 [  215.841336]  ? __pfx_vhost_task_fn+0x10/0x10
 [  215.841341]  ? ret_from_fork+0x2c/0x50
 [  215.841352]  </TASK>
 [  215.841936] OOM killer enabled.
 [  215.841938] Restarting tasks ... done.
 [  215.844204] random: crng reseeded on system resumption
 [  215.957095] PM: suspend exit
 [  215.957185] PM: suspend entry (s2idle)
 [  215.967646] Filesystems sync: 0.010 seconds
 [  215.971326] Freezing user space processes
 [  235.974400] Freezing user space processes failed after 20.003
seconds (1 tasks refusing to freeze, wq_busy=0):
 [  235.974574] task:vhost-3221      state:R stack:0     pid:3250
ppid:3221   flags:0x00004806
 [  235.974583] Call Trace:
 [  235.974586]  <TASK>
 [  235.974593]  ? __schedule+0x184/0x1510
 [  235.974605]  ? sysvec_apic_timer_interrupt+0xaf/0xd0
 [  235.974616]  ? schedule+0x61/0xe0
 [  235.974624]  ? vhost_worker+0x87/0xb0 [vhost]
 [  235.974648]  ? vhost_task_fn+0x1a/0x30
 [  235.974656]  ? __pfx_vhost_task_fn+0x10/0x10
 [  235.974662]  ? ret_from_fork+0x2c/0x50
 [  235.974673]  </TASK>
 [  235.975190] OOM killer enabled.
 [  235.975192] Restarting tasks ... done.
 [  235.978131] random: crng reseeded on system resumption
 [  236.091219] PM: suspend exit
```

After running into the problem I booted 6.3.1-rc1 again and there s2idle
still worked. Didn't do a bisection, just looked at the vhost commits
during the latest merge window; 6e890c5d502 ("vhost: use vhost_tasks for
worker threads") looked suspicious, so I reverted it on top of latest
mainline and then things worked again. Through a search on lore I arrived
in this thread and found below patch from Mike. Gave it a try on top of
latest mainline, but it didn't help.

Ciao, Thorsten

> [...]
> If it's ok to change the behavior of "ps -u root", then we can do this patch:
> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
> case. If you could test the ps only case or give me info on what /usr/bin/example
> was doing I can replicate and test here):
> 
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..eb9ffc58e211 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
>  	/*
>  	 * Thread groups must share signals as well, and detached threads
>  	 * can only be started up within the thread group.
> +	 *
> +	 * A user worker's parent thread will normally have a signal handler
> +	 * that performs management operations, but the worker will not,
> +	 * because the parent will handle the signal and then use a worker
> +	 * specific interface to manage the thread and related resources.
>  	 */
> -	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> +	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
> +	    !args->user_worker && !args->ignore_signals)
>  		return ERR_PTR(-EINVAL);
>  
>  	/*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..3700c21ea39d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
> +				  CLONE_UNTRACED,
>  		.exit_signal	= 0,
>  		.fn		= vhost_task_fn,
>  		.name		= name

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-13 12:39         ` Thorsten Leemhuis
@ 2023-05-13 15:08           ` Linus Torvalds
  2023-05-15 14:23             ` Christian Brauner
  0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2023-05-13 15:08 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Mike Christie, nicolas.dichtel, Christian Brauner,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel

On Sat, May 13, 2023 at 7:39 AM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>
> Jumping in here, as I found another problem with that patch: it broke
> s2idle on my laptop when a qemu-kvm VM is running, as freezing user
> space processes now fails for me:

Hmm. kthreads have PF_NOFREEZE by default, which is probably the reason.

Adding

        current->flags |= PF_NOFREEZE;

to the vhost_task setup might just fix it, but it feels a bit off.

The way io_uring does this is to do

                if (signal_pending(current)) {
                        struct ksignal ksig;

                        if (!get_signal(&ksig))
                                continue;
                        break;
                }

in the main loop, which ends up handling the freezer situation too.
But it should handle things like SIGSTOP etc as well, and also exit on
actual signals.
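
As a rough sketch (the loop structure below is paraphrased from the vhost
worker code rather than copied exactly), that would mean something like:

	static int vhost_worker(void *data)
	{
		...
		for (;;) {
			/* existing stop / work-queue handling elided */

			if (signal_pending(current)) {
				struct ksignal ksig;

				/* get_signal() also covers the freezer and
				 * SIGSTOP; on a fatal signal we leave the loop */
				if (!get_signal(&ksig))
					continue;
				break;
			}
			...
		}
		return 0;
	}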

I get the feeling that the whole "vhost_task_should_stop()" logic
should have the exact logic above, and basically make those threads
killable as well.

Hmm?

                Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-13 15:08           ` Linus Torvalds
@ 2023-05-15 14:23             ` Christian Brauner
  2023-05-15 15:44               ` Linus Torvalds
  0 siblings, 1 reply; 98+ messages in thread
From: Christian Brauner @ 2023-05-15 14:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thorsten Leemhuis, Mike Christie, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On Sat, May 13, 2023 at 10:08:04AM -0500, Linus Torvalds wrote:
> On Sat, May 13, 2023 at 7:39 AM Thorsten Leemhuis <linux@leemhuis.info> wrote:
> >
> > Jumping in here, as I found another problem with that patch: it broke
> > s2idle on my laptop when a qemu-kvm VM is running, as freezing user
> > space processes now fails for me:
> 
> Hmm. kthreads have PF_NOFREEZE by default, which is probably the reason.
> 
> Adding
> 
>         current->flags |= PF_NOFREEZE;
> 
> to the vhost_task setup might just fix it, but it feels a bit off.
> 
> The way io_uring does this is to  do
> 
>                 if (signal_pending(current)) {
>                         struct ksignal ksig;
> 
>                         if (!get_signal(&ksig))
>                                 continue;
>                         break;
>                 }
> 
> in the main loop, which ends up handling the freezer situation too.
> But it should handle things like SIGSTOP etc as well, and also exit on
> actual signals.
> 
> I get the feeling that the whole "vhost_task_should_stop()" logic
> should have the exact logic above, and basically make those threads
> killable as well.
> 
> Hmm?

I'm still trying to catch up after LSFMM with everything that's happened
on the fs side so coming back to this thread with a fresh set of eyes is
difficult. Sorry about the delay here.

So we seem to have two immediate issues:
(1) The current logic breaks ps output because vhost creates helper
    processes instead of threads. The suggested patch by Mike was to
    make them proper threads again but somehow special threads in the
    sense that they don't unshare signal handlers. The latter part is
    possibly broken and seems hacky. (That's earlier in the thread.)
(2) Freezing of vhost tasks fails. (This mail.)

So I think we will be able to address (1) and (2) by making vhost tasks
proper threads and blocking every signal except for SIGKILL and SIGSTOP
and then having vhost handle get_signal() - as you mentioned - the same
way io_uring already does. We should also remove the ignore_signals
thing completely imho. I don't think we ever want to do this with user
workers.

@Mike, can you get a patch ready ideally this week so we can get this
fixed soon?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 14:23             ` Christian Brauner
@ 2023-05-15 15:44               ` Linus Torvalds
  2023-05-15 15:52                 ` Jens Axboe
  2023-05-15 22:23                 ` Mike Christie
  0 siblings, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-05-15 15:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Thorsten Leemhuis, Mike Christie, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
>
> So I think we will be able to address (1) and (2) by making vhost tasks
> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> and then having vhost handle get_signal() - as you mentioned - the same
> way io_uring already does. We should also remove the ignore_signals
> thing completely imho. I don't think we ever want to do this with user
> workers.

Right. That's what IO_URING does:

        if (args->io_thread) {
                /*
                 * Mark us an IO worker, and block any signal that isn't
                 * fatal or STOP
                 */
                p->flags |= PF_IO_WORKER;
                siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
        }

and I really think that vhost should basically do exactly what io_uring does.

Not because io_uring fundamentally got this right - but simply because
io_uring had almost all the same bugs (and then some), and what the
io_uring worker threads ended up doing was to basically zoom in on
"this works".

And it zoomed in on it largely by just going for "make it look as much
as possible as a real user thread", because every time the kernel
thread did something different, it just caused problems.

So I think the patch should just look something like the attached.
Mike, can you test this on whatever vhost test-suite?

I did consider getting rid of ".ignore_signals" entirely, and instead
just keying the "block signals" behavior off the ".user_worker" flag.
But this approach doesn't seem wrong either, and I don't think it's
wrong to make the create_io_thread() function say that
".ignore_signals = 1" thing explicitly, rather than key it off the
".io_thread" flag.

Jens/Christian - comments?

Slightly related to this all: I think vhost should also do
CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
vhost doesn't use any files, it shouldn't matter, and looking
different just to be different is wrong. But if vhost doesn't use any
files, the current situation shouldn't be a bug either.

                     Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 15:44               ` Linus Torvalds
@ 2023-05-15 15:52                 ` Jens Axboe
  2023-05-15 15:54                   ` Linus Torvalds
  2023-05-15 15:56                   ` Linus Torvalds
  2023-05-15 22:23                 ` Mike Christie
  1 sibling, 2 replies; 98+ messages in thread
From: Jens Axboe @ 2023-05-15 15:52 UTC (permalink / raw)
  To: Linus Torvalds, Christian Brauner
  Cc: Thorsten Leemhuis, Mike Christie, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel

On 5/15/23 9:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io_uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
> 
> Right. That's what IO_URING does:
> 
>         if (args->io_thread) {
>                 /*
>                  * Mark us an IO worker, and block any signal that isn't
>                  * fatal or STOP
>                  */
>                 p->flags |= PF_IO_WORKER;
>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>         }
> 
> and I really think that vhost should basically do exactly what io_uring does.
> 
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
> 
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.

This is exactly what I told Christian in a private chat too - we went
through all of that, and this is what works. KISS.

> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

Seems like that didn't get attached...

> I did consider getting rid of ".ignore_signals" entirely, and instead
> just keying the "block signals" behavior off the ".user_worker" flag.
> But this approach doesn't seem wrong either, and I don't think it's
> wrong to make the create_io_thread() function say that
> ".ignore_signals = 1" thing explicitly, rather than key it off the
> ".io_thread" flag.
> 
> Jens/Christian - comments?
> 
> Slightly related to this all: I think vhost should also do
> CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
> vhost doesn't use any files, it shouldn't matter, and looking
> different just to be different is wrong. But if vhost doesn't use any
> files, the current situation shouldn't be a bug either.

Only potential downside is that it does make file references more
expensive for other syscalls, since you now have a shared file table.
But probably not something to worry about here?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 15:52                 ` Jens Axboe
@ 2023-05-15 15:54                   ` Linus Torvalds
  2023-05-15 17:23                     ` Linus Torvalds
  2023-05-15 15:56                   ` Linus Torvalds
  1 sibling, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2023-05-15 15:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christian Brauner, Thorsten Leemhuis, Mike Christie,
	nicolas.dichtel, Linux kernel regressions list, hch, stefanha,
	jasowang, mst, sgarzare, virtualization, ebiederm, konrad.wilk,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 401 bytes --]

On Mon, May 15, 2023 at 8:52 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 5/15/23 9:44 AM, Linus Torvalds wrote:
> >
> > So I think the patch should just look something like the attached.
> > Mike, can you test this on whatever vhost test-suite?
>
> Seems like that didn't get attached...

Blush. I decided to build-test it, and then forgot to attach it. Here.

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1712 bytes --]

 kernel/fork.c       | 12 +++---------
 kernel/vhost_task.c |  3 ++-
 2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..cd06b137418f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
 		p->flags |= PF_KTHREAD;
 	if (args->user_worker)
 		p->flags |= PF_USER_WORKER;
-	if (args->io_thread) {
-		/*
-		 * Mark us an IO worker, and block any signal that isn't
-		 * fatal or STOP
-		 */
+	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
+	if (args->ignore_signals)
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-	}
 
 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
-
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn_arg		= arg,
 		.io_thread	= 1,
 		.user_worker	= 1,
+		.ignore_signals	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2e334b2d7cc4 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,7 +75,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 15:52                 ` Jens Axboe
  2023-05-15 15:54                   ` Linus Torvalds
@ 2023-05-15 15:56                   ` Linus Torvalds
  1 sibling, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-05-15 15:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christian Brauner, Thorsten Leemhuis, Mike Christie,
	nicolas.dichtel, Linux kernel regressions list, hch, stefanha,
	jasowang, mst, sgarzare, virtualization, ebiederm, konrad.wilk,
	linux-kernel

On Mon, May 15, 2023 at 8:52 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Only potential downside is that it does make file references more
> expensive for other syscalls, since you now have a shared file table.
> But probably not something to worry about here?

Would the user processes that create vhost user workers ever be otherwise single-threaded?

I'd *assume* that a vhost user is already doing its own threads. But
maybe that's a completely bogus assumption. I don't actually use any
of this, so...

Because you are obviously 100% right that if you're otherwise
single-threaded, then a CLONE_FILES kernel helper thread will cause
the extra cost for file descriptor lookup/free due to all the race
prevention.

                 Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 15:54                   ` Linus Torvalds
@ 2023-05-15 17:23                     ` Linus Torvalds
  0 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-05-15 17:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christian Brauner, Thorsten Leemhuis, Mike Christie,
	nicolas.dichtel, Linux kernel regressions list, hch, stefanha,
	jasowang, mst, sgarzare, virtualization, ebiederm, konrad.wilk,
	linux-kernel

On Mon, May 15, 2023 at 8:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Blush. I decided to build-test it, and then forgot to attach it. Here.

Btw, if this tests out good and we end up doing this, I think we
should also just rename that '.ignore_signals' bitfield to
'.block_signals' to actually match what it does.

But that's an entirely cosmetic thing just to clarify things. The
patch would look almost identical, apart from the new name (and the
small additional parts to rename the two existing users that weren't
touched by that patch - the header file and the vhost use-case).

                   Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 15:44               ` Linus Torvalds
  2023-05-15 15:52                 ` Jens Axboe
@ 2023-05-15 22:23                 ` Mike Christie
  2023-05-15 22:54                   ` Linus Torvalds
  2023-05-16  8:39                   ` Christian Brauner
  1 sibling, 2 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-15 22:23 UTC (permalink / raw)
  To: Linus Torvalds, Christian Brauner
  Cc: Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On 5/15/23 10:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io_uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
> 
> Right. That's what IO_URING does:
> 
>         if (args->io_thread) {
>                 /*
>                  * Mark us an IO worker, and block any signal that isn't
>                  * fatal or STOP
>                  */
>                 p->flags |= PF_IO_WORKER;
>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>         }
> 
> and I really think that vhost should basically do exactly what io_uring does.
> 
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
> 
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.
> 
> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

I tried that approach already and it doesn't work because io_uring and vhost
differ in that vhost drivers implement a device where each device has a vhost_task
and the drivers have a file_operations for the device. When the vhost_task's
parent gets signal like SIGKILL, then it will exit and call into the vhost
driver's file_operations->release function. At this time, we need to do cleanup
like flush the device which uses the vhost_task. There is also the case where if
the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.

io_uring has a similar cleanup issue, where the core kernel code can't do the
exit for the io thread, but it only has that one point it has to worry
about, so when it gets SIGKILL it can clean itself up and then exit.

So the patch in the other mail hits an issue where vhost_worker() can get into
a tight loop hammering the CPU due to the pending SIGKILL signal, as shown below.
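
(To spell that out: the worker sleeps via schedule() in TASK_INTERRUPTIBLE, so
with SIGKILL left pending and never dequeued, schedule() just returns right
away on every iteration. Roughly, paraphrasing the worker loop rather than
quoting it:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (vhost_task_should_stop(worker->vtsk))
			break;

		node = llist_del_all(&worker->work_list);
		if (!node)
			/* returns immediately while the signal stays pending */
			schedule();
		...
	}

so nothing ever clears the pending signal and the loop spins.)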

The vhost layer really doesn't want any signals and wants to work like kthreads
for that case. To make it really simple, can we do something like this, where it
separates user and io worker behavior and the major difference is how they handle
signals and exit? I also included a fix for the freeze case:



diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->user_worker)
+	if (args->user_worker) {
+		/*
+		 * User workers are similar to io_threads but they do not
+		 * support signals and cleanup is driven via another kernel
+		 * interface so even SIGKILL is blocked.
+		 */
 		p->flags |= PF_USER_WORKER;
+		siginitsetinv(&p->blocked, 0);
+	}
 	if (args->io_thread) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
+	if (args->user_worker)
+		p->flags |= PF_NOFREEZE;
 
 	stackleak_task_init(p);
 
@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn		= fn,
 		.fn_arg		= arg,
 		.io_thread	= 1,
-		.user_worker	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..f2f1e5ef44b2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !task_sigpending(p);
 }
 
+static void try_set_pending_sigkill(struct task_struct *t)
+{
+	/*
+	 * User workers don't support signals and their exit is driven through
+	 * their kernel layer, so do not send them SIGKILL.
+	 */
+	if (t->flags & PF_USER_WORKER)
+		return;
+
+	sigaddset(&t->pending.signal, SIGKILL);
+	signal_wake_up(t, 1);
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 			t = p;
 			do {
 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-				sigaddset(&t->pending.signal, SIGKILL);
-				signal_wake_up(t, 1);
+				try_set_pending_sigkill(t);
 			} while_each_thread(p, t);
 			return;
 		}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
-		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+		try_set_pending_sigkill(t);
 	}
 
 	return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
 		.no_files	= 1,
-		.ignore_signals	= 1,
 	};
 	struct vhost_task *vtsk;
 	struct task_struct *tsk;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..255a2147e5c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle user workers, kthreads other than kswapd or
+	 * Do not throttle IO/user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
+	    current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) {
 		cond_resched();
 		return;
 	}

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 22:23                 ` Mike Christie
@ 2023-05-15 22:54                   ` Linus Torvalds
  2023-05-16  3:53                     ` Mike Christie
  2023-05-16 15:56                     ` Eric W. Biederman
  2023-05-16  8:39                   ` Christian Brauner
  1 sibling, 2 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-05-15 22:54 UTC (permalink / raw)
  To: Mike Christie, Oleg Nesterov
  Cc: Christian Brauner, Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On Mon, May 15, 2023 at 3:23 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> The vhost layer really doesn't want any signals and wants to work like kthreads
> for that case. To make it really simple can we do something like this where it
> separates user and io worker behavior where the major diff is how they handle
> signals and exit. I also included a fix for the freeze case:

I don't love the SIGKILL special case, but I also don't find this
deeply offensive. So if this is what it takes, I'm ok with it.

I wonder if we could make that special case simply check for "is
SIGKILL blocked" instead? No normal case will cause that, and it means
that a PF_USER_WORKER thread could decide per-thread what it wants to
do wrt SIGKILL.

Christian? And I guess we should Cc: Oleg too, since the signal parts
are an area he's familiar with and has worked on. Eric Biederman has
already been on the list and has also been involved.

Oleg: see

  https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d496@oracle.com/

for the context here.

              Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 22:54                   ` Linus Torvalds
@ 2023-05-16  3:53                     ` Mike Christie
  2023-05-16 13:18                       ` Oleg Nesterov
  2023-05-16 13:40                       ` Oleg Nesterov
  2023-05-16 15:56                     ` Eric W. Biederman
  1 sibling, 2 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-16  3:53 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Christian Brauner, Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On 5/15/23 5:54 PM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 3:23 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> The vhost layer really doesn't want any signals and wants to work like kthreads
>> for that case. To make it really simple can we do something like this where it
>> separates user and io worker behavior where the major diff is how they handle
>> signals and exit. I also included a fix for the freeze case:
> 
> I don't love the SIGKILL special case, but I also don't find this
> deeply offensive. So if this is what it takes, I'm ok with it.
> 
> I wonder if we could make that special case simply check for "is
> SIGKILL blocked" instead? No normal case will cause that, and it means

Yeah, it's doable. Updated below.

> that a PF_USER_WORKER thread could decide per-thread what it wants to
> do wrt SIGKILL.
> 
> Christian? And I guess we should Cc: Oleg too, since the signal parts
> are an area he's familiar with and has worked on.. Eric Biederman has
> already been on the list and has also been involved
> 
> Oleg: see
> 
>   https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d496@oracle.com/
> 
> for the context here.

Oleg and Christian,


Below is an updated patch that doesn't check for PF_USER_WORKER in the
signal.c code and instead checks whether we have blocked the signal.




diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->user_worker)
+	if (args->user_worker) {
+		/*
+		 * User workers are similar to io_threads but they do not
+		 * support signals and cleanup is driven via another kernel
+		 * interface so even SIGKILL is blocked.
+		 */
 		p->flags |= PF_USER_WORKER;
+		siginitsetinv(&p->blocked, 0);
+	}
 	if (args->io_thread) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
+	if (args->user_worker)
+		p->flags |= PF_NOFREEZE;
 
 	stackleak_task_init(p);
 
@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn		= fn,
 		.fn_arg		= arg,
 		.io_thread	= 1,
-		.user_worker	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..bc7e26072437 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !task_sigpending(p);
 }
 
+static void try_set_pending_sigkill(struct task_struct *t)
+{
+	/*
+	 * User workers don't support signals and their exit is driven through
+	 * their kernel layer, so by default block even SIGKILL.
+	 */
+	if (sigismember(&t->blocked, SIGKILL))
+		return;
+
+	sigaddset(&t->pending.signal, SIGKILL);
+	signal_wake_up(t, 1);
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 			t = p;
 			do {
 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-				sigaddset(&t->pending.signal, SIGKILL);
-				signal_wake_up(t, 1);
+				try_set_pending_sigkill(t);
 			} while_each_thread(p, t);
 			return;
 		}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
-		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+		try_set_pending_sigkill(t);
 	}
 
 	return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
 		.no_files	= 1,
-		.ignore_signals	= 1,
 	};
 	struct vhost_task *vtsk;
 	struct task_struct *tsk;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..255a2147e5c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle user workers, kthreads other than kswapd or
+	 * Do not throttle IO/user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
+	    current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) {
 		cond_resched();
 		return;
 	}



^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 22:23                 ` Mike Christie
  2023-05-15 22:54                   ` Linus Torvalds
@ 2023-05-16  8:39                   ` Christian Brauner
  2023-05-16 16:24                     ` Mike Christie
  2023-05-19 12:15                     ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
  1 sibling, 2 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-16  8:39 UTC (permalink / raw)
  To: Mike Christie, Linus Torvalds
  Cc: Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> On 5/15/23 10:44 AM, Linus Torvalds wrote:
> > On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
> >>
> >> So I think we will be able to address (1) and (2) by making vhost tasks
> >> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> >> and then having vhost handle get_signal() - as you mentioned - the same
> >> way io_uring already does. We should also remove the ignore_signals
> >> thing completely imho. I don't think we ever want to do this with user
> >> workers.
> > 
> > Right. That's what IO_URING does:
> > 
> >         if (args->io_thread) {
> >                 /*
> >                  * Mark us an IO worker, and block any signal that isn't
> >                  * fatal or STOP
> >                  */
> >                 p->flags |= PF_IO_WORKER;
> >                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> >         }
> > 
> > and I really think that vhost should basically do exactly what io_uring does.
> > 
> > Not because io_uring fundamentally got this right - but simply because
> > io_uring had almost all the same bugs (and then some), and what the
> > io_uring worker threads ended up doing was to basically zoom in on
> > "this works".
> > 
> > And it zoomed in on it largely by just going for "make it look as much
> > as possible as a real user thread", because every time the kernel
> > thread did something different, it just caused problems.
> > 
> > So I think the patch should just look something like the attached.
> > Mike, can you test this on whatever vhost test-suite?
> 
> I tried that approach already and it doesn't work because io_uring and vhost
> differ in that vhost drivers implement a device where each device has a vhost_task
> and the drivers have a file_operations for the device. When the vhost_task's
> parent gets signal like SIGKILL, then it will exit and call into the vhost
> driver's file_operations->release function. At this time, we need to do cleanup

But that's no reason why the vhost worker couldn't just be allowed to
exit on SIGKILL cleanly similar to io_uring. That's just describing the
current architecture which isn't a necessity afaict. And the helper
thread could, e.g., crash.

> like flush the device which uses the vhost_task. There is also the case where if
> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.

In a way I really don't like the patch below. Because this should be
solvable by adapting vhost workers. Right now, vhost is coming from a
kthread model and we ported it to a user worker model and the whole
point of this exercise has been that the workers behave more like
regular userspace processes. So my tendency is to not massage kernel
signal handling to now also include a special case for user workers in
addition to kthreads. That's just the wrong way around and then vhost
could've just stuck with kthreads in the first place.

So I'm fine with skipping over the freezing case for now but SIGKILL
should be handled imho. Only init and kthreads should get the luxury of
ignoring SIGKILL.

So, I'm afraid I'm asking for some work here from you, but how feasible would a
model be where vhost_worker(), similar to io_wq_worker(), gracefully
handles SIGKILL? Yes, I see there's

net.c:   .release = vhost_net_release
scsi.c:  .release = vhost_scsi_release
test.c:  .release = vhost_test_release
vdpa.c:  .release = vhost_vdpa_release
vsock.c: .release = virtio_transport_release
vsock.c: .release = vhost_vsock_dev_release

but that means you have all the basic logic in place and all of those
drivers also support the VHOST_RESET_OWNER ioctl which also stops the
vhost worker. I'm confident that a lot of this can be leveraged to just
clean up on SIGKILL.

So it feels like this should be achievable by adding a callback to
struct vhost_worker that gets called when vhost_worker() gets SIGKILL
and that all the users of vhost workers are forced to implement.
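
Purely as an illustration of the shape I have in mind (every name below is
made up, nothing like this exists yet):

	struct vhost_worker {
		/* existing fields elided */

		/*
		 * hypothetical: called from vhost_worker() once get_signal()
		 * reports a fatal signal, so the driver can flush outstanding
		 * work and release its resources before the task exits.
		 */
		void (*on_fatal_signal)(struct vhost_worker *worker);
	};

and in the worker loop, on SIGKILL, something along the lines of:

	if (fatal_signal_pending(current)) {
		worker->on_fatal_signal(worker);
		break;
	}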

Yes, it is more work but I think that's the right thing to do and not to
complicate our signal handling.

Worst case if this can't be done fast enough we'll have to revert the
vhost parts. I think the user worker parts are mostly sane and are
useful.
Thoughts?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16  3:53                     ` Mike Christie
@ 2023-05-16 13:18                       ` Oleg Nesterov
  2023-05-16 13:40                       ` Oleg Nesterov
  1 sibling, 0 replies; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-16 13:18 UTC (permalink / raw)
  To: Mike Christie
  Cc: Linus Torvalds, Christian Brauner, Thorsten Leemhuis,
	nicolas.dichtel, Linux kernel regressions list, hch, stefanha,
	jasowang, mst, sgarzare, virtualization, ebiederm, konrad.wilk,
	linux-kernel, Jens Axboe

On 05/15, Mike Christie wrote:
>
> Oleg and Christian,
>
>
> Below is an updated patch that doesn't check for PF_USER_WORKER in the
> signal.c code and instead will check for if we have blocked the signal.

Looks like I need to read the whole series... will try tomorrow.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
>  		p->flags |= PF_KTHREAD;
> -	if (args->user_worker)
> +	if (args->user_worker) {
> +		/*
> +		 * User workers are similar to io_threads but they do not
> +		 * support signals and cleanup is driven via another kernel
> +		 * interface so even SIGKILL is blocked.
> +		 */
>  		p->flags |= PF_USER_WORKER;
> +		siginitsetinv(&p->blocked, 0);

I never liked the fact that io-threads block the signals, this adds
another precedent... OK, this needs another discussion.

> +static void try_set_pending_sigkill(struct task_struct *t)
> +{
> +	/*
> +	 * User workers don't support signals and their exit is driven through
> +	 * their kernel layer, so by default block even SIGKILL.
> +	 */
> +	if (sigismember(&t->blocked, SIGKILL))
> +		return;
> +
> +	sigaddset(&t->pending.signal, SIGKILL);
> +	signal_wake_up(t, 1);
> +}

so why do you need this? to avoid fatal_signal_pending() or signal_pending()?

In the latter case this change is not enough.

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16  3:53                     ` Mike Christie
  2023-05-16 13:18                       ` Oleg Nesterov
@ 2023-05-16 13:40                       ` Oleg Nesterov
  1 sibling, 0 replies; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-16 13:40 UTC (permalink / raw)
  To: Mike Christie
  Cc: Linus Torvalds, Christian Brauner, Thorsten Leemhuis,
	nicolas.dichtel, Linux kernel regressions list, hch, stefanha,
	jasowang, mst, sgarzare, virtualization, ebiederm, konrad.wilk,
	linux-kernel, Jens Axboe

On 05/15, Mike Christie wrote:
>
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> +				  CLONE_THREAD | CLONE_SIGHAND,

I am looking at 6/8 on https://lore.kernel.org/lkml/ ...

with this change kernel_wait4() in vhost_task_stop() won't work?

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-05 13:40   ` Nicolas Dichtel
  2023-05-05 18:22     ` Linus Torvalds
@ 2023-05-16 14:06     ` Linux regression tracking #adding (Thorsten Leemhuis)
  2023-05-26  9:03       ` Linux regression tracking #update (Thorsten Leemhuis)
  2023-06-02 11:38       ` Thorsten Leemhuis
  1 sibling, 2 replies; 98+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-05-16 14:06 UTC (permalink / raw)
  To: nicolas.dichtel, Mike Christie, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, brauner, ebiederm, torvalds,
	konrad.wilk, linux-kernel

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 05.05.23 15:40, Nicolas Dichtel wrote:
> On 03/02/2023 at 00:25, Mike Christie wrote:
>> For vhost workers we use the kthread API which inherits its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting we can end up creating more
>> threads than the user wanted.
> 
> I have a question about (a side effect of?) this patch. The output of the 'ps'
> command has changed. Here is an example:
> [...]

Thanks for the report. This is already dealt with, but to be sure the
issue doesn't fall through the cracks unnoticed, I'm adding it to
regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced 6e890c5d502
#regzbot title vhost: ps output changed and suspend fails when VMs are
running
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-15 22:54                   ` Linus Torvalds
  2023-05-16  3:53                     ` Mike Christie
@ 2023-05-16 15:56                     ` Eric W. Biederman
  2023-05-16 18:37                       ` Oleg Nesterov
  1 sibling, 1 reply; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-16 15:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Christie, Oleg Nesterov, Christian Brauner,
	Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, konrad.wilk, linux-kernel, Jens Axboe

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, May 15, 2023 at 3:23 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> The vhost layer really doesn't want any signals and wants to work like kthreads
>> for that case. To make it really simple can we do something like this where it
>> separates user and io worker behavior where the major diff is how they handle
>> signals and exit. I also included a fix for the freeze case:
>
> I don't love the SIGKILL special case, but I also don't find this
> deeply offensive. So if this is what it takes, I'm ok with it.
>
> I wonder if we could make that special case simply check for "is
> SIGKILL blocked" instead? No normal case will cause that, and it means
> that a PF_USER_WORKER thread could decide per-thread what it wants to
> do wrt SIGKILL.

A kernel thread can block SIGKILL and that is supported.

For a thread that is part of a user mode process, however, you can't
block SIGKILL.

There is this bit in complete_signal when SIGKILL is delivered to any
thread in the process.


			/*
			 * Start a group exit and wake everybody up.
			 * This way we don't have other threads
			 * running and doing things after a slower
			 * thread has the fatal signal pending.
			 */
			signal->flags = SIGNAL_GROUP_EXIT;
			signal->group_exit_code = sig;
			signal->group_stop_count = 0;
			t = p;
			do {
				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
				sigaddset(&t->pending.signal, SIGKILL);
				signal_wake_up(t, 1);
			} while_each_thread(p, t);

For clarity that sigaddset(&t->pending.signal, SIGKILL);  Really isn't
setting SIGKILL pending, it is part of the short circuit delivery logic,
and that sigaddset(SIGKILL) is just setting a flag to tell the process
it needs to die.


The important part of that code is that SIGNAL_GROUP_EXIT gets set.
That indicates the entire process is being torn down.

Where this becomes important is exit_notify and release_task work
together to ensure that the first thread in the process (a user space
thread that cannot block SIGKILL) will not send SIGCHLD to its parent
process until every thread in the process has exited.

The delay_group_leader check in wait_consider_task, part of wait(2),
follows the same logic.
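
For reference, delay_group_leader() boils down to roughly this (exact
location and formatting may differ between kernel versions):

	/* the leader's reaping is deferred while other threads still live */
	#define delay_group_leader(p) \
		(thread_group_leader(p) && !thread_group_empty(p))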

Having been through this with io_uring, the threads really need to call
get_signal() to handle that case.


This is pretty much why I said at the outset that they needed to decide
if they were going to implement a thread or if they were going to be a
process.  Changing the decision to be a thread from a process is fine
but in that case the vhost logic needs to act like a process, just
like io_uring does.
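
A minimal sketch of what "act like a process" means for the worker loop
(worker_should_stop() is just a placeholder here, and this is not the
actual io_uring code):

	for (;;) {
		if (worker_should_stop())
			break;

		if (signal_pending(current)) {
			struct ksignal ksig;

			/*
			 * Participate in the group exit: dequeue the fatal
			 * signal instead of ignoring it, then fall out of
			 * the loop and run the worker's own cleanup before
			 * exiting.
			 */
			if (get_signal(&ksig))
				break;
		}

		/* ... handle queued work, then sleep ... */
	}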


> Christian? And I guess we should Cc: Oleg too, since the signal parts
> are an area he's familiar with and has worked on.. Eric Biederman has
> already been on the list and has also been involved

>
> Oleg: see
>
>   https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d496@oracle.com/
>
> for the context here.

Eric


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16  8:39                   ` Christian Brauner
@ 2023-05-16 16:24                     ` Mike Christie
  2023-05-16 16:44                       ` Christian Brauner
  2023-05-19 12:15                     ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
  1 sibling, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-16 16:24 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds
  Cc: Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On 5/16/23 3:39 AM, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
>> On 5/15/23 10:44 AM, Linus Torvalds wrote:
>>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
>>>>
>>>> So I think we will be able to address (1) and (2) by making vhost tasks
>>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>>>> and then having vhost handle get_signal() - as you mentioned - the same
>>>> way io uring already does. We should also remove the ignore_signals
>>>> thing completely imho. I don't think we ever want to do this with user
>>>> workers.
>>>
>>> Right. That's what IO_URING does:
>>>
>>>         if (args->io_thread) {
>>>                 /*
>>>                  * Mark us an IO worker, and block any signal that isn't
>>>                  * fatal or STOP
>>>                  */
>>>                 p->flags |= PF_IO_WORKER;
>>>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>>         }
>>>
>>> and I really think that vhost should basically do exactly what io_uring does.
>>>
>>> Not because io_uring fundamentally got this right - but simply because
>>> io_uring had almost all the same bugs (and then some), and what the
>>> io_uring worker threads ended up doing was to basically zoom in on
>>> "this works".
>>>
>>> And it zoomed in on it largely by just going for "make it look as much
>>> as possible as a real user thread", because every time the kernel
>>> thread did something different, it just caused problems.
>>>
>>> So I think the patch should just look something like the attached.
>>> Mike, can you test this on whatever vhost test-suite?
>>
>> I tried that approach already and it doesn't work because io_uring and vhost
>> differ in that vhost drivers implement a device where each device has a vhost_task
>> and the drivers have a file_operations for the device. When the vhost_task's
>> parent gets signal like SIGKILL, then it will exit and call into the vhost
>> driver's file_operations->release function. At this time, we need to do cleanup
> 
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly similar to io_uring. That's just describing the
> current architecture which isn't a necessity afaict. And the helper
> thread could e.g., crash.
> 
>> like flush the device which uses the vhost_task. There is also the case where if
>> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> 
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around and then vhost
> could've just stuck with kthreads in the first place.

I would have preferred that :) Maybe let's take a step back and revisit
that decision to make sure it was right. The vhost layer wants:

1. inherit cgroups.
2. share mm.
3. no signals
4. to not show up as an extra process like in Nicolas's bug.
5. have its worker threads counted under its parent nproc limit.

We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My first
attempt which passed around the creds to use for kthreads or exported a
helper for the nproc accounting was not liked and we eventually ended up
here.

Is this hybrid user/kernel thread/task still the right way to go or is
it better to use kthreads and add some way to handle #5?
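
For context on #5, the behavior we are after is just the normal
RLIMIT_NPROC accounting, i.e. roughly what libvirt-style tooling does
from userspace before launching the VM (illustrative snippet only, not
part of this patchset):

	#include <sys/resource.h>
	#include <unistd.h>

	int main(void)
	{
		struct rlimit rl = { .rlim_cur = 512, .rlim_max = 512 };

		if (setrlimit(RLIMIT_NPROC, &rl))
			return 1;
		/*
		 * Everything this process clones from here on, including
		 * copy_process based vhost workers, is checked against the
		 * owner's nproc limit. kthreads inherit from kthreadd and
		 * so are checked against the wrong limits.
		 */
		execlp("qemu-system-x86_64", "qemu-system-x86_64", (char *)NULL);
		return 1;
	}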


> 
> So I'm fine with skipping over the freezing case for now but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
> 
> So, I'm afraid I'm asking some work here of you but how feasible would a
> model be where vhost_worker() similar to io_wq_worker() gracefully
> handles SIGKILL. Yes, I see there's
> 
> net.c:   .release = vhost_net_release
> scsi.c:  .release = vhost_scsi_release
> test.c:  .release = vhost_test_release
> vdpa.c:  .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
> 
> but that means you have all the basic logic in place and all of those
> drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> cleanup on SIGKILL.

We can do this, but the issue I'm worried about is that right now if there
is queued/running IO and userspace escalates to SIGKILL, then the vhost layer
will still complete those IOs. If we now allow SIGKILL on the vhost thread,
then those IOs might fail.

If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
ignores the signal and allows IO/flushes/whatever-operations to complete
at that level. However, we could hit issues where when vhost_worker()
calls into the drivers listed above, and those drivers call into whatever
kernel layer they use, that might do

if (signal_pending(current))
	return failure;

and we now fail.

If we say that since we got a SIGKILL, then failing is acceptable behavior
now, I can code what you are requesting.




^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16 16:24                     ` Mike Christie
@ 2023-05-16 16:44                       ` Christian Brauner
  0 siblings, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-16 16:44 UTC (permalink / raw)
  To: Mike Christie
  Cc: Linus Torvalds, Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, ebiederm, konrad.wilk, linux-kernel,
	Jens Axboe

On Tue, May 16, 2023 at 11:24:48AM -0500, Mike Christie wrote:
> On 5/16/23 3:39 AM, Christian Brauner wrote:
> > On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> >> On 5/15/23 10:44 AM, Linus Torvalds wrote:
> >>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
> >>>>
> >>>> So I think we will be able to address (1) and (2) by making vhost tasks
> >>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> >>>> and then having vhost handle get_signal() - as you mentioned - the same
> >>>> way io uring already does. We should also remove the ignore_signals
> >>>> thing completely imho. I don't think we ever want to do this with user
> >>>> workers.
> >>>
> >>> Right. That's what IO_URING does:
> >>>
> >>>         if (args->io_thread) {
> >>>                 /*
> >>>                  * Mark us an IO worker, and block any signal that isn't
> >>>                  * fatal or STOP
> >>>                  */
> >>>                 p->flags |= PF_IO_WORKER;
> >>>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> >>>         }
> >>>
> >>> and I really think that vhost should basically do exactly what io_uring does.
> >>>
> >>> Not because io_uring fundamentally got this right - but simply because
> >>> io_uring had almost all the same bugs (and then some), and what the
> >>> io_uring worker threads ended up doing was to basically zoom in on
> >>> "this works".
> >>>
> >>> And it zoomed in on it largely by just going for "make it look as much
> >>> as possible as a real user thread", because every time the kernel
> >>> thread did something different, it just caused problems.
> >>>
> >>> So I think the patch should just look something like the attached.
> >>> Mike, can you test this on whatever vhost test-suite?
> >>
> >> I tried that approach already and it doesn't work because io_uring and vhost
> >> differ in that vhost drivers implement a device where each device has a vhost_task
> >> and the drivers have a file_operations for the device. When the vhost_task's
> >> parent gets signal like SIGKILL, then it will exit and call into the vhost
> >> driver's file_operations->release function. At this time, we need to do cleanup
> > 
> > But that's no reason why the vhost worker couldn't just be allowed to
> > exit on SIGKILL cleanly similar to io_uring. That's just describing the
> > current architecture which isn't a necessity afaict. And the helper
> > thread could e.g., crash.
> > 
> >> like flush the device which uses the vhost_task. There is also the case where if
> >> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> > 
> > In a way I really don't like the patch below. Because this should be
> > solvable by adapting vhost workers. Right now, vhost is coming from a
> > kthread model and we ported it to a user worker model and the whole
> > point of this exercise has been that the workers behave more like
> > regular userspace processes. So my tendency is to not massage kernel
> > signal handling to now also include a special case for user workers in
> > addition to kthreads. That's just the wrong way around and then vhost
> > could've just stuck with kthreads in the first place.
> 
> I would have preferred that :) Maybe let's take a step back and revisit
> that decision to make sure it was right. The vhost layer wants:
> 
> 1. inherit cgroups.
> 2. share mm.
> 3. no signals
> 4. to not show up as an extra process like in Nicolas's bug.
> 5. have its worker threads counted under its parent nproc limit.
> 
> We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My first
> attempt which passed around the creds to use for kthreads or exported a
> helper for the nproc accounting was not liked and we eventually ended up
> here.
> 
> Is this hybrid user/kernel thread/task still the right way to go or is
> it better to use kthreads and add some way to handle #5?

I think the io_uring model makes a lot more sense for vhost than the
current approach.

> 
> 
> > 
> > So I'm fine with skipping over the freezing case for now but SIGKILL
> > should be handled imho. Only init and kthreads should get the luxury of
> > ignoring SIGKILL.
> > 
> > So, I'm afraid I'm asking some work here of you but how feasible would a
> > model be where vhost_worker() similar to io_wq_worker() gracefully
> > handles SIGKILL. Yes, I see there's
> > 
> > net.c:   .release = vhost_net_release
> > scsi.c:  .release = vhost_scsi_release
> > test.c:  .release = vhost_test_release
> > vdpa.c:  .release = vhost_vdpa_release
> > vsock.c: .release = virtio_transport_release
> > vsock.c: .release = vhost_vsock_dev_release
> > 
> > but that means you have all the basic logic in place and all of those
> > drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> > vhost worker. I'm confident that a lot of this can be leveraged to just
> > cleanup on SIGKILL.
> 
> We can do this, but the issue I'm worried about is that right now if there
> is queued/running IO and userspace escalates to SIGKILL, then the vhost layer
> will still complete those IOs. If we now allow SIGKILL on the vhost thread,
> then those IOs might fail.
> 
> If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
> ignores the signal and allows IO/flushes/whatever-operations to complete
> at that level. However, we could hit issues where when vhost_worker()

It's really not that different from io_uring though which also flushes
out remaining io, no? This seems to basically line up with what
io_wq_worker() does.

> calls into the drivers listed above, and those drivers call into whatever
> kernel layer they use, that might do
> 
> if (signal_pending(current))
> 	return failure;
> 
> and we now fail.
> 
> If we say that since we got a SIGKILL, then failing is acceptable behavior
> now, I can code what you are requesting.

I think this is fine but I don't maintain vhost and we'd need their
opinion.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16 15:56                     ` Eric W. Biederman
@ 2023-05-16 18:37                       ` Oleg Nesterov
  2023-05-16 20:12                         ` Eric W. Biederman
  0 siblings, 1 reply; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-16 18:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Mike Christie, Christian Brauner,
	Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, konrad.wilk, linux-kernel, Jens Axboe

On 05/16, Eric W. Biederman wrote:
>
> A kernel thread can block SIGKILL and that is supported.
>
> For a thread that is part of a user mode process, however, you can't
> block SIGKILL.

Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up/etc.

> There is this bit in complete_signal when SIGKILL is delivered to any
> thread in the process.
>
> 			t = p;
> 			do {
> 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> 				sigaddset(&t->pending.signal, SIGKILL);
> 				signal_wake_up(t, 1);
> 			} while_each_thread(p, t);

That is why the latest version adds try_set_pending_sigkill(). No, no,
it is not that I think this is a good idea.

> For clarity that sigaddset(&t->pending.signal, SIGKILL);  Really isn't
> setting SIGKILL pending,

Hmm. it does? Nevermind.

> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
> That indicates the entire process is being torn down.

Yes. and the same is true for io-thread even if it calls get_signal()
and dequeues SIGKILL and clears TIF_SIGPENDING.

> but in that case the vhost logic needs to act like a process, just
> like io_uring does.

confused... create_io_thread() creates a sub-thread too?

Although I never understood this logic. I can't even understand the usage
of lower_32_bits() in create_io_thread().

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16 18:37                       ` Oleg Nesterov
@ 2023-05-16 20:12                         ` Eric W. Biederman
  2023-05-17 17:09                           ` Oleg Nesterov
  0 siblings, 1 reply; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-16 20:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Mike Christie, Christian Brauner,
	Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, konrad.wilk, linux-kernel, Jens Axboe

Oleg Nesterov <oleg@redhat.com> writes:

> On 05/16, Eric W. Biederman wrote:
>>
>> A kernel thread can block SIGKILL and that is supported.
>>
>> For a thread that is part of a user mode process, however, you can't
>> block SIGKILL.
>
> Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up/etc.

Yes, ignoring SIGSTOP leads to the same kind of rendezvous issues as
SIGKILL.

>> There is this bit in complete_signal when SIGKILL is delivered to any
>> thread in the process.
>>
>> 			t = p;
>> 			do {
>> 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>> 				sigaddset(&t->pending.signal, SIGKILL);
>> 				signal_wake_up(t, 1);
>> 			} while_each_thread(p, t);
>
> That is why the latest version adds try_set_pending_sigkill(). No, no,
> it is not that I think this is a good idea.

I see that try_set_pending_sigkill is in the patch now.

That try_set_pending_sigkill just keeps the process from reporting
that it has exited, and extends the process exit indefinitely.

SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
already delivered and the process is exiting.

>> For clarity that sigaddset(&t->pending.signal, SIGKILL);  Really isn't
>> setting SIGKILL pending,
>
> Hmm. it does? Nevermind.

The point is that what try_set_pending_sigkill in the patch is doing is
keeping the "you are dead exit now" flag, from being set.

That flag is what fatal_signal_pending always tests, because we can only
know if a fatal signal is pending if we have performed short circuit
delivery on the signal.
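
Roughly, from include/linux/sched/signal.h (the exact form may vary by
kernel version):

	static inline int __fatal_signal_pending(struct task_struct *p)
	{
		return unlikely(sigismember(&p->pending.signal, SIGKILL));
	}

	static inline int fatal_signal_pending(struct task_struct *p)
	{
		return signal_pending(p) && __fatal_signal_pending(p);
	}

i.e. it keys off exactly the per-thread SIGKILL bit that the short
circuit delivery sets.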

The result is that the effects of the change are mostly what people
expect.  The difference is that the semantics being changed aren't what
people think they are.

AKA process exit is being ignored for the thread, not that SIGKILL is
being blocked.

>> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
>> That indicates the entire process is being torn down.
>
> Yes. and the same is true for io-thread even if it calls get_signal()
> and dequeues SIGKILL and clears TIF_SIGPENDING.
>
>> but in that case the vhost logic needs to act like a process, just
>> like io_uring does.
>
> confused... create_io_thread() creates a sub-thread too?

Yes, create_io_thread creates an ordinary user space thread that never
runs any code in user space.

> Although I never understood this logic. I can't even understand the usage
> of lower_32_bits() in create_io_thread().

As far as I can tell lower_32_bits(flags) is just defensive programming
that copies the code in clone.  The code could just as easily have said
u32 flags, or have just populated .flags directly.  Then .exit_signal
could have been set to 0.  Later copy_process will set .exit_signal = -1
because CLONE_THREAD is set.

The reason for adding create_io_thread calling copy_process, as I recall,
was so that the new task does not start automatically.  This allows
functions like io_init_new_worker to initialize the new task without
races and then call wake_up_new_task.
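
A rough sketch of that pattern (paraphrasing the io-wq worker setup, not
the exact code):

	struct task_struct *tsk;

	tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
	if (IS_ERR(tsk))
		return PTR_ERR(tsk);

	/*
	 * tsk has been copied but not scheduled yet, so the caller can
	 * finish initializing the worker without racing against it...
	 */
	worker->task = tsk;

	/* ...and only then let it start executing io_wq_worker(). */
	wake_up_new_task(tsk);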

Eric


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16 20:12                         ` Eric W. Biederman
@ 2023-05-17 17:09                           ` Oleg Nesterov
  2023-05-17 18:22                             ` Mike Christie
  0 siblings, 1 reply; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-17 17:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Mike Christie, Christian Brauner,
	Thorsten Leemhuis, nicolas.dichtel,
	Linux kernel regressions list, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, konrad.wilk, linux-kernel, Jens Axboe

On 05/16, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> There is this bit in complete_signal when SIGKILL is delivered to any
> >> thread in the process.
> >>
> >> 			t = p;
> >> 			do {
> >> 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> >> 				sigaddset(&t->pending.signal, SIGKILL);
> >> 				signal_wake_up(t, 1);
> >> 			} while_each_thread(p, t);
> >
> > That is why the latest version adds try_set_pending_sigkill(). No, no,
> > it is not that I think this is a good idea.
>
> I see that try_set_pending_sigkill is in the patch now.
>
> That try_set_pending_sigkill just keeps the process from reporting
> that it has exited, and extends the process exit indefinitely.
>
> SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
> already delivered and the process is exiting.

Agreed, that is why I said I don't think try_set_pending_sigkill() is
a good idea.

And again, the same is true for the threads created by
create_io_thread(). get_signal() from io_uring/ can dequeue a pending
SIGKILL and return, but that is all.

> >> For clarity that sigaddset(&t->pending.signal, SIGKILL);  Really isn't
> >> setting SIGKILL pending,
> >
> > Hmm. it does? Nevermind.
>
> The point is that what try_set_pending_sigkill in the patch is doing is
> keeping the "you are dead, exit now" flag from being set.
>
> That flag is what fatal_signal_pending always tests, because we can only
> know if a fatal signal is pending if we have performed short circuit
> delivery on the signal.
>
> The result is that the effects of the change are mostly what people
> expect.  The difference is that the semantics being changed aren't what
> people think they are.
>
> AKA process exit is being ignored for the thread, not that SIGKILL is
> being blocked.

Sorry, I don't understand. I just tried to say that
sigaddset(&t->pending.signal, SIGKILL) really sets SIGKILL pending.
Nevermind.

> > Although I never understood this logic.

I meant I never really liked how io-threads play with signals,

> I can't even understand the usage
> > of lower_32_bits() in create_io_thread().
>
> As far as I can tell lower_32_bits(flags) is just defensive programming

Cough. but this is ugly. Or I missed something.

> or have just populated .flags directly.

Exactly,

> Then .exit_signal
> could have been set to 0.

Exactly.

-------------------------------------------------------------------------------
OK. It doesn't matter. I tried to read the whole thread and got lost.

IIUC, Mike is going to send the next version? So I think we can delay
the further discussions until then.

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-17 17:09                           ` Oleg Nesterov
@ 2023-05-17 18:22                             ` Mike Christie
  0 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-17 18:22 UTC (permalink / raw)
  To: Oleg Nesterov, Eric W. Biederman
  Cc: Linus Torvalds, Christian Brauner, Thorsten Leemhuis,
	nicolas.dichtel, Linux kernel regressions list, hch, stefanha,
	jasowang, mst, sgarzare, virtualization, konrad.wilk,
	linux-kernel, Jens Axboe

On 5/17/23 12:09 PM, Oleg Nesterov wrote:
> IIUC, Mike is going to send the next version? So I think we can delay
> the further discussions until then.

Yeah, I'm working on a version that supports signals so it will be easier
to discuss with the vhost devs and you, Christian and Eric.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
@ 2023-05-18  0:09 Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
                   ` (8 more replies)
  0 siblings, 9 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

This patchset allows the vhost and vhost_task code to use CLONE_THREAD,
CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
normal testing, haven't converted vsock and vdpa, and I know you guys
will not like the first patch. However, I think it better shows what
we need from the signal code and how we can support signals in the
vhost_task layer.

Note that I took the super simple route and kicked off some work to
the system workqueue. We can do more invasive approaches:
1. Modify the vhost drivers so they can check for IO completions using
a non-blocking interface. We then don't need to run from the system
workqueue and can run from the vhost_task.

2. We could drop patch 1 and just say we are doing a polling type
of approach. We then modify the vhost layer similar to #1 where we
can check for completions using a non-blocking interface and use
the vhost_task itself.




^ permalink raw reply	[flat|nested] 98+ messages in thread

* [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  2:34   ` Eric W. Biederman
                     ` (2 more replies)
  2023-05-18  0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
set when we are dealing with PF_USER_WORKER tasks.

When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
We can easily stop new work/IO from being queued to the vhost_task, but
for IO that's already been sent to something like the block layer we
need to wait for the response then process it. These type of IO
completions use the vhost_task to process the completion so we can't
exit immediately.

We need to wait for and then handle those completions from the
vhost_task, but when we have a SIGKILL pending, functions like
schedule() return immediately so we can't wait like normal. Functions
like vhost_worker() degrade to just a while(1); loop.
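
The problem pattern, simplified (work_available() here is just shorthand
for vhost_worker()'s work list check):

	set_current_state(TASK_INTERRUPTIBLE);
	if (!work_available()) {
		/*
		 * With SIGKILL pending this returns immediately instead
		 * of sleeping, so the loop spins until the completions
		 * it is waiting on have been processed.
		 */
		schedule();
	}
	__set_current_state(TASK_RUNNING);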

This patch has get_signal drop down to the normal code path when
SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
there is a SIGKILL but still perform some blocking cleanup.

Note that I'm now bypassing the chunk that does:

sigdelset(&current->pending.signal, SIGKILL);

We look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
group_exec_task we are already doing that on the threads in the
group.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 kernel/signal.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..ae4972eea5db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
 		struct k_sigaction *ka;
 		enum pid_type type;
 
-		/* Has this task already been marked for death? */
-		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
-		     signal->group_exec_task) {
+		/*
+		 * Has this task already been marked for death?
+		 *
+		 * If this is a PF_USER_WORKER then the task may need to do
+		 * extra work that requires waiting on running work, so we want
+		 * to dequeue the signal below and tell the caller its time to
+		 * start its exit procedure. When the work has completed then
+		 * the task will exit.
+		 */
+		if (!(current->flags & PF_USER_WORKER) &&
+		    ((signal->flags & SIGNAL_GROUP_EXIT) ||
+		     signal->group_exec_task)) {
 			clear_siginfo(&ksig->info);
 			ksig->info.si_signo = signr = SIGKILL;
 			sigdelset(&current->pending.signal, SIGKILL);
@@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
 		}
 
 		/*
-		 * PF_IO_WORKER threads will catch and exit on fatal signals
+		 * PF_USER_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
 		 * we cannot call do_exit() on their behalf.
 		 */
-		if (current->flags & PF_IO_WORKER)
+		if (current->flags & PF_USER_WORKER)
 			goto out;
 
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  0:16   ` Linus Torvalds
  2023-05-18  0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

This patch has vhost use get_signal to handle freezing and sort of
handle signals. By the latter I mean that when we get SIGKILL, our
parent will exit and call our file_operations release function. That will
then stop new work from being queued and wait for the vhost_task to
handle completions for running IO. We then exit when those are done.

The next patches will then have us work more like io_uring where
we handle the get_signal return value and key off that to cleanup.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c            | 10 +++++++++-
 include/linux/sched/vhost_task.h |  1 +
 kernel/vhost_task.c              | 20 ++++++++++++++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
 		}
 
 		node = llist_del_all(&worker->work_list);
-		if (!node)
+		if (!node) {
 			schedule();
+			/*
+			 * When we get a SIGKILL our release function will
+			 * be called. That will stop new IOs from being queued
+			 * and check for outstanding cmd responses. It will then
+			 * call vhost_task_stop to exit us.
+			 */
+			vhost_task_get_signal();
+		}
 
 		node = llist_reverse_order(node);
 		/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..54b68115eb3b 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
 bool vhost_task_should_stop(struct vhost_task *vtsk);
+bool vhost_task_get_signal(void);
 
 #endif
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..a661cfa32ba3 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
 }
 EXPORT_SYMBOL_GPL(vhost_task_should_stop);
 
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * Return true if we got SIGKILL.
+ */
+bool vhost_task_get_signal(void)
+{
+	struct ksignal ksig;
+	bool rc;
+
+	if (!signal_pending(current))
+		return false;
+
+	__set_current_state(TASK_RUNNING);
+	rc = get_signal(&ksig);
+	set_current_state(TASK_INTERRUPTIBLE);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
 /**
  * vhost_task_create - create a copy of a process to be used by the kernel
  * @fn: thread stack
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  8:18   ` Christian Brauner
  2023-05-18  0:09 ` [RFC PATCH 4/8] vhost-net: Move vhost_net_open Mike Christie
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

This is a modified version of Linus's patch which has vhost_task
use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.

I renamed ignore_signals to block_signals based on Linus's comment,
since it aligns with what we are doing with the siginitsetinv
p->blocked use and we are no longer calling ignore_signals.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 include/linux/sched/task.h |  2 +-
 kernel/fork.c              | 12 +++---------
 kernel/vhost_task.c        |  5 +++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
+	u32 block_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
 		p->flags |= PF_KTHREAD;
 	if (args->user_worker)
 		p->flags |= PF_USER_WORKER;
-	if (args->io_thread) {
-		/*
-		 * Mark us an IO worker, and block any signal that isn't
-		 * fatal or STOP
-		 */
+	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
+	if (args->block_signals)
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-	}
 
 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
-
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn_arg		= arg,
 		.io_thread	= 1,
 		.user_worker	= 1,
+		.block_signals	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a661cfa32ba3..a11f036290cc 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
 		.no_files	= 1,
-		.ignore_signals	= 1,
+		.block_signals	= 1,
 	};
 	struct vhost_task *vtsk;
 	struct task_struct *tsk;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 4/8] vhost-net: Move vhost_net_open
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
                   ` (2 preceding siblings ...)
  2023-05-18  0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

This moves vhost_net_open so in the next patches we can pass
vhost_dev_init a new helper which will use the stop/flush functions.
There are no functionality changes in this patch.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c | 134 ++++++++++++++++++++++----------------------
 1 file changed, 67 insertions(+), 67 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 07181cd8d52e..8557072ff05e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1285,73 +1285,6 @@ static void handle_rx_net(struct vhost_work *work)
 	handle_rx(net);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
-{
-	struct vhost_net *n;
-	struct vhost_dev *dev;
-	struct vhost_virtqueue **vqs;
-	void **queue;
-	struct xdp_buff *xdp;
-	int i;
-
-	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
-	if (!n)
-		return -ENOMEM;
-	vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
-	if (!vqs) {
-		kvfree(n);
-		return -ENOMEM;
-	}
-
-	queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
-			      GFP_KERNEL);
-	if (!queue) {
-		kfree(vqs);
-		kvfree(n);
-		return -ENOMEM;
-	}
-	n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
-
-	xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
-	if (!xdp) {
-		kfree(vqs);
-		kvfree(n);
-		kfree(queue);
-		return -ENOMEM;
-	}
-	n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
-
-	dev = &n->dev;
-	vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
-	vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
-	n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
-	for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
-		n->vqs[i].ubufs = NULL;
-		n->vqs[i].ubuf_info = NULL;
-		n->vqs[i].upend_idx = 0;
-		n->vqs[i].done_idx = 0;
-		n->vqs[i].batched_xdp = 0;
-		n->vqs[i].vhost_hlen = 0;
-		n->vqs[i].sock_hlen = 0;
-		n->vqs[i].rx_ring = NULL;
-		vhost_net_buf_init(&n->vqs[i].rxq);
-	}
-	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
-		       UIO_MAXIOV + VHOST_NET_BATCH,
-		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-		       NULL);
-
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-
-	f->private_data = n;
-	n->page_frag.page = NULL;
-	n->refcnt_bias = 0;
-
-	return 0;
-}
-
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 					struct vhost_virtqueue *vq)
 {
@@ -1421,6 +1354,73 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	return 0;
 }
 
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n;
+	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs;
+	void **queue;
+	struct xdp_buff *xdp;
+	int i;
+
+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	if (!n)
+		return -ENOMEM;
+	vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kvfree(n);
+		return -ENOMEM;
+	}
+
+	queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
+			      GFP_KERNEL);
+	if (!queue) {
+		kfree(vqs);
+		kvfree(n);
+		return -ENOMEM;
+	}
+	n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
+
+	xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
+	if (!xdp) {
+		kfree(vqs);
+		kvfree(n);
+		kfree(queue);
+		return -ENOMEM;
+	}
+	n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
+
+	dev = &n->dev;
+	vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+	vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+	n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+	for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
+		n->vqs[i].ubufs = NULL;
+		n->vqs[i].ubuf_info = NULL;
+		n->vqs[i].upend_idx = 0;
+		n->vqs[i].done_idx = 0;
+		n->vqs[i].batched_xdp = 0;
+		n->vqs[i].vhost_hlen = 0;
+		n->vqs[i].sock_hlen = 0;
+		n->vqs[i].rx_ring = NULL;
+		vhost_net_buf_init(&n->vqs[i].rxq);
+	}
+	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
+		       UIO_MAXIOV + VHOST_NET_BATCH,
+		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
+		       NULL);
+
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+
+	f->private_data = n;
+	n->page_frag.page = NULL;
+	n->refcnt_bias = 0;
+
+	return 0;
+}
+
 static struct socket *get_raw_socket(int fd)
 {
 	int r;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
                   ` (3 preceding siblings ...)
  2023-05-18  0:09 ` [RFC PATCH 4/8] vhost-net: Move vhost_net_open Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18 14:18   ` Christian Brauner
  2023-05-18  0:09 ` [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works Mike Christie
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

When the vhost_task gets a SIGKILL we want to stop new work from being
queued and also wait for and handle completions for running work. For the
latter, we still need to use the vhost_task to handle the completing work
so we can't just exit right away. Instead, this has us kick off the stopping
and flushing/stopping of the device/vhost_task/worker to the system
workqueue while the vhost_task handles completions. When all completions are
done we will then do vhost_task_stop and we will exit.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c   |  2 +-
 drivers/vhost/scsi.c  |  4 ++--
 drivers/vhost/test.c  |  3 ++-
 drivers/vhost/vdpa.c  |  2 +-
 drivers/vhost/vhost.c | 48 ++++++++++++++++++++++++++++++++++++-------
 drivers/vhost/vhost.h | 10 ++++++++-
 drivers/vhost/vsock.c |  4 ++--
 7 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8557072ff05e..90c25127b3f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
 		       UIO_MAXIOV + VHOST_NET_BATCH,
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-		       NULL);
+		       NULL, NULL);
 
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..40f9135e1a62 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		vqs[i] = &vs->vqs[i].vq;
 		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
 	}
-	vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV,
-		       VHOST_SCSI_WEIGHT, 0, true, NULL);
+	vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
+		       true, NULL, NULL);
 
 	vhost_scsi_init_inflight(vs, NULL);
 
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 42c955a5b211..11a2823d7532 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -120,7 +120,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
 	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
-		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
+		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL,
+		       NULL);
 
 	f->private_data = n;
 
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 8c1aefc865f0..de9a83ecb70d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
 		vqs[i]->handle_kick = handle_vq_kick;
 	}
 	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
-		       vhost_vdpa_process_iotlb_msg);
+		       vhost_vdpa_process_iotlb_msg, NULL);
 
 	r = vhost_vdpa_alloc_domain(v);
 	if (r)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1ba9e068b2ab..4163c86db50c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -352,12 +353,13 @@ static int vhost_worker(void *data)
 		if (!node) {
 			schedule();
 			/*
-			 * When we get a SIGKILL our release function will
-			 * be called. That will stop new IOs from being queued
-			 * and check for outstanding cmd responses. It will then
-			 * call vhost_task_stop to exit us.
+			 * When we get a SIGKILL we kick off a work to
+			 * run the driver's helper to stop new work and
+			 * handle completions. When they are done they will
+			 * call vhost_task_stop to tell us to exit.
 			 */
-			vhost_task_get_signal();
+			if (vhost_task_get_signal())
+				schedule_work(&dev->destroy_worker);
 		}
 
 		node = llist_reverse_order(node);
@@ -376,6 +378,33 @@ static int vhost_worker(void *data)
 	return 0;
 }
 
+static void __vhost_dev_stop_work(struct vhost_dev *dev)
+{
+	mutex_lock(&dev->stop_work_mutex);
+	if (dev->work_stopped)
+		goto done;
+
+	if (dev->stop_dev_work)
+		dev->stop_dev_work(dev);
+	dev->work_stopped = true;
+done:
+	mutex_unlock(&dev->stop_work_mutex);
+}
+
+void vhost_dev_stop_work(struct vhost_dev *dev)
+{
+	__vhost_dev_stop_work(dev);
+	flush_work(&dev->destroy_worker);
+}
+EXPORT_SYMBOL_GPL(vhost_dev_stop_work);
+
+static void vhost_worker_destroy(struct work_struct *work)
+{
+	struct vhost_dev *dev = container_of(work, struct vhost_dev,
+					     destroy_worker);
+	__vhost_dev_stop_work(dev);
+}
+
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 {
 	kfree(vq->indirect);
@@ -464,7 +493,8 @@ void vhost_dev_init(struct vhost_dev *dev,
 		    int iov_limit, int weight, int byte_weight,
 		    bool use_worker,
 		    int (*msg_handler)(struct vhost_dev *dev, u32 asid,
-				       struct vhost_iotlb_msg *msg))
+				       struct vhost_iotlb_msg *msg),
+		    void (*stop_dev_work)(struct vhost_dev *dev))
 {
 	struct vhost_virtqueue *vq;
 	int i;
@@ -472,6 +502,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
+	mutex_init(&dev->stop_work_mutex);
 	dev->log_ctx = NULL;
 	dev->umem = NULL;
 	dev->iotlb = NULL;
@@ -482,12 +513,14 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
+	dev->work_stopped = false;
+	dev->stop_dev_work = stop_dev_work;
+	INIT_WORK(&dev->destroy_worker, vhost_worker_destroy);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
 	spin_lock_init(&dev->iotlb_lock);
 
-
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
 		vq->log = NULL;
@@ -572,6 +605,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
 	if (!worker)
 		return -ENOMEM;
 
+	worker->dev = dev;
 	dev->worker = worker;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0308638cdeee..325e5e52c7ae 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -17,6 +17,7 @@
 
 struct vhost_work;
 struct vhost_task;
+struct vhost_dev;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
 #define VHOST_WORK_QUEUED 1
@@ -28,6 +29,7 @@ struct vhost_work {
 
 struct vhost_worker {
 	struct vhost_task	*vtsk;
+	struct vhost_dev	*dev;
 	struct llist_head	work_list;
 	u64			kcov_handle;
 };
@@ -165,8 +167,12 @@ struct vhost_dev {
 	int weight;
 	int byte_weight;
 	bool use_worker;
+	struct mutex stop_work_mutex;
+	bool work_stopped;
+	struct work_struct destroy_worker;
 	int (*msg_handler)(struct vhost_dev *dev, u32 asid,
 			   struct vhost_iotlb_msg *msg);
+	void (*stop_dev_work)(struct vhost_dev *dev);
 };
 
 bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
@@ -174,7 +180,8 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
 		    int nvqs, int iov_limit, int weight, int byte_weight,
 		    bool use_worker,
 		    int (*msg_handler)(struct vhost_dev *dev, u32 asid,
-				       struct vhost_iotlb_msg *msg));
+				       struct vhost_iotlb_msg *msg),
+		    void (*stop_dev_work)(struct vhost_dev *dev));
 long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
@@ -182,6 +189,7 @@ struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
 void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *iotlb);
 void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
+void vhost_dev_stop_work(struct vhost_dev *dev);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..1ef53722d494 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -664,8 +664,8 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
 
 	vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs),
-		       UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT,
-		       VHOST_VSOCK_WEIGHT, true, NULL);
+		       UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT, VHOST_VSOCK_WEIGHT,
+		       true, NULL, NULL);
 
 	file->private_data = vsock;
 	skb_queue_head_init(&vsock->send_pkt_queue);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
                   ` (4 preceding siblings ...)
  2023-05-18  0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 7/8] vhost-net: " Mike Christie
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

This moves the scsi code we use to stop new works from being queued
and wait on running works to a helper which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 40f9135e1a62..a0f2588270f2 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 	return 0;
 }
 
+static void vhost_scsi_stop_dev_work(struct vhost_dev *dev)
+{
+	struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev);
+	struct vhost_scsi_target t;
+
+	mutex_lock(&vs->dev.mutex);
+	memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
+	mutex_unlock(&vs->dev.mutex);
+	vhost_scsi_clear_endpoint(vs, &t);
+	vhost_dev_stop(&vs->dev);
+	vhost_dev_cleanup(&vs->dev);
+}
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi *vs;
@@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
 	}
 	vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
-		       true, NULL, NULL);
+		       true, NULL, vhost_scsi_stop_dev_work);
 
 	vhost_scsi_init_inflight(vs, NULL);
 
@@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 static int vhost_scsi_release(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi *vs = f->private_data;
-	struct vhost_scsi_target t;
 
-	mutex_lock(&vs->dev.mutex);
-	memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
-	mutex_unlock(&vs->dev.mutex);
-	vhost_scsi_clear_endpoint(vs, &t);
-	vhost_dev_stop(&vs->dev);
-	vhost_dev_cleanup(&vs->dev);
+	vhost_dev_stop_work(&vs->dev);
 	kfree(vs->dev.vqs);
 	kfree(vs->vqs);
 	kfree(vs->old_inflight);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 7/8] vhost-net: Add callback to stop and wait on works
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
                   ` (5 preceding siblings ...)
  2023-05-18  0:09 ` [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
  2023-05-18  8:25 ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
  8 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

This moves the net code we use to stop new works from being queued
and wait on running works to a helper which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 90c25127b3f8..f8a5527b15ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1325,9 +1325,9 @@ static void vhost_net_flush(struct vhost_net *n)
 	}
 }
 
-static int vhost_net_release(struct inode *inode, struct file *f)
+static void vhost_net_stop_dev_work(struct vhost_dev *dev)
 {
-	struct vhost_net *n = f->private_data;
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
 	struct socket *tx_sock;
 	struct socket *rx_sock;
 
@@ -1345,6 +1345,13 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+}
+
+static int vhost_net_release(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = f->private_data;
+
+	vhost_dev_stop_work(&n->dev);
 	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
 	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
 	kfree(n->dev.vqs);
@@ -1409,7 +1416,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
 		       UIO_MAXIOV + VHOST_NET_BATCH,
 		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-		       NULL, NULL);
+		       NULL, vhost_net_stop_dev_work);
 
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [RFC PATCH 8/8] fork/vhost_task: remove no_files
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
                   ` (6 preceding siblings ...)
  2023-05-18  0:09 ` [RFC PATCH 7/8] vhost-net: " Mike Christie
@ 2023-05-18  0:09 ` Mike Christie
  2023-05-18  1:04   ` Mike Christie
  2023-05-18  8:25 ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
  8 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18  0:09 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner
  Cc: Mike Christie

The vhost_task can now support the worker being freed from under the
device when we get a SIGKILL or the process exits without closing
devices. We no longer need no_files so this removes it.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 include/linux/sched/task.h |  1 -
 kernel/fork.c              | 10 ++--------
 kernel/vhost_task.c        |  3 +--
 3 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 249a5ece9def..342fe297ffd4 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -28,7 +28,6 @@ struct kernel_clone_args {
 	u32 kthread:1;
 	u32 io_thread:1;
 	u32 user_worker:1;
-	u32 no_files:1;
 	u32 block_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e04ab5c3946..f2c081c15efb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
-		      int no_files)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
 	if (!oldf)
 		goto out;
 
-	if (no_files) {
-		tsk->files = NULL;
-		goto out;
-	}
-
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p, args->no_files);
+	retval = copy_files(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a11f036290cc..642047765190 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 {
 	struct kernel_clone_args args = {
 		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
-				  CLONE_THREAD | CLONE_SIGHAND,
+				  CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
-		.no_files	= 1,
 		.block_signals	= 1,
 	};
 	struct vhost_task *vtsk;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
  2023-05-18  0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
@ 2023-05-18  0:16   ` Linus Torvalds
  2023-05-18  1:01     ` Mike Christie
  0 siblings, 1 reply; 98+ messages in thread
From: Linus Torvalds @ 2023-05-18  0:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha, brauner

On Wed, May 17, 2023 at 5:09 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> +       __set_current_state(TASK_RUNNING);
> +       rc = get_signal(&ksig);
> +       set_current_state(TASK_INTERRUPTIBLE);
> +       return rc;

The games with current_state seem nonsensical.

What are they all about? get_signal() shouldn't care, and no other
caller does this thing. This just seems completely random.

      Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
  2023-05-18  0:16   ` Linus Torvalds
@ 2023-05-18  1:01     ` Mike Christie
  2023-05-18  8:16       ` Christian Brauner
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18  1:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha, brauner

On 5/17/23 7:16 PM, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 5:09 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> +       __set_current_state(TASK_RUNNING);
>> +       rc = get_signal(&ksig);
>> +       set_current_state(TASK_INTERRUPTIBLE);
>> +       return rc;
> 
> The games with current_state seem nonsensical.
> 
> What are they all about? get_signal() shouldn't care, and no other
> caller does this thing. This just seems completely random.

Sorry. It's a leftover.

I was originally calling this from vhost_task_should_stop where before
calling that function we do a:

set_current_state(TASK_INTERRUPTIBLE);

So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
and was getting the "do not call blocking ops when !TASK_RUNNING"
warnings.
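
Roughly, the call site looked like this (simplified illustration, not
the posted code):

	set_current_state(TASK_INTERRUPTIBLE);
	/*
	 * vhost_task_should_stop() ended up calling get_signal(), and
	 * get_signal() -> try_to_freeze() -> might_sleep() complained
	 * because the task state was still TASK_INTERRUPTIBLE here.
	 */
	if (vhost_task_should_stop(vtsk))
		break;
	schedule();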

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 8/8] fork/vhost_task: remove no_files
  2023-05-18  0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
@ 2023-05-18  1:04   ` Mike Christie
  0 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18  1:04 UTC (permalink / raw)
  To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

On 5/17/23 7:09 PM, Mike Christie wrote:
> +				  CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,

Sorry. I tried to throw this one in at the last second so we could see
that we can now use CLONE_FILES like io_uring.
It will of course not compile.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
@ 2023-05-18  2:34   ` Eric W. Biederman
  2023-05-18  3:49   ` Eric W. Biederman
  2023-05-18  8:08   ` Christian Brauner
  2 siblings, 0 replies; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-18  2:34 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, torvalds, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha, brauner

Mike Christie <michael.christie@oracle.com> writes:

> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.

> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.


I understand the concern.

> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
>
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
>
> Note that the chunk I'm now bypassing does:
>
> sigdelset(&current->pending.signal, SIGKILL);
>
> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.

What you are doing does not make any sense to me.

First there is the semantic non-sense, of queuing something that
is not a signal.   The per task SIGKILL bit is used as a flag with
essentially the same meaning as SIGNAL_GROUP_EXIT, reporting that
the task has been scheduled for exit.

More so is what happens afterwards.

As I read your patch it is roughly equivalent to doing:

	if ((current->flags & PF_USER_WORKER) &&
       	    fatal_signal_pending(current)) {
		sigdelset(&current->pending.signal, SIGKILL);
	        clear_siginfo(&ksig->info);
                ksig->info.si_signo = SIGKILL;
                ksig->info.si_code = SI_USER;
                recalc_sigpending();
		trace_signal_deliver(SIGKILL, &ksig->info,
			&sighand->action[SIGKILL - 1]);
                goto fatal;
	}

Before the "(SIGNAL_GROUP_EXIT || signal->group_exec_task)" test.

To get that code I stripped the active statements out of the
dequeue_signal path that the code executes after your change below.

I don't get why you are making this change though, because the code you
are opting out of does:

		/* Has this task already been marked for death? */
		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
		     signal->group_exec_task) {
			clear_siginfo(&ksig->info);
			ksig->info.si_signo = signr = SIGKILL;
			sigdelset(&current->pending.signal, SIGKILL);
			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
				&sighand->action[SIGKILL - 1]);
			recalc_sigpending();
			goto fatal;
		}

I don't see what changes in practice, other than the fact that by going
through the ordinary dequeue_signal path other signals can be
processed after a SIGKILL has arrived.  Of course those signals should
all be blocked.




The trailing bit that expands the PF_IO_WORKER test to be PF_USER_WORKER
appears reasonable, and possibly needed.

Eric


> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  kernel/signal.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..ae4972eea5db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
>  		struct k_sigaction *ka;
>  		enum pid_type type;
>  
> -		/* Has this task already been marked for death? */
> -		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> -		     signal->group_exec_task) {
> +		/*
> +		 * Has this task already been marked for death?
> +		 *
> +		 * If this is a PF_USER_WORKER then the task may need to do
> +		 * extra work that requires waiting on running work, so we want
> +		 * to dequeue the signal below and tell the caller its time to
> +		 * start its exit procedure. When the work has completed then
> +		 * the task will exit.
> +		 */
> +		if (!(current->flags & PF_USER_WORKER) &&
> +		    ((signal->flags & SIGNAL_GROUP_EXIT) ||
> +		     signal->group_exec_task)) {
>  			clear_siginfo(&ksig->info);
>  			ksig->info.si_signo = signr = SIGKILL;
>  			sigdelset(&current->pending.signal, SIGKILL);
> @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
>  		}
>  
>  		/*
> -		 * PF_IO_WORKER threads will catch and exit on fatal signals
> +		 * PF_USER_WORKER threads will catch and exit on fatal signals
>  		 * themselves. They have cleanup that must be performed, so
>  		 * we cannot call do_exit() on their behalf.
>  		 */
> -		if (current->flags & PF_IO_WORKER)
> +		if (current->flags & PF_USER_WORKER)
>  			goto out;
>  
>  		/*

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
  2023-05-18  2:34   ` Eric W. Biederman
@ 2023-05-18  3:49   ` Eric W. Biederman
  2023-05-18 15:21     ` Mike Christie
  2023-05-18  8:08   ` Christian Brauner
  2 siblings, 1 reply; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-18  3:49 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, torvalds, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha, brauner


Long story short.

In the patch below the first hunk is a noop.

The code you are bypassing was added to ensure that process termination
(aka SIGKILL) is processed before any other signals.  Other than signal
processing order there are not any substantive differences in the two
code paths.  With all signals except SIGSTOP == 19 and SIGKILL == 9
blocked SIGKILL should always be processed before SIGSTOP.

Can you try patch with just the last hunk that does
s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough?

I have no objections to the final hunk.

Mike Christie <michael.christie@oracle.com> writes:

> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.
>
> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.
>
> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
>
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
>
> Note that the chunk I'm now bypassing does:
>
> sigdelset(&current->pending.signal, SIGKILL);
>
> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  kernel/signal.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..ae4972eea5db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
>  		struct k_sigaction *ka;
>  		enum pid_type type;
>  
> -		/* Has this task already been marked for death? */
> -		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> -		     signal->group_exec_task) {
> +		/*
> +		 * Has this task already been marked for death?
> +		 *
> +		 * If this is a PF_USER_WORKER then the task may need to do
> +		 * extra work that requires waiting on running work, so we want
> +		 * to dequeue the signal below and tell the caller its time to
> +		 * start its exit procedure. When the work has completed then
> +		 * the task will exit.
> +		 */
> +		if (!(current->flags & PF_USER_WORKER) &&
> +		    ((signal->flags & SIGNAL_GROUP_EXIT) ||
> +		     signal->group_exec_task)) {
>  			clear_siginfo(&ksig->info);
>  			ksig->info.si_signo = signr = SIGKILL;
>  			sigdelset(&current->pending.signal, SIGKILL);

This hunk is a confusing no-op.

> @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
>  		}
>  
>  		/*
> -		 * PF_IO_WORKER threads will catch and exit on fatal signals
> +		 * PF_USER_WORKER threads will catch and exit on fatal signals
>  		 * themselves. They have cleanup that must be performed, so
>  		 * we cannot call do_exit() on their behalf.
>  		 */
> -		if (current->flags & PF_IO_WORKER)
> +		if (current->flags & PF_USER_WORKER)
>  			goto out;
>  
>  		/*

This hunk is good and makes sense.

Eric

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
  2023-05-18  2:34   ` Eric W. Biederman
  2023-05-18  3:49   ` Eric W. Biederman
@ 2023-05-18  8:08   ` Christian Brauner
  2023-05-18 15:27     ` Mike Christie
  2 siblings, 1 reply; 98+ messages in thread
From: Christian Brauner @ 2023-05-18  8:08 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Wed, May 17, 2023 at 07:09:13PM -0500, Mike Christie wrote:
> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.
> 
> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.
> 
> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
> 
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
> 
> Note that the chunk I'm now bypassing does:
> 
> sigdelset(&current->pending.signal, SIGKILL);
> 
> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---

I think you just got confused by the original discussion that was split
into two separate threads:

(1) The discussion based on your original proposal to adjust the signal
    handling logic to accommodate vhost workers as they are right now.
    That's where Oleg jumped in.
(2) My request - which you did in this series - of rewriting vhost
    workers to behave more like io_uring workers.

Both problems are orthogonal. The gist of my proposal is to avoid (1) by
doing (2). So the only change that's needed is
s/PF_IO_WORKER/PF_USER_WORKER/ which is pretty obvious as io_uring
workers and vhost workers now almost fully collapse into the same
concept.

So forget (1). If additional signal patches are needed as discussed in
(1) then it must be because of a bug that would affect io_uring workers
today.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
  2023-05-18  1:01     ` Mike Christie
@ 2023-05-18  8:16       ` Christian Brauner
  0 siblings, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-18  8:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: Linus Torvalds, oleg, linux, nicolas.dichtel, axboe, ebiederm,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Wed, May 17, 2023 at 08:01:45PM -0500, Mike Christie wrote:
> On 5/17/23 7:16 PM, Linus Torvalds wrote:
> > On Wed, May 17, 2023 at 5:09 PM Mike Christie
> > <michael.christie@oracle.com> wrote:
> >>
> >> +       __set_current_state(TASK_RUNNING);
> >> +       rc = get_signal(&ksig);
> >> +       set_current_state(TASK_INTERRUPTIBLE);
> >> +       return rc;
> > 
> > The games with current_state seem nonsensical.
> > 
> > What are they all about? get_signal() shouldn't care, and no other
> > caller does this thing. This just seems completely random.
> 
> Sorry. It's a leftover.
> 
> I was originally calling this from vhost_task_should_stop where before
> calling that function we do a:
> 
> set_current_state(TASK_INTERRUPTIBLE);
> 
> So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
> and was getting the "do not call blocking ops when !TASK_RUNNING"
> warnings.

Also, it seems you might want to check the return value of your new helper...

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND
  2023-05-18  0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
@ 2023-05-18  8:18   ` Christian Brauner
  0 siblings, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-18  8:18 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Wed, May 17, 2023 at 07:09:15PM -0500, Mike Christie wrote:
> This is a modified version of Linus's patch which has vhost_task
> use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.
> 
> I renamed the ignore_signals to block_signals based on Linus's comment
> where it aligns with what we are doing with the siginitsetinv
> p->blocked use and no longer calling ignore_signals.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---

Yes, much nicer than what this was before,
Acked-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
                   ` (7 preceding siblings ...)
  2023-05-18  0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
@ 2023-05-18  8:25 ` Christian Brauner
  2023-05-18  8:40   ` Christian Brauner
  2023-05-18 14:30   ` Christian Brauner
  8 siblings, 2 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-18  8:25 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> normal testing, haven't converted vsock and vdpa, and I know you guys
> will not like the first patch. However, I think it better shows what

Just to summarize, the core idea behind my proposal is that no signal
handling changes are needed unless there's a bug in the current way
io_uring workers already work. All that should be needed is
s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.

If you follow my proposal then vhost and io_uring workers should almost
collapse into the same concept. Specifically, io_uring workers and vhost
workers should behave the same when it comes to handling signals.

See 
https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner


> we need from the signal code and how we can support signals in the
> vhost_task layer.
> 
> Note that I took the super simple route and kicked off some work to
> the system workqueue. We can do more invasive approaches:
> 1. Modify the vhost drivers so they can check for IO completions using
> a non-blocking interface. We then don't need to run from the system
> workqueue and can run from the vhost_task.
> 
> 2. We could drop patch 1 and just say we are doing a polling type
> of approach. We then modify the vhost layer similar to #1 where we
> can check for completions using a non-blocking interface and use
> the vhost_task task.

My preference would be to do whatever is the minimal thing now and has
the least bug potential and is the easiest to review for us non-vhost
experts. Then you can take all the time to rework and improve the vhost
infra based on the possibilities that using user workers offers. Plus,
that can easily happen in the next kernel cycle.

Remember, that we're trying to fix a regression here. A regression on an
unreleased kernel but still.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-05-18  8:25 ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
@ 2023-05-18  8:40   ` Christian Brauner
  2023-05-18 14:30   ` Christian Brauner
  1 sibling, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-18  8:40 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
> 
> Just to summarize, the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> 
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
> 
> See 
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
> 
> 
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> > 
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> > 
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
> 
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
> 
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.

It's a public holiday here today so I'll try to find time to review this
tomorrow.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
  2023-05-18  0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
@ 2023-05-18 14:18   ` Christian Brauner
  2023-05-18 15:03     ` Mike Christie
  0 siblings, 1 reply; 98+ messages in thread
From: Christian Brauner @ 2023-05-18 14:18 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Wed, May 17, 2023 at 07:09:17PM -0500, Mike Christie wrote:
> When the vhost_task gets a SIGKILL we want to stop new work from being
> queued and also wait for and handle completions for running work. For the
> latter, we still need to use the vhost_task to handle the completing work
> so we can't just exit right away. But, this has us kick off the stopping
> and flushing/stopping of the device/vhost_task/worker to the system work
> queue while the vhost_task handles completions. When all completions are
> done we will then do vhost_task_stop and we will exit.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/net.c   |  2 +-
>  drivers/vhost/scsi.c  |  4 ++--
>  drivers/vhost/test.c  |  3 ++-
>  drivers/vhost/vdpa.c  |  2 +-
>  drivers/vhost/vhost.c | 48 ++++++++++++++++++++++++++++++++++++-------
>  drivers/vhost/vhost.h | 10 ++++++++-
>  drivers/vhost/vsock.c |  4 ++--
>  7 files changed, 58 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8557072ff05e..90c25127b3f8 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
>  		       UIO_MAXIOV + VHOST_NET_BATCH,
>  		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
> -		       NULL);
> +		       NULL, NULL);
>  
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index bb10fa4bb4f6..40f9135e1a62 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>  		vqs[i] = &vs->vqs[i].vq;
>  		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
>  	}
> -	vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV,
> -		       VHOST_SCSI_WEIGHT, 0, true, NULL);
> +	vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
> +		       true, NULL, NULL);
>  
>  	vhost_scsi_init_inflight(vs, NULL);
>  
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index 42c955a5b211..11a2823d7532 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -120,7 +120,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)
>  	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>  	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
>  	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> -		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
> +		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL,
> +		       NULL);
>  
>  	f->private_data = n;
>  
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 8c1aefc865f0..de9a83ecb70d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
>  		vqs[i]->handle_kick = handle_vq_kick;
>  	}
>  	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
> -		       vhost_vdpa_process_iotlb_msg);
> +		       vhost_vdpa_process_iotlb_msg, NULL);
>  
>  	r = vhost_vdpa_alloc_domain(v);
>  	if (r)
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1ba9e068b2ab..4163c86db50c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  static int vhost_worker(void *data)
>  {
>  	struct vhost_worker *worker = data;
> +	struct vhost_dev *dev = worker->dev;
>  	struct vhost_work *work, *work_next;
>  	struct llist_node *node;
>  
> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
>  		if (!node) {
>  			schedule();
>  			/*
> -			 * When we get a SIGKILL our release function will
> -			 * be called. That will stop new IOs from being queued
> -			 * and check for outstanding cmd responses. It will then
> -			 * call vhost_task_stop to exit us.
> +			 * When we get a SIGKILL we kick off a work to
> +			 * run the driver's helper to stop new work and
> +			 * handle completions. When they are done they will
> +			 * call vhost_task_stop to tell us to exit.
>  			 */
> -			vhost_task_get_signal();
> +			if (vhost_task_get_signal())
> +				schedule_work(&dev->destroy_worker);
>  		}

I'm pretty sure you still need to actually call exit here. Basically
mirror what's done in io_worker_exit() minus the io specific bits.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-05-18  8:25 ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
  2023-05-18  8:40   ` Christian Brauner
@ 2023-05-18 14:30   ` Christian Brauner
  1 sibling, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-18 14:30 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
> 
> Just to summarize, the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> 
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
> 
> See 
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
> 
> 
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> > 
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> > 
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
> 
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
> 
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.

Just two more thoughts:

The following places currently check for PF_IO_WORKER:

arch/x86/include/asm/fpu/sched.h: !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
arch/x86/kernel/fpu/context.h:    if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
arch/x86/kernel/fpu/core.c:       if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&

Both PF_KTHREAD and PF_IO_WORKER don't need TIF_NEED_FPU_LOAD because
they never return to userspace. But that's not specific to
PF_IO_WORKERs. Please generalize this to just check for PF_USER_WORKER
via a simple s/PF_IO_WORKER/PF_USER_WORKER/g in these places.

Another thing, in the sched code we have hooks into sched_submit_work()
and sched_update_worker() specific to PF_IO_WORKERs. But again, I don't
think this needs to be special to PF_IO_WORKERS. This might be
generally useful for PF_USER_WORKER. So we should probably generalize
this and have a generic user_worker_sleeping() and user_worker_running()
helper that figures out internally what specific helper to call. That's
not something that needs to be done right now though since I don't think
vhost needs this functionality.
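
Something like this purely illustrative sketch (none of these helpers
exist yet; sched_submit_work()/sched_update_worker() would call them
instead of the io_wq_* helpers directly):

	/* illustrative only, not proposed code */
	static inline void user_worker_sleeping(struct task_struct *tsk)
	{
		if (tsk->flags & PF_IO_WORKER)
			io_wq_worker_sleeping(tsk);
		/* a vhost/PF_USER_WORKER hook could be added here later */
	}

	static inline void user_worker_running(struct task_struct *tsk)
	{
		if (tsk->flags & PF_IO_WORKER)
			io_wq_worker_running(tsk);
	}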

But we should generalize this for the next development cycle so we have
this all nice and clean when someone actually needs this. Overall this
will mean that there would only be a single place left where
PF_IO_WORKER would need to be checked and that's in io_uring code
itself. And if we do things just right we might not even need that
PF_IO_WORKER flag anymore at all. But again, that's just notes for next
cycle.

Thoughts? Rotten apples?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
  2023-05-18 14:18   ` Christian Brauner
@ 2023-05-18 15:03     ` Mike Christie
  2023-05-18 15:09       ` Christian Brauner
  2023-05-18 18:38       ` Eric W. Biederman
  0 siblings, 2 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-18 15:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On 5/18/23 9:18 AM, Christian Brauner wrote:
>> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
>>  		if (!node) {
>>  			schedule();
>>  			/*
>> -			 * When we get a SIGKILL our release function will
>> -			 * be called. That will stop new IOs from being queued
>> -			 * and check for outstanding cmd responses. It will then
>> -			 * call vhost_task_stop to exit us.
>> +			 * When we get a SIGKILL we kick off a work to
>> +			 * run the driver's helper to stop new work and
>> +			 * handle completions. When they are done they will
>> +			 * call vhost_task_stop to tell us to exit.
>>  			 */
>> -			vhost_task_get_signal();
>> +			if (vhost_task_get_signal())
>> +				schedule_work(&dev->destroy_worker);
>>  		}
> 
> I'm pretty sure you still need to actually call exit here. Basically
> mirror what's done in io_worker_exit() minus the io specific bits.

We do call do_exit(). Once destroy_worker has flushed the device and
all outstanding IO has completed, it calls vhost_task_stop(). vhost_worker()
above then breaks out of the loop and returns, and vhost_task_fn() does
do_exit().
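
The tail end of that chain is vhost_task_fn() in kernel/vhost_task.c,
which looks roughly like this (trimmed):

	static int vhost_task_fn(void *data)
	{
		struct vhost_task *vtsk = data;
		int ret;

		/* runs vhost_worker() until vhost_task_stop() is called */
		ret = vtsk->fn(vtsk->data);
		complete(&vtsk->exited);
		do_exit(ret);
	}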

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
  2023-05-18 15:03     ` Mike Christie
@ 2023-05-18 15:09       ` Christian Brauner
  2023-05-18 18:38       ` Eric W. Biederman
  1 sibling, 0 replies; 98+ messages in thread
From: Christian Brauner @ 2023-05-18 15:09 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Thu, May 18, 2023 at 10:03:32AM -0500, Mike Christie wrote:
> On 5/18/23 9:18 AM, Christian Brauner wrote:
> >> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
> >>  		if (!node) {
> >>  			schedule();
> >>  			/*
> >> -			 * When we get a SIGKILL our release function will
> >> -			 * be called. That will stop new IOs from being queued
> >> -			 * and check for outstanding cmd responses. It will then
> >> -			 * call vhost_task_stop to exit us.
> >> +			 * When we get a SIGKILL we kick off a work to
> >> +			 * run the driver's helper to stop new work and
> >> +			 * handle completions. When they are done they will
> >> +			 * call vhost_task_stop to tell us to exit.
> >>  			 */
> >> -			vhost_task_get_signal();
> >> +			if (vhost_task_get_signal())
> >> +				schedule_work(&dev->destroy_worker);
> >>  		}
> > 
> > I'm pretty sure you still need to actually call exit here. Basically
> > mirror what's done in io_worker_exit() minus the io specific bits.
> 
> We do call do_exit(). Once destroy_worker has flushed the device and
> all outstanding IO has completed, it calls vhost_task_stop(). vhost_worker()
> above then breaks out of the loop and returns, and vhost_task_fn() does
> do_exit().

Ah, that callchain wasn't obvious. Thanks.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18  3:49   ` Eric W. Biederman
@ 2023-05-18 15:21     ` Mike Christie
  2023-05-18 16:25       ` Oleg Nesterov
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18 15:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: oleg, linux, nicolas.dichtel, axboe, torvalds, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha, brauner

On 5/17/23 10:49 PM, Eric W. Biederman wrote:
> 
> Long story short.
> 
> In the patch below the first hunk is a noop.
> 
> The code you are bypassing was added to ensure that process termination
> (aka SIGKILL) is processed before any other signals.  Other than signal
> processing order there are not any substantive differences in the two
> code paths.  With all signals except SIGSTOP == 19 and SIGKILL == 9
> blocked SIGKILL should always be processed before SIGSTOP.
> 
> Can you try patch with just the last hunk that does
> s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough?
> 

If I just have the last hunk and then we get SIGKILL what happens is
in code like:

vhost_worker()

	schedule()
	if (has IO)
		handle_IO()

The schedule() calls will hit the signal_pending_state() check for
signal_pending or __fatal_signal_pending, so instead of waiting
for whatever wake_up call we normally wait for, we tend to just
return immediately. If you just run Qemu (the parent of the vhost_task)
and send SIGKILL, then sometimes the vhost_task just spins and it
looks like the task has taken over the CPU (this is what I hit
when I tested Linus's patch).
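
For reference, the check that makes schedule() return right away is
roughly this (include/linux/sched/signal.h, trimmed):

	static inline int signal_pending_state(unsigned int state,
					       struct task_struct *p)
	{
		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
			return 0;
		if (!signal_pending(p))
			return 0;

		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
	}

so with a fatal signal pending and TASK_INTERRUPTIBLE set, __schedule()
just leaves the task runnable instead of blocking.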

With the first hunk of the patch, we will end up dequeuing the SIGKILL
and clearing TIF_SIGPENDING, so the vhost_task can still do some work
before it exits.

In the other patches we do:

if (get_signal(ksig))
	start_exit_cleanup_by_stopping_newIO()
	flush running IO()
	exit()

But to do the flush running IO() part of this I need to wait for it so
that's why I wanted to be able to dequeue the SIGKILL and clear the
TIF_SIGPENDING bit.

Or I don't need this specifically. In patch 0/8 I said I knew you guys
would not like it :) If I just have a:

if (fatal_signal())
	clear_fatal_signal()

then it would work for me.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18  8:08   ` Christian Brauner
@ 2023-05-18 15:27     ` Mike Christie
  2023-05-18 17:07       ` Christian Brauner
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18 15:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On 5/18/23 3:08 AM, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:13PM -0500, Mike Christie wrote:
>> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
>> set when we are dealing with PF_USER_WORKER tasks.
>>
>> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
>> We can easily stop new work/IO from being queued to the vhost_task, but
>> for IO that's already been sent to something like the block layer we
>> need to wait for the response then process it. These types of IO
>> completions use the vhost_task to process the completion so we can't
>> exit immediately.
>>
>> We need to wait for and then handle those completions from the
>> vhost_task, but when we have a SIGKILL pending, functions like
>> schedule() return immediately so we can't wait like normal. Functions
>> like vhost_worker() degrade to just a while(1); loop.
>>
>> This patch has get_signal drop down to the normal code path when
>> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
>> there is a SIGKILL but still perform some blocking cleanup.
>>
>> Note that the chunk I'm now bypassing does:
>>
>> sigdelset(&current->pending.signal, SIGKILL);
>>
>> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
>> group_exec_task we are already doing that on the threads in the
>> group.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> ---
> 
> I think you just got confused by the original discussion that was split
> into two separate threads:
> 
> (1) The discussion based on your original proposal to adjust the signal
>     handling logic to accommodate vhost workers as they are right now.
>     That's where Oleg jumped in.
> (2) My request - which you did in this series - of rewriting vhost
>     workers to behave more like io_uring workers.
> 
> Both problems are orthogonal. The gist of my proposal is to avoid (1) by
> doing (2). So the only change that's needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ which is pretty obvious as io_uring
> workers and vhost workers now almost fully collapse into the same
> concept.
> 
> So forget (1). If additional signal patches are needed as discussed in
> (1) then it must be because of a bug that would affect io_uring workers
> today.

Maybe I didn't exactly misunderstand you. I did patch 1/8 to show issues I
hit when doing 2-8. See my reply to Eric's question about what I'm
hitting and why using only the last part of the patch did not work for me:

https://lore.kernel.org/lkml/20230518000920.191583-2-michael.christie@oracle.com/T/#mc6286d1a42c79761248ba55f1dd7a433379be6d1

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 15:21     ` Mike Christie
@ 2023-05-18 16:25       ` Oleg Nesterov
  2023-05-18 16:42         ` Mike Christie
  0 siblings, 1 reply; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-18 16:25 UTC (permalink / raw)
  To: Mike Christie
  Cc: Eric W. Biederman, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

I too do not understand the 1st change in this patch ...

On 05/18, Mike Christie wrote:
>
> In the other patches we do:
>
> if (get_signal(ksig))
> 	start_exit_cleanup_by_stopping_newIO()
> 	flush running IO()
> 	exit()
>
> But to do the flush running IO() part of this I need to wait for it so
> that's why I wanted to be able to dequeue the SIGKILL and clear the
> TIF_SIGPENDING bit.

But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?

	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
	     signal->group_exec_task) {
		clear_siginfo(&ksig->info);
		ksig->info.si_signo = signr = SIGKILL;
		sigdelset(&current->pending.signal, SIGKILL);

this "dequeues" SIGKILL,

		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
			&sighand->action[SIGKILL - 1]);
		recalc_sigpending();

this clears TIF_SIGPENDING.

> Or I don't need this specifically. In patch 0/8 I said I knew you guys
> would not like it :) If I just have a:
>
> if (fatal_signal())
> 	clear_fatal_signal()

see above...


Well... I think this code is actually wrong if SIGSTOP is pending and
the task is PF_IO_WORKER, but this is also true for io-threads so we can
discuss this separately.

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 16:25       ` Oleg Nesterov
@ 2023-05-18 16:42         ` Mike Christie
  2023-05-18 17:04           ` Oleg Nesterov
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18 16:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

On 5/18/23 11:25 AM, Oleg Nesterov wrote:
> I too do not understand the 1st change in this patch ...
> 
> On 05/18, Mike Christie wrote:
>>
>> In the other patches we do:
>>
>> if (get_signal(ksig))
>> 	start_exit_cleanup_by_stopping_newIO()
>> 	flush running IO()
>> 	exit()
>>
>> But to do the flush running IO() part of this I need to wait for it so
>> that's why I wanted to be able to dequeue the SIGKILL and clear the
>> TIF_SIGPENDING bit.
> 
> But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
> 
> 	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> 	     signal->group_exec_task) {
> 		clear_siginfo(&ksig->info);
> 		ksig->info.si_signo = signr = SIGKILL;
> 		sigdelset(&current->pending.signal, SIGKILL);
> 
> this "dequeues" SIGKILL,
> 
> 		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
> 			&sighand->action[SIGKILL - 1]);
> 		recalc_sigpending();
> 
> this clears TIF_SIGPENDING.
> 

I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
I'll dig into why.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 16:42         ` Mike Christie
@ 2023-05-18 17:04           ` Oleg Nesterov
  2023-05-18 18:28             ` Eric W. Biederman
  0 siblings, 1 reply; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-18 17:04 UTC (permalink / raw)
  To: Mike Christie
  Cc: Eric W. Biederman, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

On 05/18, Mike Christie wrote:
>
> On 5/18/23 11:25 AM, Oleg Nesterov wrote:
> > I too do not understand the 1st change in this patch ...
> >
> > On 05/18, Mike Christie wrote:
> >>
> >> In the other patches we do:
> >>
> >> if (get_signal(ksig))
> >> 	start_exit_cleanup_by_stopping_newIO()
> >> 	flush running IO()
> >> 	exit()
> >>
> >> But to do the flush running IO() part of this I need to wait for it so
> >> that's why I wanted to be able to dequeue the SIGKILL and clear the
> >> TIF_SIGPENDING bit.
> >
> > But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
> >
> > 	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> > 	     signal->group_exec_task) {
> > 		clear_siginfo(&ksig->info);
> > 		ksig->info.si_signo = signr = SIGKILL;
> > 		sigdelset(&current->pending.signal, SIGKILL);
> >
> > this "dequeues" SIGKILL,

OOPS. this doesn't remove SIGKILL from current->signal->shared_pending

> >
> > 		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
> > 			&sighand->action[SIGKILL - 1]);
> > 		recalc_sigpending();
> >
> > this clears TIF_SIGPENDING.

No, I was wrong, recalc_sigpending() won't clear TIF_SIGPENDING if
SIGKILL is in signal->shared_pending

> I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
> I'll dig into why.

See above, sorry for confusion.
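
For reference, recalc_sigpending_tsk() checks both the per-task and the
shared pending sets, roughly (kernel/signal.c, trimmed):

	if ((t->jobctl & JOBCTL_PENDING_MASK) ||
	    PENDING(&t->pending, &t->blocked) ||
	    PENDING(&t->signal->shared_pending, &t->blocked)) {
		set_tsk_thread_flag(t, TIF_SIGPENDING);
		return true;
	}
	return false;

so as long as SIGKILL sits in signal->shared_pending, TIF_SIGPENDING
stays set.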



And again, there is another problem with SIGSTOP. To simplify, suppose
a PF_IO_WORKER thread does something like

	while (signal_pending(current))
		get_signal(...);

this will loop forever if (SIGNAL_GROUP_EXIT || group_exec_task) and
SIGSTOP is pending.

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 15:27     ` Mike Christie
@ 2023-05-18 17:07       ` Christian Brauner
  2023-05-18 18:08         ` Oleg Nesterov
  0 siblings, 1 reply; 98+ messages in thread
From: Christian Brauner @ 2023-05-18 17:07 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Thu, May 18, 2023 at 10:27:12AM -0500, Mike Christie wrote:
> On 5/18/23 3:08 AM, Christian Brauner wrote:
> > On Wed, May 17, 2023 at 07:09:13PM -0500, Mike Christie wrote:
> >> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> >> set when we are dealing with PF_USER_WORKER tasks.
> >>
> >> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> >> We can easily stop new work/IO from being queued to the vhost_task, but
> >> for IO that's already been sent to something like the block layer we
> >> need to wait for the response then process it. These types of IO
> >> completions use the vhost_task to process the completion so we can't
> >> exit immediately.
> >>
> >> We need to wait for and then handle those completions from the
> >> vhost_task, but when we have a SIGKILL pending, functions like
> >> schedule() return immediately so we can't wait like normal. Functions
> >> like vhost_worker() degrade to just a while(1); loop.
> >>
> >> This patch has get_signal drop down to the normal code path when
> >> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> >> there is a SIGKILL but still perform some blocking cleanup.
> >>
> >> Note that the chunk I'm now bypassing does:
> >>
> >> sigdelset(&current->pending.signal, SIGKILL);
> >>
> >> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> >> group_exec_task we are already doing that on the threads in the
> >> group.
> >>
> >> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >> ---
> > 
> > I think you just got confused by the original discussion that was split
> > into two separate threads:
> > 
> > (1) The discussion based on your original proposal to adjust the signal
> >     handling logic to accommodate vhost workers as they are right now.
> >     That's where Oleg jumped in.
> > (2) My request - which you did in this series - of rewriting vhost
> >     workers to behave more like io_uring workers.
> > 
> > Both problems are orthogonal. The gist of my proposal is to avoid (1) by
> > doing (2). So the only change that's needed is
> > s/PF_IO_WORKER/PF_USER_WORKER/ which is pretty obvious as io_uring
> > workers and vhost workers now almost fully collapse into the same
> > concept.
> > 
> > So forget (1). If additional signal patches are needed as discussed in
> > (1) then it must be because of a bug that would affect io_uring workers
> > today.
> 
> Maybe I didn't exactly misunderstand you. I did patch 1/8 to show issues I
> hit when doing 2-8. See my reply to Eric's question about what I'm
> hitting and why using only the last part of the patch did not work for me:
> 
> https://lore.kernel.org/lkml/20230518000920.191583-2-michael.christie@oracle.com/T/#mc6286d1a42c79761248ba55f1dd7a433379be6d1

Yeah, but these are issues that exist with PF_IO_WORKER then too which
was sort of my point.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 17:07       ` Christian Brauner
@ 2023-05-18 18:08         ` Oleg Nesterov
  2023-05-18 18:12           ` Christian Brauner
  0 siblings, 1 reply; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-18 18:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mike Christie, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On 05/18, Christian Brauner wrote:
>
> Yeah, but these are issues that exist with PF_IO_WORKER then too

This was my thought too but I am starting to think I was wrong.

Of course I don't understand the code in io_uring/ but it seems
that it always breaks the IO loops if get_signal() returns SIGKILL.

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 18:08         ` Oleg Nesterov
@ 2023-05-18 18:12           ` Christian Brauner
  2023-05-18 18:23             ` Oleg Nesterov
  0 siblings, 1 reply; 98+ messages in thread
From: Christian Brauner @ 2023-05-18 18:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Christie, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On Thu, May 18, 2023 at 08:08:10PM +0200, Oleg Nesterov wrote:
> On 05/18, Christian Brauner wrote:
> >
> > Yeah, but these are issues that exist with PF_IO_WORKER then too
> 
> This was my thought too but I am starting to think I was wrong.
> 
> Of course I don't understand the code in io_uring/ but it seems
> that it always breaks the IO loops if get_signal() returns SIGKILL.

Yeah, it does and I think Mike has a point that vhost could be running
into an issue here that io_uring currently does avoid. But I don't think
we should rely on that.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 18:12           ` Christian Brauner
@ 2023-05-18 18:23             ` Oleg Nesterov
  0 siblings, 0 replies; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-18 18:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mike Christie, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

On 05/18, Christian Brauner wrote:
>
> On Thu, May 18, 2023 at 08:08:10PM +0200, Oleg Nesterov wrote:
> > On 05/18, Christian Brauner wrote:
> > >
> > > Yeah, but these are issues that exist with PF_IO_WORKER then too
> >
> > This was my thought too but I am starting to think I was wrong.
> >
> > Of course I don't understand the code in io_uring/ but it seems
> > that it always breaks the IO loops if get_signal() returns SIGKILL.
>
> Yeah, it does and I think Mike has a point that vhost could be running
> into an issue here that io_uring currently does avoid. But I don't think
> we should rely on that.

So what do you propose?

Unless (quite possibly) I am confused again, unlike io_uring vhost can't
tolerate signal_pending() == T in the cleanup-after-SIGKILL paths?

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 17:04           ` Oleg Nesterov
@ 2023-05-18 18:28             ` Eric W. Biederman
  2023-05-18 22:57               ` Mike Christie
  2023-05-22 13:30               ` Oleg Nesterov
  0 siblings, 2 replies; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-18 18:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mike Christie, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

Oleg Nesterov <oleg@redhat.com> writes:

> On 05/18, Mike Christie wrote:
>>
>> On 5/18/23 11:25 AM, Oleg Nesterov wrote:
>> > I too do not understand the 1st change in this patch ...
>> >
>> > On 05/18, Mike Christie wrote:
>> >>
>> >> In the other patches we do:
>> >>
>> >> if (get_signal(ksig))
>> >> 	start_exit_cleanup_by_stopping_newIO()
>> >> 	flush running IO()
>> >> 	exit()
>> >>
>> >> But to do the flush running IO() part of this I need to wait for it so
>> >> that's why I wanted to be able to dequeue the SIGKILL and clear the
>> >> TIF_SIGPENDING bit.
>> >
>> > But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
>> >
>> > 	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>> > 	     signal->group_exec_task) {
>> > 		clear_siginfo(&ksig->info);
>> > 		ksig->info.si_signo = signr = SIGKILL;
>> > 		sigdelset(&current->pending.signal, SIGKILL);
>> >
>> > this "dequeues" SIGKILL,
>
> OOPS. this doesn't remove SIGKILL from current->signal->shared_pending

Neither does calling get_signal the first time.
But the second time get_signal is called it should work.

Leaving SIGKILL in current->signal->shared_pending when it has already
been short circuit delivered appears to be an out and out bug.

>> >
>> > 		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>> > 			&sighand->action[SIGKILL - 1]);
>> > 		recalc_sigpending();
>> >
>> > this clears TIF_SIGPENDING.
>
> No, I was wrong, recalc_sigpending() won't clear TIF_SIGPENDING if
> SIGKILL is in signal->shared_pending

That feels wrong as well.
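
(For context: recalc_sigpending_tsk() checks both the private and the
shared pending set, roughly

	/* paraphrased, not the exact kernel/signal.c source */
	if (PENDING(&t->pending, &t->blocked) ||
	    PENDING(&t->signal->shared_pending, &t->blocked) || ...)
		set_tsk_thread_flag(t, TIF_SIGPENDING);

so as long as SIGKILL sits in signal->shared_pending the flag keeps
getting set again.)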

>> I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
>> I'll dig into why.
>
> See above, sorry for confusion.
>
>
>
> And again, there is another problem with SIGSTOP. To simplify, suppose
> a PF_IO_WORKER thread does something like
>
> 	while (signal_pending(current))
> 		get_signal(...);
>
> this will loop forever if (SIGNAL_GROUP_EXIT || group_exec_task) and
> SIGSTOP is pending.

I think we want to do something like the untested diff below.

That the PF_IO_WORKER test allows get_signal to be called
after get_signal returns a fatal aka SIGKILL seems wrong.
That doesn't happen in the io_uring case, and certainly nowhere
else.

The change to complete_signal appears obviously correct although
a pending siginfo still needs to be handled.

The change to recalc_sigpending also appears mostly right, but I am not
certain that the !freezing test is in the proper place.  Nor am I
certain it won't have other surprise effects.

Still the big issue seems to be the way get_signal is connected into
these threads so that it keeps getting called.  Calling get_signal after
a fatal signal has been returned happens nowhere else and even if we fix
it today it is likely to lead to bugs in the future because whoever is
testing and updating the code is unlikely to have a vhost test case
they care about.

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..4d54718cad36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
 
 void recalc_sigpending(void)
 {
-       if (!recalc_sigpending_tsk(current) && !freezing(current))
+       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+                   !__fatal_signal_pending(current)))
                clear_thread_flag(TIF_SIGPENDING);
 
 }
@@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
                 * This signal will be fatal to the whole group.
                 */
                if (!sig_kernel_coredump(sig)) {
+                       /*
+                        * The signal is being short circuit delivered
+                        * don't leave it pending.
+                        */
+                       if (type != PIDTYPE_PID) {
+                               sigdelset(&t->signal->shared_pending,  sig);
+
                        /*
                         * Start a group exit and wake everybody up.
                         * This way we don't have other threads



Eric

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
  2023-05-18 15:03     ` Mike Christie
  2023-05-18 15:09       ` Christian Brauner
@ 2023-05-18 18:38       ` Eric W. Biederman
  1 sibling, 0 replies; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-18 18:38 UTC (permalink / raw)
  To: Mike Christie
  Cc: Christian Brauner, oleg, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha

Mike Christie <michael.christie@oracle.com> writes:

> On 5/18/23 9:18 AM, Christian Brauner wrote:
>>> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
>>>  		if (!node) {
>>>  			schedule();
>>>  			/*
>>> -			 * When we get a SIGKILL our release function will
>>> -			 * be called. That will stop new IOs from being queued
>>> -			 * and check for outstanding cmd responses. It will then
>>> -			 * call vhost_task_stop to exit us.
>>> +			 * When we get a SIGKILL we kick off a work to
>>> +			 * run the driver's helper to stop new work and
>>> +			 * handle completions. When they are done they will
>>> +			 * call vhost_task_stop to tell us to exit.
>>>  			 */
>>> -			vhost_task_get_signal();
>>> +			if (vhost_task_get_signal())
>>> +				schedule_work(&dev->destroy_worker);
>>>  		}
>> 
>> I'm pretty sure you still need to actually call exit here. Basically
>> mirror what's done in io_worker_exit() minus the io specific bits.
>
> We do call do_exit(). Once destroy_worker has flushed the device and
> all outstanding IO has completed it calls vhost_task_stop(). vhost_worker()
> above then breaks out of the loop and returns and vhost_task_fn() does
> do_exit().

I am not certain how you want to structure this but you really should
not call get_signal after it returns positive before you call do_exit.

You are in completely uncharted and untested waters calling get_signal
multiple times, when get_signal figures the proper response is to
call do_exit itself.

Eric


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 18:28             ` Eric W. Biederman
@ 2023-05-18 22:57               ` Mike Christie
  2023-05-19  4:16                 ` Eric W. Biederman
  2023-05-22 13:30               ` Oleg Nesterov
  1 sibling, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-05-18 22:57 UTC (permalink / raw)
  To: Eric W. Biederman, Oleg Nesterov
  Cc: linux, nicolas.dichtel, axboe, torvalds, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha, brauner

On 5/18/23 1:28 PM, Eric W. Biederman wrote:
> Still the big issue seems to be the way get_signal is connected into
> these threads so that it keeps getting called.  Calling get_signal after
> a fatal signal has been returned happens nowhere else and even if we fix
> it today it is likely to lead to bugs in the future because whoever is
> testing and updating the code is unlikely to have a vhost test case
> they care about.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..4d54718cad36 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>  
>  void recalc_sigpending(void)
>  {
> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
> +       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> +           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> +                   !__fatal_signal_pending(current)))
>                 clear_thread_flag(TIF_SIGPENDING);
>  
>  }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>                  * This signal will be fatal to the whole group.
>                  */
>                 if (!sig_kernel_coredump(sig)) {
> +                       /*
> +                        * The signal is being short circuit delivered
> +                        * don't leave it pending.
> +                        */
> +                       if (type != PIDTYPE_PID) {
> +                               sigdelset(&t->signal->shared_pending,  sig);
> +
>                         /*
>                          * Start a group exit and wake everybody up.
>                          * This way we don't have other threads
> 

If I change up your patch so the last part is moved down a bit to when we set t
like this:

diff --git a/kernel/signal.c b/kernel/signal.c
index 0ac48c96ab04..c976a80650db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
 
 void recalc_sigpending(void)
 {
-	if (!recalc_sigpending_tsk(current) && !freezing(current))
+	if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+	    ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+	     !__fatal_signal_pending(current)))
 		clear_thread_flag(TIF_SIGPENDING);
-
 }
 EXPORT_SYMBOL(recalc_sigpending);
 
@@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 			signal->group_exit_code = sig;
 			signal->group_stop_count = 0;
 			t = p;
+			/*
+			 * The signal is being short circuit delivered
+			 * don't leave it pending.
+			 */
+			if (type != PIDTYPE_PID) {
+				struct sigpending *pending;
+
+				pending = &t->signal->shared_pending;
+				sigdelset(&pending->signal, sig);
+			}
+
 			do {
 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
 				sigaddset(&t->pending.signal, SIGKILL);


Then get_signal() works like how Oleg mentioned it should earlier.

For vhost I just need the code below which is just Linus's patch plus a call
to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.

Note that when we get SIGKILL, the vhost file_operations->release function is called via

            do_exit -> exit_files -> put_files_struct -> close_files

and so the vhost release function starts to flush IO and stop the worker/vhost
task. In vhost_worker() then we just handle those last completions for already
running IO. When  the vhost release function detects they are done it does
vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does do_exit().
So we don't return immediately when get_signal() returns non-zero.
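
In code terms the teardown is roughly (sketch only, argument lists and
some names trimmed, not the exact source):

	/* release side, reached from do_exit() -> exit_files() */
	static int vhost_net_release(struct inode *inode, struct file *f)
	{
		struct vhost_net *n = f->private_data;

		vhost_net_stop(n, ...);		/* stop queueing new IO */
		vhost_net_flush(n);		/* wait for outstanding responses */
		vhost_dev_cleanup(&n->dev);	/* ends up calling vhost_task_stop() */
		...
	}

while vhost_worker() just keeps handling the completions that come in,
sees VHOST_TASK_FLAGS_STOP once release is done, returns, and then
vhost_task_fn() calls do_exit().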

So it works, but it sounds like you don't like vhost relying on the behavior,
and it's non standard to use get_signal() like we are. So I'm not sure how we
want to proceed.

Maybe the safest is to revert:

commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
Author: Mike Christie <michael.christie@oracle.com>
Date:   Fri Mar 10 16:03:32 2023 -0600

    vhost: use vhost_tasks for worker threads

and retry this for the next kernel when we can do proper testing and more
code review?


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
 		}
 
 		node = llist_del_all(&worker->work_list);
-		if (!node)
+		if (!node) {
 			schedule();
+			/*
+			 * When we get a SIGKILL our release function will
+			 * be called. That will stop new IOs from being queued
+			 * and check for outstanding cmd responses. It will then
+			 * call vhost_task_stop to exit us.
+			 */
+			vhost_task_get_signal();
+		}
 
 		node = llist_reverse_order(node);
 		/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
+	u32 block_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..79bf0ed4ded0 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
 bool vhost_task_should_stop(struct vhost_task *vtsk);
+void vhost_task_get_signal(void);
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
 		p->flags |= PF_KTHREAD;
 	if (args->user_worker)
 		p->flags |= PF_USER_WORKER;
-	if (args->io_thread) {
-		/*
-		 * Mark us an IO worker, and block any signal that isn't
-		 * fatal or STOP
-		 */
+	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
+	if (args->block_signals)
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-	}
 
 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (args->ignore_signals)
-		ignore_signals(p);
-
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn_arg		= arg,
 		.io_thread	= 1,
 		.user_worker	= 1,
+		.block_signals	= 1,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..0ac48c96ab04 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2861,11 +2861,11 @@ bool get_signal(struct ksignal *ksig)
 		}
 
 		/*
-		 * PF_IO_WORKER threads will catch and exit on fatal signals
+		 * PF_USER_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
 		 * we cannot call do_exit() on their behalf.
 		 */
-		if (current->flags & PF_IO_WORKER)
+		if (current->flags & PF_USER_WORKER)
 			goto out;
 
 		/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..82467f450f0d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
  */
 void vhost_task_stop(struct vhost_task *vtsk)
 {
-	pid_t pid = vtsk->task->pid;
-
 	set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
 	wake_up_process(vtsk->task);
 	/*
 	 * Make sure vhost_task_fn is no longer accessing the vhost_task before
-	 * freeing it below. If userspace crashed or exited without closing,
-	 * then the vhost_task->task could already be marked dead so
-	 * kernel_wait will return early.
+	 * freeing it below.
 	 */
 	wait_for_completion(&vtsk->exited);
-	/*
-	 * If we are just closing/removing a device and the parent process is
-	 * not exiting then reap the task.
-	 */
-	kernel_wait4(pid, NULL, __WCLONE, NULL);
 	kfree(vtsk);
 }
 EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -61,6 +52,25 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
 }
 EXPORT_SYMBOL_GPL(vhost_task_should_stop);
 
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * This checks if there are signals and will handle freeze requests. For
+ * SIGKILL, our file_operations->release is already being called when we
+ * see the signal, so we let release call vhost_task_stop to tell the
+ * vhost_task to exit when it's done using the task.
+ */
+void vhost_task_get_signal(void)
+{
+	struct ksignal ksig;
+
+	if (!signal_pending(current))
+		return;
+
+	get_signal(&ksig);
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
 /**
  * vhost_task_create - create a copy of a process to be used by the kernel
  * @fn: thread stack
@@ -75,13 +85,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
 		.user_worker	= 1,
 		.no_files	= 1,
-		.ignore_signals	= 1,
+		.block_signals	= 1,
 	};
 	struct vhost_task *vtsk;
 	struct task_struct *tsk;


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 22:57               ` Mike Christie
@ 2023-05-19  4:16                 ` Eric W. Biederman
  2023-05-19 23:24                   ` Mike Christie
  0 siblings, 1 reply; 98+ messages in thread
From: Eric W. Biederman @ 2023-05-19  4:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: Oleg Nesterov, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

Mike Christie <michael.christie@oracle.com> writes:

> On 5/18/23 1:28 PM, Eric W. Biederman wrote:
>> Still the big issue seems to be the way get_signal is connected into
>> these threads so that it keeps getting called.  Calling get_signal after
>> a fatal signal has been returned happens nowhere else and even if we fix
>> it today it is likely to lead to bugs in the future because whoever is
>> testing and updating the code is unlikely to have a vhost test case
>> they care about.
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8f6330f0e9ca..4d54718cad36 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>  
>>  void recalc_sigpending(void)
>>  {
>> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
>> +       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>> +           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>> +                   !__fatal_signal_pending(current)))
>>                 clear_thread_flag(TIF_SIGPENDING);
>>  
>>  }
>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>>                  * This signal will be fatal to the whole group.
>>                  */
>>                 if (!sig_kernel_coredump(sig)) {
>> +                       /*
>> +                        * The signal is being short circuit delivered
>> +                        * don't leave it pending.
>> +                        */
>> +                       if (type != PIDTYPE_PID) {
>> +                               sigdelset(&t->signal->shared_pending,  sig);
>> +
>>                         /*
>>                          * Start a group exit and wake everybody up.
>>                          * This way we don't have other threads
>> 
>
> If I change up your patch so the last part is moved down a bit to when we set t
> like this:
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0ac48c96ab04..c976a80650db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>  
>  void recalc_sigpending(void)
>  {
> -	if (!recalc_sigpending_tsk(current) && !freezing(current))
> +	if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> +	    ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> +	     !__fatal_signal_pending(current)))
>  		clear_thread_flag(TIF_SIGPENDING);
> -
Can we get rid of this suggestion to recalc_sigpending.  The more I look
at it the more I am convinced it is not safe.  In particular I believe
it is incompatible with dump_interrupted() in fs/coredump.c

The code in fs/coredump.c is the closest code we have to what you are
trying to do with vhost_worker after the session is killed.  It also
struggles with TIF_SIGPENDING getting set. 
>  }
>  EXPORT_SYMBOL(recalc_sigpending);
>  
> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>  			signal->group_exit_code = sig;
>  			signal->group_stop_count = 0;
>  			t = p;
> +			/*
> +			 * The signal is being short circuit delivered
> +			 * don't leave it pending.
> +			 */
> +			if (type != PIDTYPE_PID) {
> +				struct sigpending *pending;
> +
> +				pending = &t->signal->shared_pending;
> +				sigdelset(&pending->signal, sig);
> +			}
> +
>  			do {
>  				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>  				sigaddset(&t->pending.signal, SIGKILL);
>
>
> Then get_signal() works like how Oleg mentioned it should earlier.

I am puzzled it makes a difference as t->signal and p->signal should
point to the same thing, and in fact the code would more clearly read
sigdelset(&signal->shared_pending, sig);

But all of that seems minor.

> For vhost I just need the code below which is just Linus's patch plus a call
> to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.
>
> Note that when we get SIGKILL, the vhost file_operations->release function is called via
>
>             do_exit -> exit_files -> put_files_struct -> close_files
>
> and so the vhost release function starts to flush IO and stop the worker/vhost
> task. In vhost_worker() then we just handle those last completions for already
> running IO. When  the vhost release function detects they are done it does
> vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does do_exit().
> So we don't return immediately when get_signal() returns non-zero.
>
> So it works, but it sounds like you don't like vhost relying on the behavior,
> and it's non standard to use get_signal() like we are. So I'm not sure how we
> want to proceed.

Let me clarify my concern.

Your code modifies get_signal as:
 		/*
-		 * PF_IO_WORKER threads will catch and exit on fatal signals
+		 * PF_USER_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
 		 * we cannot call do_exit() on their behalf.
 		 */
-		if (current->flags & PF_IO_WORKER)
+		if (current->flags & PF_USER_WORKER)
 			goto out;
 		/*
 		 * Death signals, no core dump.
 		 */
 		do_group_exit(ksig->info.si_signo);
 		/* NOTREACHED */

Which means by modifying get_signal you are logically deleting the
do_group_exit from get_signal.  As far as that goes that is a perfectly
reasonable change.  The problem is you wind up calling get_signal again
after that.  That does not make sense.

I would suggest doing something like:

 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
        bool dead = false;
 
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		if (vhost_task_should_stop(worker->vtsk)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
 
+		if (!dead && signal_pending(current)) {
+			struct ksignal ksig;
+
+			dead = get_signal(&ksig);
+			if (dead) {
+				/*
+				 * When the process exits we kick off a work to
+				 * run the driver's helper to stop new work and
+				 * handle completions. When they are done they will
+				 * call vhost_task_stop to tell us to exit.
+				 */
+				schedule_work(&dev->destroy_worker);
+				clear_thread_flag(TIF_SIGPENDING);
+			}
+		}
+
 		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();

 		node = llist_reverse_order(node);
 		/* make sure flag is seen after deletion */
 		smp_wmb();
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
 			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			cond_resched();
 		}
 	}
 
 	return 0;
 }


The idea is two fold.
1) Call get_signal every time through the loop to handle SIGSTOP (to the
   process).
2) Don't call get_signal after you know the process is exiting.

With a single call to get_signal (once the process is dead) I don't
see any fundamental problems with your approach.  It is doing pretty
much what fs/coredump.c is trying to do.

*Grumble*  fs/coredump.c also struggles with TIF_SIGPENDING.  But at
least you won't be alone.


> Maybe the safest is to revert:
>
> commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
> Author: Mike Christie <michael.christie@oracle.com>
> Date:   Fri Mar 10 16:03:32 2023 -0600
>
>     vhost: use vhost_tasks for worker threads
>
> and retry this for the next kernel when we can do proper testing and more
> code review?

I can see wisdom in that.  It is always nice when you don't have
to scramble to get the code to do what you want.


What is the diff below?  It does not appear to be a revert diff.

> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a92af08e7864..1ba9e068b2ab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -349,8 +349,16 @@ static int vhost_worker(void *data)
>  		}
>  
>  		node = llist_del_all(&worker->work_list);
> -		if (!node)
> +		if (!node) {
>  			schedule();
> +			/*
> +			 * When we get a SIGKILL our release function will
> +			 * be called. That will stop new IOs from being queued
> +			 * and check for outstanding cmd responses. It will then
> +			 * call vhost_task_stop to exit us.
> +			 */
> +			vhost_task_get_signal();
> +		}
>  
>  		node = llist_reverse_order(node);
>  		/* make sure flag is seen after deletion */
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 537cbf9a2ade..249a5ece9def 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -29,7 +29,7 @@ struct kernel_clone_args {
>  	u32 io_thread:1;
>  	u32 user_worker:1;
>  	u32 no_files:1;
> -	u32 ignore_signals:1;
> +	u32 block_signals:1;
>  	unsigned long stack;
>  	unsigned long stack_size;
>  	unsigned long tls;
> diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
> index 6123c10b99cf..79bf0ed4ded0 100644
> --- a/include/linux/sched/vhost_task.h
> +++ b/include/linux/sched/vhost_task.h
> @@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  void vhost_task_start(struct vhost_task *vtsk);
>  void vhost_task_stop(struct vhost_task *vtsk);
>  bool vhost_task_should_stop(struct vhost_task *vtsk);
> +void vhost_task_get_signal(void);
>  
>  #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..9e04ab5c3946 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
>  		p->flags |= PF_KTHREAD;
>  	if (args->user_worker)
>  		p->flags |= PF_USER_WORKER;
> -	if (args->io_thread) {
> -		/*
> -		 * Mark us an IO worker, and block any signal that isn't
> -		 * fatal or STOP
> -		 */
> +	if (args->io_thread)
>  		p->flags |= PF_IO_WORKER;
> +	if (args->block_signals)
>  		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> -	}
>  
>  	if (args->name)
>  		strscpy_pad(p->comm, args->name, sizeof(p->comm));
> @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
>  	if (retval)
>  		goto bad_fork_cleanup_io;
>  
> -	if (args->ignore_signals)
> -		ignore_signals(p);
> -
>  	stackleak_task_init(p);
>  
>  	if (pid != &init_struct_pid) {
> @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  		.fn_arg		= arg,
>  		.io_thread	= 1,
>  		.user_worker	= 1,
> +		.block_signals	= 1,
>  	};
>  
>  	return copy_process(NULL, 0, node, &args);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..0ac48c96ab04 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2861,11 +2861,11 @@ bool get_signal(struct ksignal *ksig)
>  		}
>  
>  		/*
> -		 * PF_IO_WORKER threads will catch and exit on fatal signals
> +		 * PF_USER_WORKER threads will catch and exit on fatal signals
>  		 * themselves. They have cleanup that must be performed, so
>  		 * we cannot call do_exit() on their behalf.
>  		 */
> -		if (current->flags & PF_IO_WORKER)
> +		if (current->flags & PF_USER_WORKER)
>  			goto out;
>  
>  		/*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..82467f450f0d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
>   */
>  void vhost_task_stop(struct vhost_task *vtsk)
>  {
> -	pid_t pid = vtsk->task->pid;
> -
>  	set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
>  	wake_up_process(vtsk->task);
>  	/*
>  	 * Make sure vhost_task_fn is no longer accessing the vhost_task before
> -	 * freeing it below. If userspace crashed or exited without closing,
> -	 * then the vhost_task->task could already be marked dead so
> -	 * kernel_wait will return early.
> +	 * freeing it below.
>  	 */
>  	wait_for_completion(&vtsk->exited);
> -	/*
> -	 * If we are just closing/removing a device and the parent process is
> -	 * not exiting then reap the task.
> -	 */
> -	kernel_wait4(pid, NULL, __WCLONE, NULL);
>  	kfree(vtsk);
>  }
>  EXPORT_SYMBOL_GPL(vhost_task_stop);
> @@ -61,6 +52,25 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
>  }
>  EXPORT_SYMBOL_GPL(vhost_task_should_stop);
>  
> +/**
> + * vhost_task_get_signal - Check if there are pending signals
> + *
> + * This checks if there are signals and will handle freeze requests. For
> + * SIGKILL, our file_operations->release is already being called when we
> + * see the signal, so we let release call vhost_task_stop to tell the
> + * vhost_task to exit when it's done using the task.
> + */
> +void vhost_task_get_signal(void)
> +{
> +	struct ksignal ksig;
> +
> +	if (!signal_pending(current))
> +		return;
> +
> +	get_signal(&ksig);
> +}
> +EXPORT_SYMBOL_GPL(vhost_task_get_signal);
> +
>  /**
>   * vhost_task_create - create a copy of a process to be used by the kernel
>   * @fn: thread stack
> @@ -75,13 +85,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> +				  CLONE_THREAD | CLONE_SIGHAND,
>  		.exit_signal	= 0,
>  		.fn		= vhost_task_fn,
>  		.name		= name,
>  		.user_worker	= 1,
>  		.no_files	= 1,
> -		.ignore_signals	= 1,
> +		.block_signals	= 1,
>  	};
>  	struct vhost_task *vtsk;
>  	struct task_struct *tsk;

Eric

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-05-16  8:39                   ` Christian Brauner
  2023-05-16 16:24                     ` Mike Christie
@ 2023-05-19 12:15                     ` Christian Brauner
  2023-06-01  7:58                       ` Thorsten Leemhuis
  1 sibling, 1 reply; 98+ messages in thread
From: Christian Brauner @ 2023-05-19 12:15 UTC (permalink / raw)
  To: Mike Christie, Linus Torvalds
  Cc: oleg, linux, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha,
	Linux kernel regressions list, hch, konrad.wilk

On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
> 
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> 
> If you follow my proposal than vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> > workers should behave the same when it comes to handling signals.
> 
> See 
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
> 
> 
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> > 
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invassive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> > 
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
> 
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
> 
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.

On Tue, May 16, 2023 at 10:40:01AM +0200, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> > On 5/15/23 10:44 AM, Linus Torvalds wrote:
> > > On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
> > >>
> > >> So I think we will be able to address (1) and (2) by making vhost tasks
> > >> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> > >> and then having vhost handle get_signal() - as you mentioned - the same
> > >> way io uring already does. We should also remove the ignore_signals
> > >> thing completely imho. I don't think we ever want to do this with user
> > >> workers.
> > > 
> > > Right. That's what IO_URING does:
> > > 
> > >         if (args->io_thread) {
> > >                 /*
> > >                  * Mark us an IO worker, and block any signal that isn't
> > >                  * fatal or STOP
> > >                  */
> > >                 p->flags |= PF_IO_WORKER;
> > >                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> > >         }
> > > 
> > > and I really think that vhost should basically do exactly what io_uring does.
> > > 
> > > Not because io_uring fundamentally got this right - but simply because
> > > io_uring had almost all the same bugs (and then some), and what the
> > > io_uring worker threads ended up doing was to basically zoom in on
> > > "this works".
> > > 
> > > And it zoomed in on it largely by just going for "make it look as much
> > > as possible as a real user thread", because every time the kernel
> > > thread did something different, it just caused problems.
> > > 
> > > So I think the patch should just look something like the attached.
> > > Mike, can you test this on whatever vhost test-suite?
> > 
> > I tried that approach already and it doesn't work because io_uring and vhost
> > differ in that vhost drivers implement a device where each device has a vhost_task
> > and the drivers have a file_operations for the device. When the vhost_task's
> > parent gets signal like SIGKILL, then it will exit and call into the vhost
> > driver's file_operations->release function. At this time, we need to do cleanup
> 
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly similar to io_uring. That's just describing the
> current architecture which isn't a necessity afaict. And the helper
> thread could e.g., crash.
> 
> > like flush the device which uses the vhost_task. There is also the case where if
> > the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> 
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around and then vhost
> could've just stuck with kthreads in the first place.
> 
> So I'm fine with skipping over the freezing case for now but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
> 
> So, I'm afraid I'm asking some work here of you but how feasible would a
> model be where vhost_worker() similar to io_wq_worker() gracefully
> handles SIGKILL? Yes, I see there's
> 
> net.c:   .release = vhost_net_release
> scsi.c:  .release = vhost_scsi_release
> test.c:  .release = vhost_test_release
> vdpa.c:  .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
> 
> but that means you have all the basic logic in place and all of those
> drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> clean up on SIGKILL.
> 
> So it feels like this should be achievable by adding a callback to
> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
> and that all the users of vhost workers are forced to implement.
> 
> Yes, it is more work but I think that's the right thing to do and not to
> complicate our signal handling.
> 
> Worst case if this can't be done fast enough we'll have to revert the
> vhost parts. I think the user worker parts are mostly sane and are

As mentioned, if we can't settle this cleanly before -rc4 we should
revert the vhost parts unless Linus wants to have it earlier.
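
To make the callback idea quoted above a bit more concrete, a rough
sketch (all names made up, this is not a proposed interface):

	struct vhost_worker_ops {
		/*
		 * Called once from vhost_worker() when get_signal()
		 * reports a fatal signal: stop queueing new work, flush
		 * what is outstanding, then call vhost_task_stop().
		 */
		void (*killed)(struct vhost_worker *worker);
	};

and in the vhost_worker() loop something like:

	/* "dead" is a local bool in vhost_worker(), initialized to false */
	if (!dead && signal_pending(current)) {
		struct ksignal ksig;

		dead = get_signal(&ksig);
		if (dead)
			worker->ops->killed(worker);
	}

with each vhost driver (net, scsi, vsock, ...) supplying its own
->killed() implementation.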

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-19  4:16                 ` Eric W. Biederman
@ 2023-05-19 23:24                   ` Mike Christie
  0 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-05-19 23:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

On 5/18/23 11:16 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> On 5/18/23 1:28 PM, Eric W. Biederman wrote:
>>> Still the big issue seems to be the way get_signal is connected into
>>> these threads so that it keeps getting called.  Calling get_signal after
>>> a fatal signal has been returned happens nowhere else and even if we fix
>>> it today it is likely to lead to bugs in the future because whoever is
>>> testing and updating the code is unlikely to have a vhost test case
>>> they care about.
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 8f6330f0e9ca..4d54718cad36 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>>  
>>>  void recalc_sigpending(void)
>>>  {
>>> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
>>> +       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>>> +           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>>> +                   !__fatal_signal_pending(current)))
>>>                 clear_thread_flag(TIF_SIGPENDING);
>>>  
>>>  }
>>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>>>                  * This signal will be fatal to the whole group.
>>>                  */
>>>                 if (!sig_kernel_coredump(sig)) {
>>> +                       /*
>>> +                        * The signal is being short circuit delivered
>>> +                        * don't leave it pending.
>>> +                        */
>>> +                       if (type != PIDTYPE_PID) {
>>> +                               sigdelset(&t->signal->shared_pending,  sig);
>>> +
>>>                         /*
>>>                          * Start a group exit and wake everybody up.
>>>                          * This way we don't have other threads
>>>
>>
>> If I change up your patch so the last part is moved down a bit to when we set t
>> like this:
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 0ac48c96ab04..c976a80650db 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>  
>>  void recalc_sigpending(void)
>>  {
>> -	if (!recalc_sigpending_tsk(current) && !freezing(current))
>> +	if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>> +	    ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>> +	     !__fatal_signal_pending(current)))
>>  		clear_thread_flag(TIF_SIGPENDING);
>> -
> Can we get rid of this suggestion to recalc_sigpending.  The more I look
> at it the more I am convinced it is not safe.  In particular I believe
> it is incompatible with dump_interrupted() in fs/coredump.c


With your suggestion of the clear_thread_flag call in vhost_worker, I don't
need the above chunk.


> 
> The code in fs/coredump.c is the closest code we have to what you are
> trying to do with vhost_worker after the session is killed.  It also
> struggles with TIF_SIGPENDING getting set. 
>>  }
>>  EXPORT_SYMBOL(recalc_sigpending);
>>  
>> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>>  			signal->group_exit_code = sig;
>>  			signal->group_stop_count = 0;
>>  			t = p;
>> +			/*
>> +			 * The signal is being short circuit delivered
>> +			 * don't leave it pending.
>> +			 */
>> +			if (type != PIDTYPE_PID) {
>> +				struct sigpending *pending;
>> +
>> +				pending = &t->signal->shared_pending;
>> +				sigdelset(&pending->signal, sig);
>> +			}
>> +
>>  			do {
>>  				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>>  				sigaddset(&t->pending.signal, SIGKILL);
>>
>>
>> Then get_signal() works like how Oleg mentioned it should earlier.
> 
> I am puzzled it makes a difference as t->signal and p->signal should
> point to the same thing, and in fact the code would more clearly read
> sigdelset(&signal->shared_pending, sig);


Yeah either should work. The original patch had used t before it was
set so my patch just moved it down to after we set it. I just used signal
like you wrote and it works fine.


> 
> But all of that seems minor.
> 
>> For vhost I just need the code below which is just Linus's patch plus a call
>> to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.
>>
>> Note that when we get SIGKILL, the vhost file_operations->release function is called via
>>
>>             do_exit -> exit_files -> put_files_struct -> close_files
>>
>> and so the vhost release function starts to flush IO and stop the worker/vhost
>> task. In vhost_worker() then we just handle those last completions for already
>> running IO. When  the vhost release function detects they are done it does
>> vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does do_exit().
>> So we don't return immediately when get_signal() returns non-zero.
>>
>> So it works, but it sounds like you don't like vhost relying on the behavior,
>> and it's non standard to use get_signal() like we are. So I'm not sure how we
>> want to proceed.
> 
> Let me clarify my concern.
> 
> Your code modifies get_signal as:
>  		/*
> -		 * PF_IO_WORKER threads will catch and exit on fatal signals
> +		 * PF_USER_WORKER threads will catch and exit on fatal signals
>  		 * themselves. They have cleanup that must be performed, so
>  		 * we cannot call do_exit() on their behalf.
>  		 */
> -		if (current->flags & PF_IO_WORKER)
> +		if (current->flags & PF_USER_WORKER)
>  			goto out;
>  		/*
>  		 * Death signals, no core dump.
>  		 */
>  		do_group_exit(ksig->info.si_signo);
>  		/* NOTREACHED */
> 
> Which means by modifying get_signal you are logically deleting the
> do_group_exit from get_signal.  As far as that goes that is a perfectly
> reasonable change.  The problem is you wind up calling get_signal again
> after that.  That does not make sense.
> 
> I would suggest doing something like:

I see. I've run some tests today with what you suggested for vhost_worker
and your signal change, and it works for SIGKILL/STOP/CONT and freeze.

> 
> What is the diff below?  It does not appear to be a revert diff.

It was just the simplest patch needed, on top of your signal changes
(and the PF_IO_WORKER -> PF_USER_WORKER signal change), to fix the 2
regressions reported. I wanted to give the vhost devs an idea of what was
needed with your signal changes.

Let me do some more testing over the weekend and I'll post a RFC with your
signal change and the minimal changes needed to vhost to handle the 2
regressions that were reported. The vhost developers can get a better idea
of what needs to be done and they can better decide what they want to do to
proceed.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
  2023-05-18 18:28             ` Eric W. Biederman
  2023-05-18 22:57               ` Mike Christie
@ 2023-05-22 13:30               ` Oleg Nesterov
  1 sibling, 0 replies; 98+ messages in thread
From: Oleg Nesterov @ 2023-05-22 13:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Mike Christie, linux, nicolas.dichtel, axboe, torvalds,
	linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
	brauner

On 05/18, Eric W. Biederman wrote:
>
>  void recalc_sigpending(void)
>  {
> -       if (!recalc_sigpending_tsk(current) && !freezing(current))
> +       if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> +           ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> +                   !__fatal_signal_pending(current)))
>                 clear_thread_flag(TIF_SIGPENDING);
>  
>  }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>                  * This signal will be fatal to the whole group.
>                  */
>                 if (!sig_kernel_coredump(sig)) {
> +                       /*
> +                        * The signal is being short circuit delivered
> +                        * don't leave it pending.
> +                        */
> +                       if (type != PIDTYPE_PID) {
> +                               sigdelset(&t->signal->shared_pending,  sig);
> +
>                         /*
>                          * Start a group exit and wake everybody up.
>                          * This way we don't have other threads

Eric, sorry. I fail to understand this patch.

How can it help? And whom?

Perhaps we can discuss it in the context of the new series from Mike?

Oleg.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16 14:06     ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-05-26  9:03       ` Linux regression tracking #update (Thorsten Leemhuis)
  2023-06-02 11:38       ` Thorsten Leemhuis
  1 sibling, 0 replies; 98+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-05-26  9:03 UTC (permalink / raw)
  To: virtualization, linux-kernel

On 16.05.23 16:06, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 05.05.23 15:40, Nicolas Dichtel wrote:
>> On 03/02/2023 at 00:25, Mike Christie wrote:
>>> For vhost workers we use the kthread API which inherits its values from
>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>> being checked, so while tools like libvirt try to control the number of
>>> threads based on the nproc rlimit setting we can end up creating more
>>> threads than the user wanted.
>>
>> I have a question about (a side effect of?) this patch. The output of the 'ps'
>> command has changed. Here is an example:
>> [...]
> 
> Thanks for the report. This is already dealt with, but to be sure the
> issue doesn't fall through the cracks unnoticed, I'm adding it to
> regzbot, the Linux kernel regression tracking bot:
> 
> #regzbot ^introduced 6e890c5d502
> #regzbot title vhost: ps output changed and suspend fails when VMs are
> running
> #regzbot ignore-activity

#regzbot monitor:
https://lore.kernel.org/all/20230522025124.5863-1-michael.christie@oracle.com/
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-05-19 12:15                     ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
@ 2023-06-01  7:58                       ` Thorsten Leemhuis
  2023-06-01 10:18                         ` Nicolas Dichtel
  2023-06-01 10:47                         ` Christian Brauner
  0 siblings, 2 replies; 98+ messages in thread
From: Thorsten Leemhuis @ 2023-06-01  7:58 UTC (permalink / raw)
  To: Christian Brauner, Mike Christie, Linus Torvalds
  Cc: oleg, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha,
	Linux kernel regressions list, hch, konrad.wilk

On 19.05.23 14:15, Christian Brauner wrote:
> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>> will not like the first patch. However, I think it better shows what
>>
>> Just to summarize the core idea behind my proposal is that no signal
>> handling changes are needed unless there's a bug in the current way
>> io_uring workers already work. All that should be needed is
>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
[...]
>> So it feels like this should be achievable by adding a callback to
>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>> and that all the users of vhost workers are forced to implement.
>>
>> Yes, it is more work but I think that's the right thing to do and not to
>> complicate our signal handling.
>>
>> Worst case if this can't be done fast enough we'll have to revert the
>> vhost parts. I think the user worker parts are mostly sane and are
> 
> As mentioned, if we can't settle this cleanly before -rc4 we should
> revert the vhost parts unless Linus wants to have it earlier.

Meanwhile -rc5 is just a few days away and there are still a lot of
discussions in the patch-set proposed to address the issues[1]. Which is
kinda great (albeit also why I haven't given it a spin yet), but on the
other hand makes me wonder:

Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?

[1]
https://lore.kernel.org/all/20230522025124.5863-1-michael.christie@oracle.com/

Ciao, Thorsten "not sure if I'm asking because I'm affected, or because
it's my duty as regression tracker" Leemhuis

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-06-01  7:58                       ` Thorsten Leemhuis
@ 2023-06-01 10:18                         ` Nicolas Dichtel
  2023-06-01 10:47                         ` Christian Brauner
  1 sibling, 0 replies; 98+ messages in thread
From: Nicolas Dichtel @ 2023-06-01 10:18 UTC (permalink / raw)
  To: Thorsten Leemhuis, Christian Brauner, Mike Christie, Linus Torvalds
  Cc: oleg, axboe, ebiederm, linux-kernel, virtualization, mst,
	sgarzare, jasowang, stefanha, Linux kernel regressions list, hch,
	konrad.wilk

On 01/06/2023 at 09:58, Thorsten Leemhuis wrote:
[snip]
> 
> Meanwhile -rc5 is just a few days away and there are still a lot of
> discussions in the patch-set proposed to address the issues[1]. Which is
> kinda great (albeit also why I haven't given it a spin yet), but on the
> other hand makes we wonder:
> 
> Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?
At least it's time to find a way to fix this issue :)


Thank you,
Nicolas

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-06-01  7:58                       ` Thorsten Leemhuis
  2023-06-01 10:18                         ` Nicolas Dichtel
@ 2023-06-01 10:47                         ` Christian Brauner
  2023-06-01 11:29                           ` Thorsten Leemhuis
                                             ` (2 more replies)
  1 sibling, 3 replies; 98+ messages in thread
From: Christian Brauner @ 2023-06-01 10:47 UTC (permalink / raw)
  To: Thorsten Leemhuis, Mike Christie, Linus Torvalds
  Cc: oleg, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha,
	Linux kernel regressions list, hch, konrad.wilk

On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
> On 19.05.23 14:15, Christian Brauner wrote:
> > On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> >> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> >>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> >>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
> >>> normal testing, haven't converted vsock and vdpa, and I know you guys
> >>> will not like the first patch. However, I think it better shows what
> >>
> >> Just to summarize the core idea behind my proposal is that no signal
> >> handling changes are needed unless there's a bug in the current way
> >> io_uring workers already work. All that should be needed is
> >> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> [...]
> >> So it feels like this should be achievable by adding a callback to
> >> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
> >> and that all the users of vhost workers are forced to implement.
> >>
> >> Yes, it is more work but I think that's the right thing to do and not to
> >> complicate our signal handling.
> >>
> >> Worst case if this can't be done fast enough we'll have to revert the
> >> vhost parts. I think the user worker parts are mostly sane and are
> > 
> > As mentioned, if we can't settle this cleanly before -rc4 we should
> > revert the vhost parts unless Linus wants to have it earlier.
> 
> Meanwhile -rc5 is just a few days away and there are still a lot of
> discussions in the patch-set proposed to address the issues[1]. Which is
> kinda great (albeit also why I haven't given it a spin yet), but on the
> other hand makes we wonder:

You might've missed it in the thread, but it seems everyone is currently
operating under the assumption that the preferred way is to fix this
rather than revert. See the mail in [1]:

"So I'd really like to finish this. Even if we end up with a hack or
two in signal handling that we can hopefully fix up later by having
vhost fix up some of its current assumptions."

which is why no revert was sent for -rc4. And there's a temporary fix we
seem to have converged on.

@Mike, do you want to prepare an updated version of the temporary fix?
If @Linus prefers to just apply it directly, he can grab it from the
list rather than delaying it. Make sure to grab a Co-developed-by line
on this, @Mike.

Just in case we misunderstood the intention, I also prepared a revert
at the end of this mail that Linus can use.

@Thorsten, you can test it if you want. The revert only reverts the
vhost bits as the general agreement seems to be that user workers are
otherwise the path forward.

[1]: https://lore.kernel.org/lkml/CAHk-=wj4DS=2F5mW+K2P7cVqrsuGd3rKE_2k2BqnnPeeYhUCvg@mail.gmail.com

---

/* Summary */
Switching vhost workers to user workers broke existing workflows because
vhost workers started showing up in ps output, breaking various scripts.
The reason is that vhost user workers are currently spawned as separate
processes and not as threads. Revert the patches converting vhost from
kthreads to vhost workers until vhost is ready to support user workers
created as actual threads.
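
As a minimal userspace illustration of the process-vs-thread distinction the
summary describes (this sketch is not part of the revert or the series): a
fork()ed worker gets its own PID and its own line in plain ps output, while a
pthread -- CLONE_THREAD under the hood -- shares the creator's PID and only
shows up with ps -T / -L.

/* Illustration only; build with: cc -pthread ps_demo.c -o ps_demo
 * Assumes glibc >= 2.30 for gettid(). The forked child prints a new PID
 * and appears as its own entry in plain "ps"; the pthread shares the
 * parent's PID and is only visible with "ps -T" / "ps -L".
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void *thread_fn(void *arg)
{
	(void)arg;
	printf("pthread: pid=%d tid=%d\n", getpid(), gettid());
	return NULL;
}

int main(void)
{
	pthread_t t;
	pid_t child = fork();

	if (child == 0) {
		/* separate process: own PID, own ps entry */
		printf("fork:    pid=%d\n", getpid());
		_exit(0);
	}

	/* thread: same PID (same thread group), no extra ps entry */
	pthread_create(&t, NULL, thread_fn, NULL);
	pthread_join(t, NULL);
	waitpid(child, NULL, 0);
	return 0;
}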

The following changes since commit 7877cb91f1081754a1487c144d85dc0d2e2e7fc4:

  Linux 6.4-rc4 (2023-05-28 07:49:00 -0400)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/kernel/v6.4-rc4/vhost

for you to fetch changes up to b20084b6bc90012a8ccce72ef1c0050d5fd42aa8:

  Revert "vhost_task: Allow vhost layer to use copy_process" (2023-06-01 12:33:19 +0200)

----------------------------------------------------------------
kernel/v6.4-rc4/vhost

----------------------------------------------------------------
Christian Brauner (3):
      Revert "vhost: use vhost_tasks for worker threads"
      Revert "vhost: move worker thread fields to new struct"
      Revert "vhost_task: Allow vhost layer to use copy_process"

 MAINTAINERS                      |   1 -
 drivers/vhost/Kconfig            |   5 --
 drivers/vhost/vhost.c            | 124 ++++++++++++++++++++-------------------
 drivers/vhost/vhost.h            |  11 +---
 include/linux/sched/vhost_task.h |  23 --------
 kernel/Makefile                  |   1 -
 kernel/vhost_task.c              | 117 ------------------------------------
 7 files changed, 67 insertions(+), 215 deletions(-)
 delete mode 100644 include/linux/sched/vhost_task.h
 delete mode 100644 kernel/vhost_task.c

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-06-01 10:47                         ` Christian Brauner
@ 2023-06-01 11:29                           ` Thorsten Leemhuis
  2023-06-01 12:26                           ` Linus Torvalds
  2023-06-01 16:10                           ` Mike Christie
  2 siblings, 0 replies; 98+ messages in thread
From: Thorsten Leemhuis @ 2023-06-01 11:29 UTC (permalink / raw)
  To: Christian Brauner, Mike Christie, Linus Torvalds
  Cc: oleg, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha,
	Linux kernel regressions list, hch, konrad.wilk

On 01.06.23 12:47, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
>>>>> normal testing, haven't coverted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>>
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that get's called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>>
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>>
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes we wonder:
> 
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this is
> rather than revert. 

I saw that, but that was also a week ago already, so I slowly started to
wonder if plans might have changed or should change. Anyway: if that's
still the plan forward, it's totally fine for me if it's fine for Linus. :-D

BTW: I haven't sat down to test Mike's patches yet, as due to all the
discussions I assumed new ones would be coming sooner or later anyway.
If it's worth giving them a shot, please let me know.

> [...]

Thx for the update!

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-06-01 10:47                         ` Christian Brauner
  2023-06-01 11:29                           ` Thorsten Leemhuis
@ 2023-06-01 12:26                           ` Linus Torvalds
  2023-06-01 16:10                           ` Mike Christie
  2 siblings, 0 replies; 98+ messages in thread
From: Linus Torvalds @ 2023-06-01 12:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Thorsten Leemhuis, Mike Christie, oleg, nicolas.dichtel, axboe,
	ebiederm, linux-kernel, virtualization, mst, sgarzare, jasowang,
	stefanha, Linux kernel regressions list, hch, konrad.wilk

On Thu, Jun 1, 2023 at 6:47 AM Christian Brauner <brauner@kernel.org> wrote:
>
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yeah, let's apply the known "fix the immediate regression" patch wrt
vhost ps output and the freezer. That gets rid of the regression.

I think that we can - and should - then treat the questions about core
dumping and execve as separate issues.

vhost wouldn't have done execve since it's nonsensical and has never
worked anyway (it always left the old mm ref behind), and similarly
core dumping has never been an issue.

So on those things we don't have any "semantic" issues, we just need
to make sure we don't do crazy things like hang uninterruptibly.

            Linus

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
  2023-06-01 10:47                         ` Christian Brauner
  2023-06-01 11:29                           ` Thorsten Leemhuis
  2023-06-01 12:26                           ` Linus Torvalds
@ 2023-06-01 16:10                           ` Mike Christie
  2 siblings, 0 replies; 98+ messages in thread
From: Mike Christie @ 2023-06-01 16:10 UTC (permalink / raw)
  To: Christian Brauner, Thorsten Leemhuis, Linus Torvalds
  Cc: oleg, nicolas.dichtel, axboe, ebiederm, linux-kernel,
	virtualization, mst, sgarzare, jasowang, stefanha,
	Linux kernel regressions list, hch, konrad.wilk

On 6/1/23 5:47 AM, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
>>>>> normal testing, haven't coverted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that get's called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes we wonder:
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this is
> rather than revert. See the mail in [1]:
> 
> "So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions."
> 
> which is why no revert was send for -rc4. And there's a temporary fix we
> seem to have converged on.
> 
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yes, I'll send it within a couple hours.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-05-16 14:06     ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Linux regression tracking #adding (Thorsten Leemhuis)
  2023-05-26  9:03       ` Linux regression tracking #update (Thorsten Leemhuis)
@ 2023-06-02 11:38       ` Thorsten Leemhuis
  1 sibling, 0 replies; 98+ messages in thread
From: Thorsten Leemhuis @ 2023-06-02 11:38 UTC (permalink / raw)
  To: nicolas.dichtel, Mike Christie, hch, stefanha, jasowang, mst,
	sgarzare, virtualization, brauner, ebiederm, torvalds,
	konrad.wilk, linux-kernel, Linux kernel regressions list

[TLDR: This mail is primarily relevant for Linux regression tracking. A
change or fix related to the regression discussed in this thread was
posted or applied, but it did not use a Link: tag to point to the
report, as Linus and the documentation call for. Things happen, no
worries -- but now the regression tracking bot needs to be told manually
about the fix. See link in footer if these mails annoy you.]

On 16.05.23 16:06, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 05.05.23 15:40, Nicolas Dichtel wrote:
>> Le 03/02/2023 à 00:25, Mike Christie a écrit :
>>> For vhost workers we use the kthread API which inherit's its values from
>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>> being checked, so while tools like libvirt try to control the number of
>>> threads based on the nproc rlimit setting we can end up creating more
>>> threads than the user wanted.
>>
>> I have a question about (a side effect of?) this patch. The output of the 'ps'
>> command has changed. Here is an example:
>> [...]
> 
> Thanks for the report. This is already dealt with, but to be sure the
> issue doesn't fall through the cracks unnoticed, I'm adding it to
> regzbot, the Linux kernel regression tracking bot:
> 
> #regzbot ^introduced 6e890c5d502

#regzbot fix: f9010dbdce911ee1f1af1398a24b1f9f992e0080
#regzbot ignore-activity

BTW, if anyone cares: just checked, it fixes my suspend problem. Thx
everyone!

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-02-02 23:25 ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Mike Christie
  2023-05-05 13:40   ` Nicolas Dichtel
@ 2023-07-20 13:06   ` Michael S. Tsirkin
  2023-07-23  4:03     ` michael.christie
  1 sibling, 1 reply; 98+ messages in thread
From: Michael S. Tsirkin @ 2023-07-20 13:06 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> For vhost workers we use the kthread API which inherit's its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked, so while tools like libvirt try to control the number of
> threads based on the nproc rlimit setting we can end up creating more
> threads than the user wanted.
> 
> This patch has us use the vhost_task helpers which will inherit its
> values/checks from the thread that owns the device similar to if we did
> a clone in userspace. The vhost threads will now be counted in the nproc
> rlimits. And we get features like cgroups and mm sharing automatically,
> so we can remove those calls.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>



Hi Mike,
So this seems to have caused a measurable regression in networking
performance (about 30%). Take a look here; there's a zip file
with detailed measurements attached:

https://bugzilla.redhat.com/show_bug.cgi?id=2222603


Could you take a look please?
You can also ask the reporter questions there, assuming you
have or can create a (free) account.



> ---
>  drivers/vhost/vhost.c | 58 ++++++++-----------------------------------
>  drivers/vhost/vhost.h |  4 +--
>  2 files changed, 13 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 74378d241f8d..d3c7c37b69a7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -22,11 +22,11 @@
>  #include <linux/slab.h>
>  #include <linux/vmalloc.h>
>  #include <linux/kthread.h>
> -#include <linux/cgroup.h>
>  #include <linux/module.h>
>  #include <linux/sort.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/signal.h>
> +#include <linux/sched/vhost_task.h>
>  #include <linux/interval_tree_generic.h>
>  #include <linux/nospec.h>
>  #include <linux/kcov.h>
> @@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
>  		 * test_and_set_bit() implies a memory barrier.
>  		 */
>  		llist_add(&work->node, &dev->worker->work_list);
> -		wake_up_process(dev->worker->task);
> +		wake_up_process(dev->worker->vtsk->task);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_queue);
> @@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  static int vhost_worker(void *data)
>  {
>  	struct vhost_worker *worker = data;
> -	struct vhost_dev *dev = worker->dev;
>  	struct vhost_work *work, *work_next;
>  	struct llist_node *node;
>  
> -	kthread_use_mm(dev->mm);
> -
>  	for (;;) {
>  		/* mb paired w/ kthread_stop */
>  		set_current_state(TASK_INTERRUPTIBLE);
>  
> -		if (kthread_should_stop()) {
> +		if (vhost_task_should_stop(worker->vtsk)) {
>  			__set_current_state(TASK_RUNNING);
>  			break;
>  		}
> @@ -368,7 +365,7 @@ static int vhost_worker(void *data)
>  				schedule();
>  		}
>  	}
> -	kthread_unuse_mm(dev->mm);
> +
>  	return 0;
>  }
>  
> @@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
>  
> -struct vhost_attach_cgroups_struct {
> -	struct vhost_work work;
> -	struct task_struct *owner;
> -	int ret;
> -};
> -
> -static void vhost_attach_cgroups_work(struct vhost_work *work)
> -{
> -	struct vhost_attach_cgroups_struct *s;
> -
> -	s = container_of(work, struct vhost_attach_cgroups_struct, work);
> -	s->ret = cgroup_attach_task_all(s->owner, current);
> -}
> -
> -static int vhost_attach_cgroups(struct vhost_dev *dev)
> -{
> -	struct vhost_attach_cgroups_struct attach;
> -
> -	attach.owner = current;
> -	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> -	vhost_work_queue(dev, &attach.work);
> -	vhost_dev_flush(dev);
> -	return attach.ret;
> -}
> -
>  /* Caller should have device mutex */
>  bool vhost_dev_has_owner(struct vhost_dev *dev)
>  {
> @@ -580,14 +552,14 @@ static void vhost_worker_free(struct vhost_dev *dev)
>  
>  	dev->worker = NULL;
>  	WARN_ON(!llist_empty(&worker->work_list));
> -	kthread_stop(worker->task);
> +	vhost_task_stop(worker->vtsk);
>  	kfree(worker);
>  }
>  
>  static int vhost_worker_create(struct vhost_dev *dev)
>  {
>  	struct vhost_worker *worker;
> -	struct task_struct *task;
> +	struct vhost_task *vtsk;
>  	int ret;
>  
>  	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
> @@ -595,27 +567,19 @@ static int vhost_worker_create(struct vhost_dev *dev)
>  		return -ENOMEM;
>  
>  	dev->worker = worker;
> -	worker->dev = dev;
>  	worker->kcov_handle = kcov_common_handle();
>  	init_llist_head(&worker->work_list);
>  
> -	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
> -	if (IS_ERR(task)) {
> -		ret = PTR_ERR(task);
> +	vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
> +	if (!vtsk) {
> +		ret = -ENOMEM;
>  		goto free_worker;
>  	}
>  
> -	worker->task = task;
> -	wake_up_process(task); /* avoid contributing to loadavg */
> -
> -	ret = vhost_attach_cgroups(dev);
> -	if (ret)
> -		goto stop_worker;
> -
> +	worker->vtsk = vtsk;
> +	vhost_task_start(vtsk, "vhost-%d", current->pid);
>  	return 0;
>  
> -stop_worker:
> -	kthread_stop(worker->task);
>  free_worker:
>  	kfree(worker);
>  	dev->worker = NULL;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 2f6beab93784..3af59c65025e 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -16,6 +16,7 @@
>  #include <linux/irqbypass.h>
>  
>  struct vhost_work;
> +struct vhost_task;
>  typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>  
>  #define VHOST_WORK_QUEUED 1
> @@ -26,9 +27,8 @@ struct vhost_work {
>  };
>  
>  struct vhost_worker {
> -	struct task_struct	*task;
> +	struct vhost_task	*vtsk;
>  	struct llist_head	work_list;
> -	struct vhost_dev	*dev;
>  	u64			kcov_handle;
>  };
>  
> -- 
> 2.25.1


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-07-20 13:06   ` Michael S. Tsirkin
@ 2023-07-23  4:03     ` michael.christie
  2023-07-23  9:31       ` Michael S. Tsirkin
  2023-08-10 18:57       ` Michael S. Tsirkin
  0 siblings, 2 replies; 98+ messages in thread
From: michael.christie @ 2023-07-23  4:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>> For vhost workers we use the kthread API which inherit's its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting we can end up creating more
>> threads than the user wanted.
>>
>> This patch has us use the vhost_task helpers which will inherit its
>> values/checks from the thread that owns the device similar to if we did
>> a clone in userspace. The vhost threads will now be counted in the nproc
>> rlimits. And we get features like cgroups and mm sharing automatically,
>> so we can remove those calls.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> Hi Mike,
> So this seems to have caused a measureable regression in networking
> performance (about 30%). Take a look here, and there's a zip file
> with detailed measuraments attached:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> 
> 
> Could you take a look please?
> You can also ask reporter questions there assuming you
> have or can create a (free) account.
> 

Sorry for the late reply. I just got home from vacation.

The account creation link seems to be down. I keep getting an
"unable to establish SMTP connection to bz-exim-prod port 25 " error.

Can you give me Quan's email?

I think I can replicate the problem. I just need some extra info from Quan:

1. Just double check that they are using RHEL 9 on the host running the VMs.
2. The kernel config
3. Any tuning that was done. Is tuned running in the guest and/or the host
running the VMs, and what profile is being used in each?
4. Number of vCPUs and virtqueues being used.
5. Can they dump the contents of:

/sys/kernel/debug/sched

and

sysctl  -a

on the host running the VMs.

6. With the 6.4 kernel, can they also run a quick test and tell me if they set
the scheduler to batch:

ps -T -o comm,pid,tid $QEMU_THREAD

then for each vhost thread do:

chrt -b -p 0 $VHOST_THREAD

Does that end up increasing perf? When I do this I see throughput go up by
around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
It's just a difference I noticed when running some other tests.
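
For reference, a rough C equivalent of the chrt invocation above (a sketch of
mine, not something from the thread): it switches an existing thread to
SCHED_BATCH with priority 0 via sched_setscheduler(). The TID argument is
whichever vhost worker thread ID the ps command above reports.

/* Sketch only: roughly what "chrt -b -p 0 $VHOST_THREAD" does.
 * Build: cc chrt_batch.c -o chrt_batch ; run: ./chrt_batch <tid>
 * Run as root or as the owner of the vhost worker thread.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	struct sched_param param = { .sched_priority = 0 };
	pid_t tid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <vhost worker tid>\n", argv[0]);
		return 1;
	}
	tid = (pid_t)atoi(argv[1]);

	/* SCHED_BATCH: like SCHED_OTHER, but the scheduler treats the task
	 * as CPU-bound; the static priority must be 0 for this policy.
	 */
	if (sched_setscheduler(tid, SCHED_BATCH, &param)) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("tid %d switched to SCHED_BATCH\n", tid);
	return 0;
}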

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-07-23  4:03     ` michael.christie
@ 2023-07-23  9:31       ` Michael S. Tsirkin
  2023-08-10 18:57       ` Michael S. Tsirkin
  1 sibling, 0 replies; 98+ messages in thread
From: Michael S. Tsirkin @ 2023-07-23  9:31 UTC (permalink / raw)
  To: michael.christie
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.christie@oracle.com wrote:
> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> >> For vhost workers we use the kthread API which inherit's its values from
> >> and checks against the kthreadd thread. This results in the wrong RLIMITs
> >> being checked, so while tools like libvirt try to control the number of
> >> threads based on the nproc rlimit setting we can end up creating more
> >> threads than the user wanted.
> >>
> >> This patch has us use the vhost_task helpers which will inherit its
> >> values/checks from the thread that owns the device similar to if we did
> >> a clone in userspace. The vhost threads will now be counted in the nproc
> >> rlimits. And we get features like cgroups and mm sharing automatically,
> >> so we can remove those calls.
> >>
> >> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > 
> > Hi Mike,
> > So this seems to have caused a measureable regression in networking
> > performance (about 30%). Take a look here, and there's a zip file
> > with detailed measuraments attached:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> > 
> > 
> > Could you take a look please?
> > You can also ask reporter questions there assuming you
> > have or can create a (free) account.
> > 
> 
> Sorry for the late reply. I just got home from vacation.
> 
> The account creation link seems to be down. I keep getting a
> "unable to establish SMTP connection to bz-exim-prod port 25 " error.
> 
> Can you give me Quan's email?

Thanks for getting back!  I asked whether it's ok to share the email.
For now I pasted your request in the bugzilla.

> I think I can replicate the problem. I just need some extra info from Quan:
> 
> 1. Just double check that they are using RHEL 9 on the host running the VMs.
> 2. The kernel config
> 3. Any tuning that was done. Is tuned running in guest and/or host running the
> VMs and what profile is being used in each.
> 4. Number of vCPUs and virtqueues being used.
> 5. Can they dump the contents of:
> 
> /sys/kernel/debug/sched
> 
> and
> 
> sysctl  -a
> 
> on the host running the VMs.
> 
> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
> the scheduler to batch:
> 
> ps -T -o comm,pid,tid $QEMU_THREAD
> 
> then for each vhost thread do:
> 
> chrt -b -p 0 $VHOST_THREAD
> 
> Does that end up increasing perf? When I do this I see throughput go up by
> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
> It's just a difference I noticed when running some other tests.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-07-23  4:03     ` michael.christie
  2023-07-23  9:31       ` Michael S. Tsirkin
@ 2023-08-10 18:57       ` Michael S. Tsirkin
  2023-08-11 18:51         ` Mike Christie
  1 sibling, 1 reply; 98+ messages in thread
From: Michael S. Tsirkin @ 2023-08-10 18:57 UTC (permalink / raw)
  To: michael.christie
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.christie@oracle.com wrote:
> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> >> For vhost workers we use the kthread API which inherit's its values from
> >> and checks against the kthreadd thread. This results in the wrong RLIMITs
> >> being checked, so while tools like libvirt try to control the number of
> >> threads based on the nproc rlimit setting we can end up creating more
> >> threads than the user wanted.
> >>
> >> This patch has us use the vhost_task helpers which will inherit its
> >> values/checks from the thread that owns the device similar to if we did
> >> a clone in userspace. The vhost threads will now be counted in the nproc
> >> rlimits. And we get features like cgroups and mm sharing automatically,
> >> so we can remove those calls.
> >>
> >> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > 
> > Hi Mike,
> > So this seems to have caused a measureable regression in networking
> > performance (about 30%). Take a look here, and there's a zip file
> > with detailed measuraments attached:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> > 
> > 
> > Could you take a look please?
> > You can also ask reporter questions there assuming you
> > have or can create a (free) account.
> > 
> 
> Sorry for the late reply. I just got home from vacation.
> 
> The account creation link seems to be down. I keep getting a
> "unable to establish SMTP connection to bz-exim-prod port 25 " error.
> 
> Can you give me Quan's email?
> 
> I think I can replicate the problem. I just need some extra info from Quan:
> 
> 1. Just double check that they are using RHEL 9 on the host running the VMs.
> 2. The kernel config
> 3. Any tuning that was done. Is tuned running in guest and/or host running the
> VMs and what profile is being used in each.
> 4. Number of vCPUs and virtqueues being used.
> 5. Can they dump the contents of:
> 
> /sys/kernel/debug/sched
> 
> and
> 
> sysctl  -a
> 
> on the host running the VMs.
> 
> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
> the scheduler to batch:
> 
> ps -T -o comm,pid,tid $QEMU_THREAD
> 
> then for each vhost thread do:
> 
> chrt -b -p 0 $VHOST_THREAD
> 
> Does that end up increasing perf? When I do this I see throughput go up by
> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
> It's just a difference I noticed when running some other tests.


Mike, I'm unsure what to do at this point. Regressions are not nice,
but if the kernel is released with the new userspace API we won't
be able to revert. So what's the plan?

-- 
MST


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-08-10 18:57       ` Michael S. Tsirkin
@ 2023-08-11 18:51         ` Mike Christie
  2023-08-13 19:01           ` Michael S. Tsirkin
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Christie @ 2023-08-11 18:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.christie@oracle.com wrote:
>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>> For vhost workers we use the kthread API which inherit's its values from
>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>> being checked, so while tools like libvirt try to control the number of
>>>> threads based on the nproc rlimit setting we can end up creating more
>>>> threads than the user wanted.
>>>>
>>>> This patch has us use the vhost_task helpers which will inherit its
>>>> values/checks from the thread that owns the device similar to if we did
>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>> so we can remove those calls.
>>>>
>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>
>>>
>>> Hi Mike,
>>> So this seems to have caused a measureable regression in networking
>>> performance (about 30%). Take a look here, and there's a zip file
>>> with detailed measuraments attached:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
>>>
>>>
>>> Could you take a look please?
>>> You can also ask reporter questions there assuming you
>>> have or can create a (free) account.
>>>
>>
>> Sorry for the late reply. I just got home from vacation.
>>
>> The account creation link seems to be down. I keep getting a
>> "unable to establish SMTP connection to bz-exim-prod port 25 " error.
>>
>> Can you give me Quan's email?
>>
>> I think I can replicate the problem. I just need some extra info from Quan:
>>
>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>> 2. The kernel config
>> 3. Any tuning that was done. Is tuned running in guest and/or host running the
>> VMs and what profile is being used in each.
>> 4. Number of vCPUs and virtqueues being used.
>> 5. Can they dump the contents of:
>>
>> /sys/kernel/debug/sched
>>
>> and
>>
>> sysctl  -a
>>
>> on the host running the VMs.
>>
>> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
>> the scheduler to batch:
>>
>> ps -T -o comm,pid,tid $QEMU_THREAD
>>
>> then for each vhost thread do:
>>
>> chrt -b -p 0 $VHOST_THREAD
>>
>> Does that end up increasing perf? When I do this I see throughput go up by
>> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
>> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
>> It's just a difference I noticed when running some other tests.
> 
> 
> Mike I'm unsure what to do at this point. Regressions are not nice
> but if the kernel is released with the new userspace api we won't
> be able to revert. So what's the plan?
> 

I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
6.4 perform the same for me. I've tried your setup and settings with different
combos of things like tuned and irqbalance.

I can sort of force the issue. In 6.4, the vhost thread inherits its settings
from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
would then reset the sched settings. So in 6.4, if I just tune the parent differently,
I can cause different performance. If we want the 6.3 behavior we can do the patch
below.

However, I don't think you guys are hitting this, because you are just running
qemu from the normal shell and aren't doing anything fancy with the sched
settings.


diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index da35e5b7f047..f2c2638d1106 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (C) 2021 Oracle Corporation
  */
+#include <uapi/linux/sched/types.h>
 #include <linux/slab.h>
 #include <linux/completion.h>
 #include <linux/sched/task.h>
@@ -22,9 +23,16 @@ struct vhost_task {
 
 static int vhost_task_fn(void *data)
 {
+	static const struct sched_param param = { .sched_priority = 0 };
 	struct vhost_task *vtsk = data;
 	bool dead = false;
 
+	/*
+	 * Don't inherit the parent's sched info, so we maintain compat from
+	 * when we used kthreads and it reset this info.
+	 */
+	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
+
 	for (;;) {
 		bool did_work;
 







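As a quick sanity check for the compat patch above (again just a sketch of
mine, not part of the patch): the policy a given vhost worker ended up with
can be read back from userspace with sched_getscheduler() on its TID, the
same TID that ps -T reports.

/* Sketch only: print the scheduling policy of a vhost worker TID to
 * confirm it is back on SCHED_OTHER (SCHED_NORMAL) after the reset above.
 * Build: cc getpolicy.c -o getpolicy ; run: ./getpolicy <tid>
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	int policy;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <vhost worker tid>\n", argv[0]);
		return 1;
	}

	policy = sched_getscheduler((pid_t)atoi(argv[1]));
	if (policy < 0) {
		perror("sched_getscheduler");
		return 1;
	}

	printf("policy: %s\n",
	       policy == SCHED_OTHER ? "SCHED_OTHER (normal)" :
	       policy == SCHED_BATCH ? "SCHED_BATCH" :
	       policy == SCHED_FIFO  ? "SCHED_FIFO" :
	       policy == SCHED_RR    ? "SCHED_RR" : "other");
	return 0;
}
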
^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-08-11 18:51         ` Mike Christie
@ 2023-08-13 19:01           ` Michael S. Tsirkin
  2023-08-14  3:13             ` michael.christie
  0 siblings, 1 reply; 98+ messages in thread
From: Michael S. Tsirkin @ 2023-08-13 19:01 UTC (permalink / raw)
  To: Mike Christie
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
> > On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.christie@oracle.com wrote:
> >> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> >>>> For vhost workers we use the kthread API which inherit's its values from
> >>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
> >>>> being checked, so while tools like libvirt try to control the number of
> >>>> threads based on the nproc rlimit setting we can end up creating more
> >>>> threads than the user wanted.
> >>>>
> >>>> This patch has us use the vhost_task helpers which will inherit its
> >>>> values/checks from the thread that owns the device similar to if we did
> >>>> a clone in userspace. The vhost threads will now be counted in the nproc
> >>>> rlimits. And we get features like cgroups and mm sharing automatically,
> >>>> so we can remove those calls.
> >>>>
> >>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >>>
> >>>
> >>> Hi Mike,
> >>> So this seems to have caused a measureable regression in networking
> >>> performance (about 30%). Take a look here, and there's a zip file
> >>> with detailed measuraments attached:
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> >>>
> >>>
> >>> Could you take a look please?
> >>> You can also ask reporter questions there assuming you
> >>> have or can create a (free) account.
> >>>
> >>
> >> Sorry for the late reply. I just got home from vacation.
> >>
> >> The account creation link seems to be down. I keep getting a
> >> "unable to establish SMTP connection to bz-exim-prod port 25 " error.
> >>
> >> Can you give me Quan's email?
> >>
> >> I think I can replicate the problem. I just need some extra info from Quan:
> >>
> >> 1. Just double check that they are using RHEL 9 on the host running the VMs.
> >> 2. The kernel config
> >> 3. Any tuning that was done. Is tuned running in guest and/or host running the
> >> VMs and what profile is being used in each.
> >> 4. Number of vCPUs and virtqueues being used.
> >> 5. Can they dump the contents of:
> >>
> >> /sys/kernel/debug/sched
> >>
> >> and
> >>
> >> sysctl  -a
> >>
> >> on the host running the VMs.
> >>
> >> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
> >> the scheduler to batch:
> >>
> >> ps -T -o comm,pid,tid $QEMU_THREAD
> >>
> >> then for each vhost thread do:
> >>
> >> chrt -b -p 0 $VHOST_THREAD
> >>
> >> Does that end up increasing perf? When I do this I see throughput go up by
> >> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
> >> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
> >> It's just a difference I noticed when running some other tests.
> > 
> > 
> > Mike I'm unsure what to do at this point. Regressions are not nice
> > but if the kernel is released with the new userspace api we won't
> > be able to revert. So what's the plan?
> > 
> 
> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
> 6.4 perform the same for me. I've tried your setup and settings and with different
> combos of using things like tuned and irqbalance.
> 
> I can sort of force the issue. In 6.4, the vhost thread inherits it's settings
> from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
> would then reset the sched settings. So in 6.4 if I just tune the parent differently
> I can cause different performance. If we want the 6.3 behavior we can do the patch
> below.
> 
> However, I don't think you guys are hitting this because you are just running
> qemu from the normal shell and were not doing anything fancy with the sched
> settings.
> 
> 
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index da35e5b7f047..f2c2638d1106 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -2,6 +2,7 @@
>  /*
>   * Copyright (C) 2021 Oracle Corporation
>   */
> +#include <uapi/linux/sched/types.h>
>  #include <linux/slab.h>
>  #include <linux/completion.h>
>  #include <linux/sched/task.h>
> @@ -22,9 +23,16 @@ struct vhost_task {
>  
>  static int vhost_task_fn(void *data)
>  {
> +	static const struct sched_param param = { .sched_priority = 0 };
>  	struct vhost_task *vtsk = data;
>  	bool dead = false;
>  
> +	/*
> +	 * Don't inherit the parent's sched info, so we maintain compat from
> +	 * when we used kthreads and it reset this info.
> +	 */
> +	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> +
>  	for (;;) {
>  		bool did_work;
>  
> 
> 

Yes, seems unlikely; still, can you attach this to bugzilla so it can be
tested?

and, what will help you debug? any traces to enable?

Also, wasn't there another issue with a non-standard config?
Maybe if we fix that it will by chance fix this one too?

> 
> 


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
  2023-08-13 19:01           ` Michael S. Tsirkin
@ 2023-08-14  3:13             ` michael.christie
  0 siblings, 0 replies; 98+ messages in thread
From: michael.christie @ 2023-08-14  3:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: hch, stefanha, jasowang, sgarzare, virtualization, brauner,
	ebiederm, torvalds, konrad.wilk, linux-kernel

On 8/13/23 2:01 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
>> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
>>> On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.christie@oracle.com wrote:
>>>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>>>> For vhost workers we use the kthread API which inherit's its values from
>>>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>>>> being checked, so while tools like libvirt try to control the number of
>>>>>> threads based on the nproc rlimit setting we can end up creating more
>>>>>> threads than the user wanted.
>>>>>>
>>>>>> This patch has us use the vhost_task helpers which will inherit its
>>>>>> values/checks from the thread that owns the device similar to if we did
>>>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>>>> so we can remove those calls.
>>>>>>
>>>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>>
>>>>> Hi Mike,
>>>>> So this seems to have caused a measureable regression in networking
>>>>> performance (about 30%). Take a look here, and there's a zip file
>>>>> with detailed measuraments attached:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
>>>>>
>>>>>
>>>>> Could you take a look please?
>>>>> You can also ask reporter questions there assuming you
>>>>> have or can create a (free) account.
>>>>>
>>>>
>>>> Sorry for the late reply. I just got home from vacation.
>>>>
>>>> The account creation link seems to be down. I keep getting a
>>>> "unable to establish SMTP connection to bz-exim-prod port 25 " error.
>>>>
>>>> Can you give me Quan's email?
>>>>
>>>> I think I can replicate the problem. I just need some extra info from Quan:
>>>>
>>>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>>>> 2. The kernel config
>>>> 3. Any tuning that was done. Is tuned running in guest and/or host running the
>>>> VMs and what profile is being used in each.
>>>> 4. Number of vCPUs and virtqueues being used.
>>>> 5. Can they dump the contents of:
>>>>
>>>> /sys/kernel/debug/sched
>>>>
>>>> and
>>>>
>>>> sysctl  -a
>>>>
>>>> on the host running the VMs.
>>>>
>>>> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
>>>> the scheduler to batch:
>>>>
>>>> ps -T -o comm,pid,tid $QEMU_THREAD
>>>>
>>>> then for each vhost thread do:
>>>>
>>>> chrt -b -p 0 $VHOST_THREAD
>>>>
>>>> Does that end up increasing perf? When I do this I see throughput go up by
>>>> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
>>>> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
>>>> It's just a difference I noticed when running some other tests.
>>>
>>>
>>> Mike I'm unsure what to do at this point. Regressions are not nice
>>> but if the kernel is released with the new userspace api we won't
>>> be able to revert. So what's the plan?
>>>
>>
>> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
>> 6.4 perform the same for me. I've tried your setup and settings and with different
>> combos of using things like tuned and irqbalance.
>>
>> I can sort of force the issue. In 6.4, the vhost thread inherits it's settings
>> from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
>> would then reset the sched settings. So in 6.4 if I just tune the parent differently
>> I can cause different performance. If we want the 6.3 behavior we can do the patch
>> below.
>>
>> However, I don't think you guys are hitting this because you are just running
>> qemu from the normal shell and were not doing anything fancy with the sched
>> settings.
>>
>>
>> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
>> index da35e5b7f047..f2c2638d1106 100644
>> --- a/kernel/vhost_task.c
>> +++ b/kernel/vhost_task.c
>> @@ -2,6 +2,7 @@
>>  /*
>>   * Copyright (C) 2021 Oracle Corporation
>>   */
>> +#include <uapi/linux/sched/types.h>
>>  #include <linux/slab.h>
>>  #include <linux/completion.h>
>>  #include <linux/sched/task.h>
>> @@ -22,9 +23,16 @@ struct vhost_task {
>>  
>>  static int vhost_task_fn(void *data)
>>  {
>> +	static const struct sched_param param = { .sched_priority = 0 };
>>  	struct vhost_task *vtsk = data;
>>  	bool dead = false;
>>  
>> +	/*
>> +	 * Don't inherit the parent's sched info, so we maintain compat from
>> +	 * when we used kthreads and it reset this info.
>> +	 */
>> +	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>> +
>>  	for (;;) {
>>  		bool did_work;
>>  
>>
>>
> 
> yes seems unlikely, still, attach this to bugzilla so it can be
> tested?
> 
> and, what will help you debug? any traces to enable?

I added the patch and asked for a perf trace.

> 
> Also wasn't there another issue with a non standard config?
> Maybe if we fix that it will by chance fix this one too?
> 

It was when CONFIG_RT_GROUP_SCHED was enabled in the kernel config that
I would see a large drop in IOPs/throughput.

In the current 6.5-rc6 I don't see the problem anymore. I haven't had a
chance to narrow down what fixed it.




^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2023-08-14  3:14 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-02 23:25 [PATCH v11 0/8] Use copy_process in vhost layer Mike Christie
2023-02-02 23:25 ` [PATCH v11 1/8] fork: Make IO worker options flag based Mike Christie
2023-02-03  0:14   ` Linus Torvalds
2023-02-02 23:25 ` [PATCH v11 2/8] fork/vm: Move common PF_IO_WORKER behavior to new flag Mike Christie
2023-02-02 23:25 ` [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files Mike Christie
2023-02-03  0:16   ` Linus Torvalds
2023-02-02 23:25 ` [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals Mike Christie
2023-02-03  0:19   ` Linus Torvalds
2023-02-05 16:06     ` Mike Christie
2023-02-02 23:25 ` [PATCH v11 5/8] fork: allow kernel code to call copy_process Mike Christie
2023-02-02 23:25 ` [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process Mike Christie
2023-02-03  0:43   ` Linus Torvalds
2023-02-02 23:25 ` [PATCH v11 7/8] vhost: move worker thread fields to new struct Mike Christie
2023-02-02 23:25 ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Mike Christie
2023-05-05 13:40   ` Nicolas Dichtel
2023-05-05 18:22     ` Linus Torvalds
2023-05-05 22:37       ` Mike Christie
2023-05-06  1:53         ` Linus Torvalds
2023-05-08 17:13         ` Christian Brauner
2023-05-09  8:09         ` Nicolas Dichtel
2023-05-09  8:17           ` Nicolas Dichtel
2023-05-13 12:39         ` Thorsten Leemhuis
2023-05-13 15:08           ` Linus Torvalds
2023-05-15 14:23             ` Christian Brauner
2023-05-15 15:44               ` Linus Torvalds
2023-05-15 15:52                 ` Jens Axboe
2023-05-15 15:54                   ` Linus Torvalds
2023-05-15 17:23                     ` Linus Torvalds
2023-05-15 15:56                   ` Linus Torvalds
2023-05-15 22:23                 ` Mike Christie
2023-05-15 22:54                   ` Linus Torvalds
2023-05-16  3:53                     ` Mike Christie
2023-05-16 13:18                       ` Oleg Nesterov
2023-05-16 13:40                       ` Oleg Nesterov
2023-05-16 15:56                     ` Eric W. Biederman
2023-05-16 18:37                       ` Oleg Nesterov
2023-05-16 20:12                         ` Eric W. Biederman
2023-05-17 17:09                           ` Oleg Nesterov
2023-05-17 18:22                             ` Mike Christie
2023-05-16  8:39                   ` Christian Brauner
2023-05-16 16:24                     ` Mike Christie
2023-05-16 16:44                       ` Christian Brauner
2023-05-19 12:15                     ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
2023-06-01  7:58                       ` Thorsten Leemhuis
2023-06-01 10:18                         ` Nicolas Dichtel
2023-06-01 10:47                         ` Christian Brauner
2023-06-01 11:29                           ` Thorsten Leemhuis
2023-06-01 12:26                           ` Linus Torvalds
2023-06-01 16:10                           ` Mike Christie
2023-05-16 14:06     ` [PATCH v11 8/8] vhost: use vhost_tasks for worker threads Linux regression tracking #adding (Thorsten Leemhuis)
2023-05-26  9:03       ` Linux regression tracking #update (Thorsten Leemhuis)
2023-06-02 11:38       ` Thorsten Leemhuis
2023-07-20 13:06   ` Michael S. Tsirkin
2023-07-23  4:03     ` michael.christie
2023-07-23  9:31       ` Michael S. Tsirkin
2023-08-10 18:57       ` Michael S. Tsirkin
2023-08-11 18:51         ` Mike Christie
2023-08-13 19:01           ` Michael S. Tsirkin
2023-08-14  3:13             ` michael.christie
2023-02-07  8:19 ` [PATCH v11 0/8] Use copy_process in vhost layer Christian Brauner
2023-05-18  0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
2023-05-18  0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
2023-05-18  2:34   ` Eric W. Biederman
2023-05-18  3:49   ` Eric W. Biederman
2023-05-18 15:21     ` Mike Christie
2023-05-18 16:25       ` Oleg Nesterov
2023-05-18 16:42         ` Mike Christie
2023-05-18 17:04           ` Oleg Nesterov
2023-05-18 18:28             ` Eric W. Biederman
2023-05-18 22:57               ` Mike Christie
2023-05-19  4:16                 ` Eric W. Biederman
2023-05-19 23:24                   ` Mike Christie
2023-05-22 13:30               ` Oleg Nesterov
2023-05-18  8:08   ` Christian Brauner
2023-05-18 15:27     ` Mike Christie
2023-05-18 17:07       ` Christian Brauner
2023-05-18 18:08         ` Oleg Nesterov
2023-05-18 18:12           ` Christian Brauner
2023-05-18 18:23             ` Oleg Nesterov
2023-05-18  0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
2023-05-18  0:16   ` Linus Torvalds
2023-05-18  1:01     ` Mike Christie
2023-05-18  8:16       ` Christian Brauner
2023-05-18  0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
2023-05-18  8:18   ` Christian Brauner
2023-05-18  0:09 ` [RFC PATCH 4/8] vhost-net: Move vhost_net_open Mike Christie
2023-05-18  0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
2023-05-18 14:18   ` Christian Brauner
2023-05-18 15:03     ` Mike Christie
2023-05-18 15:09       ` Christian Brauner
2023-05-18 18:38       ` Eric W. Biederman
2023-05-18  0:09 ` [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works Mike Christie
2023-05-18  0:09 ` [RFC PATCH 7/8] vhost-net: " Mike Christie
2023-05-18  0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
2023-05-18  1:04   ` Mike Christie
2023-05-18  8:25 ` [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Christian Brauner
2023-05-18  8:40   ` Christian Brauner
2023-05-18 14:30   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).