All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-11-29 19:46 ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:46 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

The following patches made over Linus's tree, allow the vhost layer to do
a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
io_uring does a copy_process against its userspace app. This allows the
vhost layer's worker threads to inherit cgroups, namespaces, address
space, etc and this worker thread will also be accounted for against that
owner/parent process's RLIMIT_NPROC limit.

If you are not familiar with qemu and vhost here is more detailed
problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for every thing we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, becuase I'd need to do something
like that patch several times.

V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
  has all deps merged. Also add patch to remove create_io_thread code as
  it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
  case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
  thread creation for the wake up, vm and singal cses.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
  added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
  had that flag set. Also dropped patches that passed worker flags to
  copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
  where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
  function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
  make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
  vhost conversion patch.
~                              


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-11-29 19:46 ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:46 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

The following patches made over Linus's tree, allow the vhost layer to do
a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
io_uring does a copy_process against its userspace app. This allows the
vhost layer's worker threads to inherit cgroups, namespaces, address
space, etc and this worker thread will also be accounted for against that
owner/parent process's RLIMIT_NPROC limit.

If you are not familiar with qemu and vhost here is more detailed
problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for every thing we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, becuase I'd need to do something
like that patch several times.

V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
  has all deps merged. Also add patch to remove create_io_thread code as
  it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
  case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
  thread creation for the wake up, vm and singal cses.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
  added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
  had that flag set. Also dropped patches that passed worker flags to
  copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
  where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
  function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
  make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
  vhost conversion patch.
~                              



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH V6 01/10] fork: Make IO worker options flag based
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:46   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:46 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Christoph Hellwig

This patchset adds a couple new options to kernel_clone_args for IO thread
like/related users. Instead of adding new fields to kernel_clone_args for
each option, this moves us to a flags based approach by first converting
io_thread.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h | 4 +++-
 kernel/fork.c              | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 058d7f371e25..65d5ec3a8a04 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
+#define USER_WORKER_IO		BIT(0)
+
 struct kernel_clone_args {
 	u64 flags;
+	u32 worker_flags;
 	int __user *pidfd;
 	int __user *child_tid;
 	int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
-	int io_thread;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..6effd839d566 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2023,7 +2023,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
-	if (args->io_thread) {
+	if (args->worker_flags & USER_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
@@ -2524,7 +2524,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
-		.io_thread	= 1,
+		.worker_flags	= USER_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 01/10] fork: Make IO worker options flag based
@ 2021-11-29 19:46   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:46 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

This patchset adds a couple new options to kernel_clone_args for IO thread
like/related users. Instead of adding new fields to kernel_clone_args for
each option, this moves us to a flags based approach by first converting
io_thread.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h | 4 +++-
 kernel/fork.c              | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 058d7f371e25..65d5ec3a8a04 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
+#define USER_WORKER_IO		BIT(0)
+
 struct kernel_clone_args {
 	u64 flags;
+	u32 worker_flags;
 	int __user *pidfd;
 	int __user *child_tid;
 	int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
-	int io_thread;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..6effd839d566 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2023,7 +2023,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
-	if (args->io_thread) {
+	if (args->worker_flags & USER_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
@@ -2524,7 +2524,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
-		.io_thread	= 1,
+		.worker_flags	= USER_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:46   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:46 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

This adds a new flag, PF_USER_WORKER, that's used for behavior common to
to both PF_IO_WORKER and users like vhost which will use the new
user_worker helpers that will use the flag and are added later in this
patchset.

The common behavior PF_USER_WORKER covers is the initial frame and fpu
setup and the vm reclaim handling.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
[For m68k changes]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 arch/alpha/kernel/process.c      | 2 +-
 arch/arc/kernel/process.c        | 2 +-
 arch/arm/kernel/process.c        | 2 +-
 arch/arm64/kernel/process.c      | 2 +-
 arch/csky/kernel/process.c       | 2 +-
 arch/h8300/kernel/process.c      | 2 +-
 arch/hexagon/kernel/process.c    | 2 +-
 arch/ia64/kernel/process.c       | 2 +-
 arch/m68k/kernel/process.c       | 2 +-
 arch/microblaze/kernel/process.c | 2 +-
 arch/mips/kernel/process.c       | 2 +-
 arch/nds32/kernel/process.c      | 2 +-
 arch/nios2/kernel/process.c      | 2 +-
 arch/openrisc/kernel/process.c   | 2 +-
 arch/parisc/kernel/process.c     | 2 +-
 arch/powerpc/kernel/process.c    | 2 +-
 arch/riscv/kernel/process.c      | 2 +-
 arch/s390/kernel/process.c       | 2 +-
 arch/sh/kernel/process_32.c      | 2 +-
 arch/sparc/kernel/process_32.c   | 2 +-
 arch/sparc/kernel/process_64.c   | 2 +-
 arch/um/kernel/process.c         | 2 +-
 arch/x86/kernel/fpu/core.c       | 4 ++--
 arch/x86/kernel/process.c        | 2 +-
 arch/xtensa/kernel/process.c     | 2 +-
 include/linux/sched.h            | 1 +
 include/linux/sched/task.h       | 3 ++-
 kernel/fork.c                    | 5 ++++-
 mm/vmscan.c                      | 4 ++--
 29 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5f8527081da9..f4759e4ee4a9 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 8e90052f6f05..b409ecb1407f 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childksp[0] = 0;			/* fp */
 	childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 
 		c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index d47159f3791c..44603062d661 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -251,7 +251,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	thread->cpu_domain = get_domain();
 #endif
 
-	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+	if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
 		*childregs = *current_pt_regs();
 		childregs->ARM_r0 = 0;
 		if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index aacf2f5559a8..b4c6c41847be 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -333,7 +333,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	ptrauth_thread_init_kernel(p);
 
-	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+	if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
 
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 3d0ca22cd0e2..509f2bfe4ace 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
 	/* setup thread.sp for switch_to !!! */
 	p->thread.sp = (unsigned long)childstack;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childstack->r15 = (unsigned long) ret_from_kernel_thread;
 		childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8833fa4f5d51..050aca44ba6d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->retpc = (unsigned long) ret_from_kernel_thread;
 		childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 232dfd8956aa..40f8294c6c7c 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -73,7 +73,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 						    sizeof(*ss));
 	ss->lr = (unsigned long)ret_from_fork;
 	p->thread.switch_sp = ss;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* r24 <- fn, r25 <- arg */
 		ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 834df24a88f1..29015ebdcf1d 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -338,7 +338,7 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
 
 	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		if (unlikely(!user_stack_base)) {
 			/* fork_idle() called us */
 			return 0;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index a6030dbaa089..cbb693012b5e 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	 */
 	p->thread.fc = USER_DATA;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(frame, 0, sizeof(struct fork_frame));
 		frame->regs.sr = PS_S;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 5e2b91c1e8ce..de1da9900f7e 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -59,7 +59,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* if we're creating a new kernel thread then just zeroing all
 		 * the registers. That's OK for a brand new thread.*/
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index cbff1b974f88..6f1ed337cd41 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/*  Put the stack after the struct pt_regs.  */
 	childksp = (unsigned long) childregs;
 	p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		unsigned long status = p->thread.cp0_status;
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index 49fab9e39cbf..dba91dd1e289 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* kernel thread fn */
 		p->thread.cpu_context.r6 = stack_start;
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index f8ea522a1588..eabf3452e5e2 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -109,7 +109,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct switch_stack *childstack =
 		((struct switch_stack *)childregs) - 1;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 3c0c91bcdcba..aa110383cfa1 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -172,7 +172,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *)sp;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(kregs, 0, sizeof(struct pt_regs));
 		kregs->gpr[20] = usp; /* fn, kernel thread */
 		kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index ea3d83b6fb62..a76120e30eb4 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -197,7 +197,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	extern void * const ret_from_kernel_thread;
 	extern void * const child_return;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(cregs, 0, sizeof(struct pt_regs));
 		if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 406d7ee9e322..0b06b7809816 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1700,7 +1700,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/* Copy registers */
 	sp -= sizeof(struct pt_regs);
 	childregs = (struct pt_regs *) sp;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gpr[1] = sp + sizeof(struct pt_regs);
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..8deeb94eb51e 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -125,7 +125,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	/* p->thread holds context to be restored by __switch_to() */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* Kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gp = gp_in_global;
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index e8858b2de24b..5ce90a23b1eb 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -130,7 +130,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
 	frame->sf.gprs[9] = (unsigned long)frame;
 
 	/* Store access registers to kernel stack of new process. */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(&frame->childregs, 0, sizeof(struct pt_regs));
 		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 1c28e3cddb60..0506a739b0a8 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
 		childregs->regs[4] = arg;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index 2dc0bf9fe62e..5386e56b5e6c 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -296,7 +296,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	ti->ksp = (unsigned long) new_stack;
 	p->thread.kregs = childregs;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		extern int nwindows;
 		unsigned long psr;
 		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index f5b2cac8669f..6058b3966f71 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -594,7 +594,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 				       sizeof(struct sparc_stackf));
 	t->fpsaved[0] = 0;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(child_trap_frame, 0, child_stack_sz);
 		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
 			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 82107373ac7e..2cb57aefacbe 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 		unsigned long arg, struct task_struct * p, unsigned long tls)
 {
 	void (*handler)(void);
-	int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
+	int kthread = current->flags & (PF_KTHREAD | PF_USER_WORKER);
 	int ret = 0;
 
 	p->thread = (struct thread_struct) INIT_THREAD;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..f3fc1b1db999 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -485,10 +485,10 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags)
 	set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
 
 	/*
-	 * No FPU state inheritance for kernel threads and IO
+	 * No FPU state inheritance for kernel threads and user
 	 * worker threads.
 	 */
-	if (dst->flags & (PF_KTHREAD | PF_IO_WORKER)) {
+	if (dst->flags & (PF_KTHREAD | PF_USER_WORKER)) {
 		/* Clear out the minimal state */
 		memcpy(&dst_fpu->fpstate->regs, &init_fpstate.regs,
 		       init_fpstate_copy_size());
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 04143a653a8a..7c451f234c26 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -195,7 +195,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 #endif
 
-	if (unlikely(p->flags & PF_IO_WORKER)) {
+	if (unlikely(p->flags & PF_USER_WORKER)) {
 		/*
 		 * An IO thread is a user space thread, but it doesn't
 		 * return to ret_after_fork().
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index bd80df890b1e..00d81668ead4 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -224,7 +224,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 #error Unsupported Xtensa ABI
 #endif
 
-	if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (!(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		struct pt_regs *regs = current_pt_regs();
 		unsigned long usp = usp_thread_fn ?
 			usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..5ca45456a77a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1683,6 +1683,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_USER_WORKER		0x01000000	/* Kernel thread cloned from userspace thread */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 65d5ec3a8a04..b7861cb6a74b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,7 +18,8 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
-#define USER_WORKER_IO		BIT(0)
+#define USER_WORKER		BIT(0)
+#define USER_WORKER_IO		BIT(1)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6effd839d566..0d278ac26f2f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2032,6 +2032,9 @@ static __latent_entropy struct task_struct *copy_process(
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
 	}
 
+	if (args->worker_flags & USER_WORKER)
+		p->flags |= PF_USER_WORKER;
+
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
 	 * to any of the bad_fork_* labels. This is to avoid freeing
@@ -2524,7 +2527,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
-		.worker_flags	= USER_WORKER_IO,
+		.worker_flags	= USER_WORKER | USER_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..f504ff88a09d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1028,12 +1028,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * Do not throttle user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
+	    current->flags & (PF_USER_WORKER|PF_KTHREAD))
 		return;
 
 	/*
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag
@ 2021-11-29 19:46   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:46 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

This adds a new flag, PF_USER_WORKER, that's used for behavior common to
to both PF_IO_WORKER and users like vhost which will use the new
user_worker helpers that will use the flag and are added later in this
patchset.

The common behavior PF_USER_WORKER covers is the initial frame and fpu
setup and the vm reclaim handling.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
[For m68k changes]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 arch/alpha/kernel/process.c      | 2 +-
 arch/arc/kernel/process.c        | 2 +-
 arch/arm/kernel/process.c        | 2 +-
 arch/arm64/kernel/process.c      | 2 +-
 arch/csky/kernel/process.c       | 2 +-
 arch/h8300/kernel/process.c      | 2 +-
 arch/hexagon/kernel/process.c    | 2 +-
 arch/ia64/kernel/process.c       | 2 +-
 arch/m68k/kernel/process.c       | 2 +-
 arch/microblaze/kernel/process.c | 2 +-
 arch/mips/kernel/process.c       | 2 +-
 arch/nds32/kernel/process.c      | 2 +-
 arch/nios2/kernel/process.c      | 2 +-
 arch/openrisc/kernel/process.c   | 2 +-
 arch/parisc/kernel/process.c     | 2 +-
 arch/powerpc/kernel/process.c    | 2 +-
 arch/riscv/kernel/process.c      | 2 +-
 arch/s390/kernel/process.c       | 2 +-
 arch/sh/kernel/process_32.c      | 2 +-
 arch/sparc/kernel/process_32.c   | 2 +-
 arch/sparc/kernel/process_64.c   | 2 +-
 arch/um/kernel/process.c         | 2 +-
 arch/x86/kernel/fpu/core.c       | 4 ++--
 arch/x86/kernel/process.c        | 2 +-
 arch/xtensa/kernel/process.c     | 2 +-
 include/linux/sched.h            | 1 +
 include/linux/sched/task.h       | 3 ++-
 kernel/fork.c                    | 5 ++++-
 mm/vmscan.c                      | 4 ++--
 29 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5f8527081da9..f4759e4ee4a9 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 8e90052f6f05..b409ecb1407f 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childksp[0] = 0;			/* fp */
 	childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 
 		c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index d47159f3791c..44603062d661 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -251,7 +251,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	thread->cpu_domain = get_domain();
 #endif
 
-	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+	if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
 		*childregs = *current_pt_regs();
 		childregs->ARM_r0 = 0;
 		if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index aacf2f5559a8..b4c6c41847be 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -333,7 +333,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	ptrauth_thread_init_kernel(p);
 
-	if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+	if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
 
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 3d0ca22cd0e2..509f2bfe4ace 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
 	/* setup thread.sp for switch_to !!! */
 	p->thread.sp = (unsigned long)childstack;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childstack->r15 = (unsigned long) ret_from_kernel_thread;
 		childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8833fa4f5d51..050aca44ba6d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->retpc = (unsigned long) ret_from_kernel_thread;
 		childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 232dfd8956aa..40f8294c6c7c 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -73,7 +73,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 						    sizeof(*ss));
 	ss->lr = (unsigned long)ret_from_fork;
 	p->thread.switch_sp = ss;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* r24 <- fn, r25 <- arg */
 		ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 834df24a88f1..29015ebdcf1d 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -338,7 +338,7 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
 
 	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		if (unlikely(!user_stack_base)) {
 			/* fork_idle() called us */
 			return 0;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index a6030dbaa089..cbb693012b5e 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	 */
 	p->thread.fc = USER_DATA;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(frame, 0, sizeof(struct fork_frame));
 		frame->regs.sr = PS_S;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 5e2b91c1e8ce..de1da9900f7e 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -59,7 +59,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* if we're creating a new kernel thread then just zeroing all
 		 * the registers. That's OK for a brand new thread.*/
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index cbff1b974f88..6f1ed337cd41 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/*  Put the stack after the struct pt_regs.  */
 	childksp = (unsigned long) childregs;
 	p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		unsigned long status = p->thread.cp0_status;
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index 49fab9e39cbf..dba91dd1e289 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* kernel thread fn */
 		p->thread.cpu_context.r6 = stack_start;
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index f8ea522a1588..eabf3452e5e2 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -109,7 +109,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct switch_stack *childstack =
 		((struct switch_stack *)childregs) - 1;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 3c0c91bcdcba..aa110383cfa1 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -172,7 +172,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *)sp;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(kregs, 0, sizeof(struct pt_regs));
 		kregs->gpr[20] = usp; /* fn, kernel thread */
 		kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index ea3d83b6fb62..a76120e30eb4 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -197,7 +197,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	extern void * const ret_from_kernel_thread;
 	extern void * const child_return;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(cregs, 0, sizeof(struct pt_regs));
 		if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 406d7ee9e322..0b06b7809816 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1700,7 +1700,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/* Copy registers */
 	sp -= sizeof(struct pt_regs);
 	childregs = (struct pt_regs *) sp;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gpr[1] = sp + sizeof(struct pt_regs);
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..8deeb94eb51e 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -125,7 +125,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	/* p->thread holds context to be restored by __switch_to() */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* Kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->gp = gp_in_global;
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index e8858b2de24b..5ce90a23b1eb 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -130,7 +130,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
 	frame->sf.gprs[9] = (unsigned long)frame;
 
 	/* Store access registers to kernel stack of new process. */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		/* kernel thread */
 		memset(&frame->childregs, 0, sizeof(struct pt_regs));
 		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 1c28e3cddb60..0506a739b0a8 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
 		childregs->regs[4] = arg;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index 2dc0bf9fe62e..5386e56b5e6c 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -296,7 +296,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	ti->ksp = (unsigned long) new_stack;
 	p->thread.kregs = childregs;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		extern int nwindows;
 		unsigned long psr;
 		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index f5b2cac8669f..6058b3966f71 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -594,7 +594,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 				       sizeof(struct sparc_stackf));
 	t->fpsaved[0] = 0;
 
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		memset(child_trap_frame, 0, child_stack_sz);
 		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
 			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 82107373ac7e..2cb57aefacbe 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 		unsigned long arg, struct task_struct * p, unsigned long tls)
 {
 	void (*handler)(void);
-	int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
+	int kthread = current->flags & (PF_KTHREAD | PF_USER_WORKER);
 	int ret = 0;
 
 	p->thread = (struct thread_struct) INIT_THREAD;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..f3fc1b1db999 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -485,10 +485,10 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags)
 	set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
 
 	/*
-	 * No FPU state inheritance for kernel threads and IO
+	 * No FPU state inheritance for kernel threads and user
 	 * worker threads.
 	 */
-	if (dst->flags & (PF_KTHREAD | PF_IO_WORKER)) {
+	if (dst->flags & (PF_KTHREAD | PF_USER_WORKER)) {
 		/* Clear out the minimal state */
 		memcpy(&dst_fpu->fpstate->regs, &init_fpstate.regs,
 		       init_fpstate_copy_size());
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 04143a653a8a..7c451f234c26 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -195,7 +195,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 #endif
 
-	if (unlikely(p->flags & PF_IO_WORKER)) {
+	if (unlikely(p->flags & PF_USER_WORKER)) {
 		/*
 		 * An IO thread is a user space thread, but it doesn't
 		 * return to ret_after_fork().
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index bd80df890b1e..00d81668ead4 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -224,7 +224,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 #error Unsupported Xtensa ABI
 #endif
 
-	if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (!(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		struct pt_regs *regs = current_pt_regs();
 		unsigned long usp = usp_thread_fn ?
 			usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..5ca45456a77a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1683,6 +1683,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_USER_WORKER		0x01000000	/* Kernel thread cloned from userspace thread */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 65d5ec3a8a04..b7861cb6a74b 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,7 +18,8 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
-#define USER_WORKER_IO		BIT(0)
+#define USER_WORKER		BIT(0)
+#define USER_WORKER_IO		BIT(1)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6effd839d566..0d278ac26f2f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2032,6 +2032,9 @@ static __latent_entropy struct task_struct *copy_process(
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
 	}
 
+	if (args->worker_flags & USER_WORKER)
+		p->flags |= PF_USER_WORKER;
+
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
 	 * to any of the bad_fork_* labels. This is to avoid freeing
@@ -2524,7 +2527,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.stack		= (unsigned long)fn,
 		.stack_size	= (unsigned long)arg,
-		.worker_flags	= USER_WORKER_IO,
+		.worker_flags	= USER_WORKER | USER_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..f504ff88a09d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1028,12 +1028,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * Do not throttle user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
+	    current->flags & (PF_USER_WORKER|PF_KTHREAD))
 		return;
 
 	/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 03/10] fork: add USER_WORKER flag to not dup/clone files
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Christoph Hellwig

Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it calls the user_worker() function added in
the next patch we can't dup or clone the parent's files/FDS because it
would do an extra increment on ourself.

Later, when we do:

Qemu process exits:
        do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or file_struct.

This patch adds a no_files option so these worker threads can prevent
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b7861cb6a74b..e094502b2eea 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;
 
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
+#define USER_WORKER_NO_FILES	BIT(2)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d278ac26f2f..a47bec00f37c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1529,7 +1529,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+		      int no_files)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1541,6 +1542,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldf)
 		goto out;
 
+	if (no_files) {
+		tsk->files = NULL;
+		goto out;
+	}
+
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2179,7 +2185,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p);
+	retval = copy_files(clone_flags, p,
+			    args->worker_flags & USER_WORKER_NO_FILES);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 03/10] fork: add USER_WORKER flag to not dup/clone files
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it calls the user_worker() function added in
the next patch we can't dup or clone the parent's files/FDS because it
would do an extra increment on ourself.

Later, when we do:

Qemu process exits:
        do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or file_struct.

This patch adds a no_files option so these worker threads can prevent
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b7861cb6a74b..e094502b2eea 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;
 
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
+#define USER_WORKER_NO_FILES	BIT(2)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d278ac26f2f..a47bec00f37c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1529,7 +1529,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+		      int no_files)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1541,6 +1542,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldf)
 		goto out;
 
+	if (no_files) {
+		tsk->files = NULL;
+		goto out;
+	}
+
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2179,7 +2185,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p);
+	retval = copy_files(clone_flags, p,
+			    args->worker_flags & USER_WORKER_NO_FILES);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 04/10] fork: Add USER_WORKER flag to ignore signals
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Christoph Hellwig

From: Christian Brauner <christian.brauner@ubuntu.com>

Since this is mirroring kthread's sig ignore api introduced in

commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")

this patch adds an option flag, USER_WORKER_SIG_IGN, handled in
copy_process() after copy_sighand() and copy_signals() to ease the
transition from kthreads to this new api.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h | 1 +
 kernel/fork.c              | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index e094502b2eea..f8a658700075 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
 #define USER_WORKER_NO_FILES	BIT(2)
+#define USER_WORKER_SIG_IGN	BIT(3)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index a47bec00f37c..c9152596a285 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2211,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
+	if (args->worker_flags & USER_WORKER_SIG_IGN)
+		ignore_signals(p);
+
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 04/10] fork: Add USER_WORKER flag to ignore signals
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

From: Christian Brauner <christian.brauner@ubuntu.com>

Since this is mirroring kthread's sig ignore api introduced in

commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")

this patch adds an option flag, USER_WORKER_SIG_IGN, handled in
copy_process() after copy_sighand() and copy_signals() to ease the
transition from kthreads to this new api.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h | 1 +
 kernel/fork.c              | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index e094502b2eea..f8a658700075 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
 #define USER_WORKER_NO_FILES	BIT(2)
+#define USER_WORKER_SIG_IGN	BIT(3)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index a47bec00f37c..c9152596a285 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2211,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
+	if (args->worker_flags & USER_WORKER_SIG_IGN)
+		ignore_signals(p);
+
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 05/10] signal: Perfom autoreap for PF_USER_WORKER
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

Userspace doesn't know about PF_USER_WORKER threads, so it can't do wait
to clean them up. For cases like where qemu will do dynamic/hot add/remove
of vhost devices, then we need to auto reap the thread like was done for
the kthread case, because qemu does not know what API the kernel/vhost
layer is using.

This has us do autoreaping for these threads similar to when the parent
ignores SIGCHLD and for kthreads.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 kernel/signal.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a629b11bf3e0..4ce2cc195269 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2071,9 +2071,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 
 	psig = tsk->parent->sighand;
 	spin_lock_irqsave(&psig->siglock, flags);
-	if (!tsk->ptrace && sig == SIGCHLD &&
+	if (!tsk->ptrace && (tsk->flags & PF_USER_WORKER || (sig == SIGCHLD &&
 	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
-	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
+	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))))) {
 		/*
 		 * We are exiting and our parent doesn't care.  POSIX.1
 		 * defines special semantics for setting SIGCHLD to SIG_IGN
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 05/10] signal: Perfom autoreap for PF_USER_WORKER
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

Userspace doesn't know about PF_USER_WORKER threads, so it can't do wait
to clean them up. For cases like where qemu will do dynamic/hot add/remove
of vhost devices, then we need to auto reap the thread like was done for
the kthread case, because qemu does not know what API the kernel/vhost
layer is using.

This has us do autoreaping for these threads similar to when the parent
ignores SIGCHLD and for kthreads.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 kernel/signal.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a629b11bf3e0..4ce2cc195269 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2071,9 +2071,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 
 	psig = tsk->parent->sighand;
 	spin_lock_irqsave(&psig->siglock, flags);
-	if (!tsk->ptrace && sig == SIGCHLD &&
+	if (!tsk->ptrace && (tsk->flags & PF_USER_WORKER || (sig == SIGCHLD &&
 	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
-	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
+	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))))) {
 		/*
 		 * We are exiting and our parent doesn't care.  POSIX.1
 		 * defines special semantics for setting SIGCHLD to SIG_IGN
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 06/10] fork: add helpers to clone a process for kernel use
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Christoph Hellwig

The vhost layer is creating kthreads to execute IO and management
operations. These threads need to share a mm with a userspace thread,
inherit cgroups, and we would like to have the thread accounted for
under the userspace thread's rlimit nproc value so a user can't overwhelm
the system with threads when creating VMs.

We have helpers for cgroups and mm but not for the rlimit nproc and in
the future we will probably want helpers for things like namespaces. For
those two items and to allow future sharing/inheritance, this patch adds
two helpers, user_worker_create and user_worker_start that allow callers
to create threads that copy or inherit the caller's attributes like mm,
cgroups, namespaces, etc, and are accounted for under the callers rlimits
nproc value similar to if the caller did a clone() in userspace. However,
instead of returning to userspace the thread is usable in the kernel for
modules like vhost or layers like io_uring.

[added flag validation code from Christian Brauner's SIG_IGN patch]
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h |  5 +++
 kernel/fork.c              | 72 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f8a658700075..ecb21c0d95ce 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -95,6 +95,11 @@ struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
 int kernel_wait(pid_t pid, int *stat);
+struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
+				       unsigned long clone_flags,
+				       u32 worker_flags);
+__printf(2, 3)
+void user_worker_start(struct task_struct *tsk, const char namefmt[], ...);
 
 extern void free_task(struct task_struct *tsk);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c9152596a285..e72239ae1e08 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2543,6 +2543,78 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	return copy_process(NULL, 0, node, &args);
 }
 
+static bool user_worker_flags_valid(struct kernel_clone_args *kargs)
+{
+	/* Verify that no unknown flags are passed along. */
+	if (kargs->worker_flags & ~(USER_WORKER_IO | USER_WORKER |
+				    USER_WORKER_NO_FILES | USER_WORKER_SIG_IGN))
+		return false;
+
+	/*
+	 * If we're ignoring all signals don't allow sharing struct sighand and
+	 * don't bother clearing signal handlers.
+	 */
+	if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
+	    (kargs->worker_flags & USER_WORKER_SIG_IGN))
+		return false;
+
+	return true;
+}
+
+/**
+ * user_worker_create - create a copy of a process to be used by the kernel
+ * @fn: thread stack
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ * @clone_flags: CLONE flags
+ * @worker_flags: USER_WORKER flags
+ *
+ * This returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through user_worker_start(). If
+ * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
+ */
+struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
+				       unsigned long clone_flags,
+				       u32 worker_flags)
+{
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
+				   CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.worker_flags	= USER_WORKER | worker_flags,
+	};
+
+	if (!user_worker_flags_valid(&args))
+		return ERR_PTR(-EINVAL);
+
+	return copy_process(NULL, 0, node, &args);
+}
+EXPORT_SYMBOL_GPL(user_worker_create);
+
+/**
+ * user_worker_start - Start a task created with user_worker_create
+ * @tsk: task to wake up
+ * @namefmt: printf-style format string for the thread name
+ * @arg: arguments for @namefmt
+ */
+void user_worker_start(struct task_struct *tsk, const char namefmt[], ...)
+{
+	char name[TASK_COMM_LEN];
+	va_list args;
+
+	WARN_ON(!(tsk->flags & PF_USER_WORKER));
+
+	va_start(args, namefmt);
+	vsnprintf(name, sizeof(name), namefmt, args);
+	set_task_comm(tsk, name);
+	va_end(args);
+
+	wake_up_new_task(tsk);
+}
+EXPORT_SYMBOL_GPL(user_worker_start);
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 06/10] fork: add helpers to clone a process for kernel use
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

The vhost layer is creating kthreads to execute IO and management
operations. These threads need to share a mm with a userspace thread,
inherit cgroups, and we would like to have the thread accounted for
under the userspace thread's rlimit nproc value so a user can't overwhelm
the system with threads when creating VMs.

We have helpers for cgroups and mm but not for the rlimit nproc and in
the future we will probably want helpers for things like namespaces. For
those two items and to allow future sharing/inheritance, this patch adds
two helpers, user_worker_create and user_worker_start that allow callers
to create threads that copy or inherit the caller's attributes like mm,
cgroups, namespaces, etc, and are accounted for under the callers rlimits
nproc value similar to if the caller did a clone() in userspace. However,
instead of returning to userspace the thread is usable in the kernel for
modules like vhost or layers like io_uring.

[added flag validation code from Christian Brauner's SIG_IGN patch]
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/sched/task.h |  5 +++
 kernel/fork.c              | 72 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f8a658700075..ecb21c0d95ce 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -95,6 +95,11 @@ struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
 int kernel_wait(pid_t pid, int *stat);
+struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
+				       unsigned long clone_flags,
+				       u32 worker_flags);
+__printf(2, 3)
+void user_worker_start(struct task_struct *tsk, const char namefmt[], ...);
 
 extern void free_task(struct task_struct *tsk);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c9152596a285..e72239ae1e08 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2543,6 +2543,78 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	return copy_process(NULL, 0, node, &args);
 }
 
+static bool user_worker_flags_valid(struct kernel_clone_args *kargs)
+{
+	/* Verify that no unknown flags are passed along. */
+	if (kargs->worker_flags & ~(USER_WORKER_IO | USER_WORKER |
+				    USER_WORKER_NO_FILES | USER_WORKER_SIG_IGN))
+		return false;
+
+	/*
+	 * If we're ignoring all signals don't allow sharing struct sighand and
+	 * don't bother clearing signal handlers.
+	 */
+	if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
+	    (kargs->worker_flags & USER_WORKER_SIG_IGN))
+		return false;
+
+	return true;
+}
+
+/**
+ * user_worker_create - create a copy of a process to be used by the kernel
+ * @fn: thread stack
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ * @clone_flags: CLONE flags
+ * @worker_flags: USER_WORKER flags
+ *
+ * This returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through user_worker_start(). If
+ * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
+ */
+struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
+				       unsigned long clone_flags,
+				       u32 worker_flags)
+{
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
+				   CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.worker_flags	= USER_WORKER | worker_flags,
+	};
+
+	if (!user_worker_flags_valid(&args))
+		return ERR_PTR(-EINVAL);
+
+	return copy_process(NULL, 0, node, &args);
+}
+EXPORT_SYMBOL_GPL(user_worker_create);
+
+/**
+ * user_worker_start - Start a task created with user_worker_create
+ * @tsk: task to wake up
+ * @namefmt: printf-style format string for the thread name
+ * @arg: arguments for @namefmt
+ */
+void user_worker_start(struct task_struct *tsk, const char namefmt[], ...)
+{
+	char name[TASK_COMM_LEN];
+	va_list args;
+
+	WARN_ON(!(tsk->flags & PF_USER_WORKER));
+
+	va_start(args, namefmt);
+	vsnprintf(name, sizeof(name), namefmt, args);
+	set_task_comm(tsk, name);
+	va_end(args);
+
+	wake_up_new_task(tsk);
+}
+EXPORT_SYMBOL_GPL(user_worker_start);
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 07/10] io_uring: switch to user_worker
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

The user_worker_create/start helpers allow callers to create threads that
copy or inherit the caller's attributes like mm, cgroups, namespaces, etc,
and are accounted for under the callers rlimits nproc value similar to if
the caller did a clone() in userspace. However, instead of returning to
userspace the thread is usable in the kernel for modules like vhost or
layers like io_uring.

The user_worker_create function provides the same functionality as
create_io_thread while also allowing different signal and file handling
for users like vhost, so this patch converts io_uring and io-wq. The next
patch will then remove the create_io_thread code.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 fs/io-wq.c    | 15 ++++++++-------
 fs/io_uring.c | 11 +++++------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 88202de519f6..761796c14d3b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -70,6 +70,9 @@ struct io_worker {
 
 #define IO_WQ_NR_HASH_BUCKETS	(1u << IO_WQ_HASH_ORDER)
 
+#define IO_WQ_CLONE_FLAGS	(CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
+				 CLONE_THREAD | CLONE_IO)
+
 struct io_wqe_acct {
 	unsigned nr_workers;
 	unsigned max_workers;
@@ -600,13 +603,9 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	bool last_timeout = false;
-	char buf[TASK_COMM_LEN];
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
 
-	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
-	set_task_comm(current, buf);
-
 	audit_alloc_kernel(current);
 
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
@@ -704,7 +703,7 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
 	worker->flags |= IO_WORKER_F_FREE;
 	raw_spin_unlock(&wqe->lock);
-	wake_up_new_task(tsk);
+	user_worker_start(tsk, "iou-wrk-%d", wqe->wq->task->pid);
 }
 
 static bool io_wq_work_match_all(struct io_wq_work *work, void *data)
@@ -734,7 +733,8 @@ static void create_worker_cont(struct callback_head *cb)
 	worker = container_of(cb, struct io_worker, create_work);
 	clear_bit_unlock(0, &worker->create_state);
 	wqe = worker->wqe;
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = user_worker_create(io_wqe_worker, worker, wqe->node,
+				 IO_WQ_CLONE_FLAGS, USER_WORKER_IO);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 		io_worker_release(worker);
@@ -801,7 +801,8 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	if (index == IO_WQ_ACCT_BOUND)
 		worker->flags |= IO_WORKER_F_BOUND;
 
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = user_worker_create(io_wqe_worker, worker, wqe->node,
+				 IO_WQ_CLONE_FLAGS, USER_WORKER_IO);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 	} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c4f217613f56..d275f01c0828 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7475,12 +7475,8 @@ static int io_sq_thread(void *data)
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
-	char buf[TASK_COMM_LEN];
 	DEFINE_WAIT(wait);
 
-	snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
-	set_task_comm(current, buf);
-
 	if (sqd->sq_cpu != -1)
 		set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
 	else
@@ -8709,6 +8705,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+					CLONE_THREAD | CLONE_IO;
 		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 		bool attached;
@@ -8754,7 +8752,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->task_pid = current->pid;
 		sqd->task_tgid = current->tgid;
-		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		tsk = user_worker_create(io_sq_thread, sqd, NUMA_NO_NODE, flags,
+					 USER_WORKER_IO);
 		if (IS_ERR(tsk)) {
 			ret = PTR_ERR(tsk);
 			goto err_sqpoll;
@@ -8762,7 +8761,7 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->thread = tsk;
 		ret = io_uring_alloc_task_context(tsk, ctx);
-		wake_up_new_task(tsk);
+		user_worker_start(tsk, "iou-sqp-%d", sqd->task_pid);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 07/10] io_uring: switch to user_worker
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

The user_worker_create/start helpers allow callers to create threads that
copy or inherit the caller's attributes like mm, cgroups, namespaces, etc,
and are accounted for under the callers rlimits nproc value similar to if
the caller did a clone() in userspace. However, instead of returning to
userspace the thread is usable in the kernel for modules like vhost or
layers like io_uring.

The user_worker_create function provides the same functionality as
create_io_thread while also allowing different signal and file handling
for users like vhost, so this patch converts io_uring and io-wq. The next
patch will then remove the create_io_thread code.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 fs/io-wq.c    | 15 ++++++++-------
 fs/io_uring.c | 11 +++++------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 88202de519f6..761796c14d3b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -70,6 +70,9 @@ struct io_worker {
 
 #define IO_WQ_NR_HASH_BUCKETS	(1u << IO_WQ_HASH_ORDER)
 
+#define IO_WQ_CLONE_FLAGS	(CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
+				 CLONE_THREAD | CLONE_IO)
+
 struct io_wqe_acct {
 	unsigned nr_workers;
 	unsigned max_workers;
@@ -600,13 +603,9 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	bool last_timeout = false;
-	char buf[TASK_COMM_LEN];
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
 
-	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
-	set_task_comm(current, buf);
-
 	audit_alloc_kernel(current);
 
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
@@ -704,7 +703,7 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
 	worker->flags |= IO_WORKER_F_FREE;
 	raw_spin_unlock(&wqe->lock);
-	wake_up_new_task(tsk);
+	user_worker_start(tsk, "iou-wrk-%d", wqe->wq->task->pid);
 }
 
 static bool io_wq_work_match_all(struct io_wq_work *work, void *data)
@@ -734,7 +733,8 @@ static void create_worker_cont(struct callback_head *cb)
 	worker = container_of(cb, struct io_worker, create_work);
 	clear_bit_unlock(0, &worker->create_state);
 	wqe = worker->wqe;
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = user_worker_create(io_wqe_worker, worker, wqe->node,
+				 IO_WQ_CLONE_FLAGS, USER_WORKER_IO);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 		io_worker_release(worker);
@@ -801,7 +801,8 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	if (index == IO_WQ_ACCT_BOUND)
 		worker->flags |= IO_WORKER_F_BOUND;
 
-	tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+	tsk = user_worker_create(io_wqe_worker, worker, wqe->node,
+				 IO_WQ_CLONE_FLAGS, USER_WORKER_IO);
 	if (!IS_ERR(tsk)) {
 		io_init_new_worker(wqe, worker, tsk);
 	} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c4f217613f56..d275f01c0828 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7475,12 +7475,8 @@ static int io_sq_thread(void *data)
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
-	char buf[TASK_COMM_LEN];
 	DEFINE_WAIT(wait);
 
-	snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
-	set_task_comm(current, buf);
-
 	if (sqd->sq_cpu != -1)
 		set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
 	else
@@ -8709,6 +8705,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+					CLONE_THREAD | CLONE_IO;
 		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 		bool attached;
@@ -8754,7 +8752,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->task_pid = current->pid;
 		sqd->task_tgid = current->tgid;
-		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		tsk = user_worker_create(io_sq_thread, sqd, NUMA_NO_NODE, flags,
+					 USER_WORKER_IO);
 		if (IS_ERR(tsk)) {
 			ret = PTR_ERR(tsk);
 			goto err_sqpoll;
@@ -8762,7 +8761,7 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 
 		sqd->thread = tsk;
 		ret = io_uring_alloc_task_context(tsk, ctx);
-		wake_up_new_task(tsk);
+		user_worker_start(tsk, "iou-sqp-%d", sqd->task_pid);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 08/10] fork: remove create_io_thread
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

create_io_thread is not used anymore so remove it.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 include/linux/sched/task.h |  1 -
 kernel/fork.c              | 22 ----------------------
 2 files changed, 23 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ecb21c0d95ce..cef69ce64e6f 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -89,7 +89,6 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index e72239ae1e08..3d4586502190 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2521,28 +2521,6 @@ struct mm_struct *copy_init_mm(void)
 	return dup_mm(NULL, &init_mm);
 }
 
-/*
- * This is like kernel_clone(), but shaved down and tailored to just
- * creating io_uring workers. It returns a created task, or an error pointer.
- * The returned task is inactive, and the caller must fire it up through
- * wake_up_new_task(p). All signals are blocked in the created task.
- */
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
-{
-	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
-		.worker_flags	= USER_WORKER | USER_WORKER_IO,
-	};
-
-	return copy_process(NULL, 0, node, &args);
-}
-
 static bool user_worker_flags_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 08/10] fork: remove create_io_thread
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie

create_io_thread is not used anymore so remove it.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 include/linux/sched/task.h |  1 -
 kernel/fork.c              | 22 ----------------------
 2 files changed, 23 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ecb21c0d95ce..cef69ce64e6f 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -89,7 +89,6 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index e72239ae1e08..3d4586502190 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2521,28 +2521,6 @@ struct mm_struct *copy_init_mm(void)
 	return dup_mm(NULL, &init_mm);
 }
 
-/*
- * This is like kernel_clone(), but shaved down and tailored to just
- * creating io_uring workers. It returns a created task, or an error pointer.
- * The returned task is inactive, and the caller must fire it up through
- * wake_up_new_task(p). All signals are blocked in the created task.
- */
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
-{
-	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
-		.worker_flags	= USER_WORKER | USER_WORKER_IO,
-	};
-
-	return copy_process(NULL, 0, node, &args);
-}
-
 static bool user_worker_flags_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 09/10] vhost: move worker thread fields to new struct
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Christoph Hellwig

This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 11 +++--
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
+			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	worker->kcov_handle = kcov_common_handle();
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	vhost_attach_mm(dev);
 
-	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
 err_mm:
 	return err;
 }
@@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 638bb640d6b4..102ce25e4e13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+struct vhost_worker {
+	struct task_struct	*task;
+	struct llist_head	work_list;
+	struct vhost_dev	*dev;
+	u64			kcov_handle;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -148,8 +155,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -159,7 +165,6 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
-	u64 kcov_handle;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev,
 			   struct vhost_iotlb_msg *msg);
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 09/10] vhost: move worker thread fields to new struct
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 11 +++--
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
+			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	worker->kcov_handle = kcov_common_handle();
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	vhost_attach_mm(dev);
 
-	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
 err_mm:
 	return err;
 }
@@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 638bb640d6b4..102ce25e4e13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+struct vhost_worker {
+	struct task_struct	*task;
+	struct llist_head	work_list;
+	struct vhost_dev	*dev;
+	u64			kcov_handle;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -148,8 +155,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -159,7 +165,6 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
-	u64 kcov_handle;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev,
 			   struct vhost_iotlb_msg *msg);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 10/10] vhost: use user_worker to check RLIMITs
  2021-11-29 19:46 ` Mike Christie
@ 2021-11-29 19:47   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Christoph Hellwig

For vhost workers we use the kthread API which inherit's its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked. This patch has us use the user_worker helpers which will
inherit its values/checks from the thread that owns the device similar to
if we did a clone in userspace.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/vhost/vhost.c | 65 +++++++++++++++----------------------------
 drivers/vhost/vhost.h |  7 ++++-
 2 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c9a1f706989c..8cf259d798c0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,7 +22,6 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/kthread.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
 #include <linux/sched/mm.h>
@@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -376,8 +372,9 @@ static int vhost_worker(void *data)
 				schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
-	return 0;
+
+	complete(worker->exit_done);
+	do_exit(0);
 }
 
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
@@ -517,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -579,6 +551,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_stop(struct vhost_worker *worker)
+{
+	DECLARE_COMPLETION_ONSTACK(exit_done);
+
+	worker->exit_done = &exit_done;
+	set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags);
+	wake_up_process(worker->task);
+	wait_for_completion(worker->exit_done);
+}
+
 static void vhost_worker_free(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker = dev->worker;
@@ -588,7 +570,7 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_worker_stop(worker);
 	kfree(worker);
 }
 
@@ -603,27 +585,24 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	/*
+	 * vhost used to use the kthread API which ignores all signals by
+	 * default and the drivers expect this behavior.
+	 */
+	task = user_worker_create(vhost_worker, worker, NUMA_NO_NODE, CLONE_FS,
+				  USER_WORKER_NO_FILES | USER_WORKER_SIG_IGN);
 	if (IS_ERR(task)) {
 		ret = PTR_ERR(task);
 		goto free_worker;
 	}
 
 	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
-
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
-
+	user_worker_start(task, "vhost-%d", current->pid);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 102ce25e4e13..09748694cb66 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,11 +25,16 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+enum {
+	VHOST_WORKER_FLAG_STOP,
+};
+
 struct vhost_worker {
 	struct task_struct	*task;
+	struct completion	*exit_done;
 	struct llist_head	work_list;
-	struct vhost_dev	*dev;
 	u64			kcov_handle;
+	unsigned long		flags;
 };
 
 /* Poll a file (eventfd or socket) */
-- 
2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH V6 10/10] vhost: use user_worker to check RLIMITs
@ 2021-11-29 19:47   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-11-29 19:47 UTC (permalink / raw)
  To: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel
  Cc: Mike Christie, Christoph Hellwig

For vhost workers we use the kthread API which inherit's its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked. This patch has us use the user_worker helpers which will
inherit its values/checks from the thread that owns the device similar to
if we did a clone in userspace.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/vhost/vhost.c | 65 +++++++++++++++----------------------------
 drivers/vhost/vhost.h |  7 ++++-
 2 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c9a1f706989c..8cf259d798c0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,7 +22,6 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/kthread.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
 #include <linux/sched/mm.h>
@@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -376,8 +372,9 @@ static int vhost_worker(void *data)
 				schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
-	return 0;
+
+	complete(worker->exit_done);
+	do_exit(0);
 }
 
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
@@ -517,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -579,6 +551,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_stop(struct vhost_worker *worker)
+{
+	DECLARE_COMPLETION_ONSTACK(exit_done);
+
+	worker->exit_done = &exit_done;
+	set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags);
+	wake_up_process(worker->task);
+	wait_for_completion(worker->exit_done);
+}
+
 static void vhost_worker_free(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker = dev->worker;
@@ -588,7 +570,7 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_worker_stop(worker);
 	kfree(worker);
 }
 
@@ -603,27 +585,24 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	/*
+	 * vhost used to use the kthread API which ignores all signals by
+	 * default and the drivers expect this behavior.
+	 */
+	task = user_worker_create(vhost_worker, worker, NUMA_NO_NODE, CLONE_FS,
+				  USER_WORKER_NO_FILES | USER_WORKER_SIG_IGN);
 	if (IS_ERR(task)) {
 		ret = PTR_ERR(task);
 		goto free_worker;
 	}
 
 	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
-
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
-
+	user_worker_start(task, "vhost-%d", current->pid);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 102ce25e4e13..09748694cb66 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,11 +25,16 @@ struct vhost_work {
 	unsigned long		flags;
 };
 
+enum {
+	VHOST_WORKER_FLAG_STOP,
+};
+
 struct vhost_worker {
 	struct task_struct	*task;
+	struct completion	*exit_done;
 	struct llist_head	work_list;
-	struct vhost_dev	*dev;
 	u64			kcov_handle;
+	unsigned long		flags;
 };
 
 /* Poll a file (eventfd or socket) */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-11-29 19:46 ` Mike Christie
@ 2021-12-08 20:34   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2021-12-08 20:34 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On Mon, Nov 29, 2021 at 01:46:57PM -0600, Mike Christie wrote:
> The following patches made over Linus's tree, allow the vhost layer to do
> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> io_uring does a copy_process against its userspace app. This allows the
> vhost layer's worker threads to inherit cgroups, namespaces, address
> space, etc and this worker thread will also be accounted for against that
> owner/parent process's RLIMIT_NPROC limit.
> 
> If you are not familiar with qemu and vhost here is more detailed
> problem description:
> 
> Qemu will create vhost devices in the kernel which perform network, SCSI,
> etc IO and management operations from worker threads created by the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
> 
> The problem with this approach is that we then have to add new functions/
> args/functionality for every thing we want to inherit. I started doing
> that here:
> 
> https://lkml.org/lkml/2021/6/23/1233
> 
> for the RLIMIT_NPROC check, but it seems it might be easier to just
> inherit everything from the beginning, becuase I'd need to do something
> like that patch several times.


So who's merging this? Me? Did all patches get acked by appropriate
maintainers?

> V6:
> - Rename kernel_worker to user_worker and fix prefixes.
> - Add better patch descriptions.
> V5:
> - Handle kbuild errors by building patchset against current kernel that
>   has all deps merged. Also add patch to remove create_io_thread code as
>   it's not used anymore.
> - Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
>   case added in 5.16-rc1.
> - Add PF_USER_WORKER flag so we can check it later after the initial
>   thread creation for the wake up, vm and singal cses.
> - Added patch to auto reap the worker thread.
> V4:
> - Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
> - Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
>   added the new kernel worker functions.
> - Fixed extra "i" issue.
> - Added PF_USER_WORKER flag and added check that kernel_worker_start users
>   had that flag set. Also dropped patches that passed worker flags to
>   copy_thread and replaced with PF_USER_WORKER check.
> V3:
> - Add parentheses in p->flag and work_flags check in copy_thread.
> - Fix check in arm/arm64 which was doing the reverse of other archs
>   where it did likely(!flags) instead of unlikely(flags).
> V2:
> - Rename kernel_copy_process to kernel_worker.
> - Instead of exporting functions, make kernel_worker() a proper
>   function/API that does common work for the caller.
> - Instead of adding new fields to kernel_clone_args for each option
>   make it flag based similar to CLONE_*.
> - Drop unused completion struct in vhost.
> - Fix compile warnings by merging vhost cgroup cleanup patch and
>   vhost conversion patch.
> ~                              
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-08 20:34   ` Michael S. Tsirkin
  0 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2021-12-08 20:34 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, linux-kernel, virtualization, hch, vverma, geert,
	stefanha, christian.brauner

On Mon, Nov 29, 2021 at 01:46:57PM -0600, Mike Christie wrote:
> The following patches made over Linus's tree, allow the vhost layer to do
> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> io_uring does a copy_process against its userspace app. This allows the
> vhost layer's worker threads to inherit cgroups, namespaces, address
> space, etc and this worker thread will also be accounted for against that
> owner/parent process's RLIMIT_NPROC limit.
> 
> If you are not familiar with qemu and vhost here is more detailed
> problem description:
> 
> Qemu will create vhost devices in the kernel which perform network, SCSI,
> etc IO and management operations from worker threads created by the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
> 
> The problem with this approach is that we then have to add new functions/
> args/functionality for every thing we want to inherit. I started doing
> that here:
> 
> https://lkml.org/lkml/2021/6/23/1233
> 
> for the RLIMIT_NPROC check, but it seems it might be easier to just
> inherit everything from the beginning, becuase I'd need to do something
> like that patch several times.


So who's merging this? Me? Did all patches get acked by appropriate
maintainers?

> V6:
> - Rename kernel_worker to user_worker and fix prefixes.
> - Add better patch descriptions.
> V5:
> - Handle kbuild errors by building patchset against current kernel that
>   has all deps merged. Also add patch to remove create_io_thread code as
>   it's not used anymore.
> - Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
>   case added in 5.16-rc1.
> - Add PF_USER_WORKER flag so we can check it later after the initial
>   thread creation for the wake up, vm and singal cses.
> - Added patch to auto reap the worker thread.
> V4:
> - Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
> - Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
>   added the new kernel worker functions.
> - Fixed extra "i" issue.
> - Added PF_USER_WORKER flag and added check that kernel_worker_start users
>   had that flag set. Also dropped patches that passed worker flags to
>   copy_thread and replaced with PF_USER_WORKER check.
> V3:
> - Add parentheses in p->flag and work_flags check in copy_thread.
> - Fix check in arm/arm64 which was doing the reverse of other archs
>   where it did likely(!flags) instead of unlikely(flags).
> V2:
> - Rename kernel_copy_process to kernel_worker.
> - Instead of exporting functions, make kernel_worker() a proper
>   function/API that does common work for the caller.
> - Instead of adding new fields to kernel_clone_args for each option
>   make it flag based similar to CLONE_*.
> - Drop unused completion struct in vhost.
> - Fix compile warnings by merging vhost cgroup cleanup patch and
>   vhost conversion patch.
> ~                              
> 

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-08 20:34   ` Michael S. Tsirkin
@ 2021-12-08 22:13     ` michael.christie
  -1 siblings, 0 replies; 57+ messages in thread
From: michael.christie @ 2021-12-08 22:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 12/8/21 2:34 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 29, 2021 at 01:46:57PM -0600, Mike Christie wrote:
>> The following patches made over Linus's tree, allow the vhost layer to do
>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>> io_uring does a copy_process against its userspace app. This allows the
>> vhost layer's worker threads to inherit cgroups, namespaces, address
>> space, etc and this worker thread will also be accounted for against that
>> owner/parent process's RLIMIT_NPROC limit.
>>
>> If you are not familiar with qemu and vhost here is more detailed
>> problem description:
>>
>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>> etc IO and management operations from worker threads created by the
>> kthread API. Because the kthread API does a copy_process on the kthreadd
>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>> thread's cgroups.
>>
>> The problem with this approach is that we then have to add new functions/
>> args/functionality for every thing we want to inherit. I started doing
>> that here:
>>
>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!ceUHd4m4MTJFOGccB9N5r7WonxVoYYT2XPiYwWt2-Vt1Y-DmQirRN8OqKozFLN1h73N6$ 
>>
>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>> inherit everything from the beginning, becuase I'd need to do something
>> like that patch several times.
> 
> 
> So who's merging this? Me? Did all patches get acked by appropriate
> maintainers?
> 

Not yet.

Jens, The last open review comment was from you for the name change
and additional patch description info.

In this posting, I changed the name from:

kernel_worker/kernel_worker_start

to

user_worker_create/user_worker_start

I didn't do the start/create_user_worker format originally discussed
because a while back Christoph had given me a review comment about trying
to tie everything together into an API. Everything having the user_worker
prefix felt nicer in that it was easy to tell the functions and flags went
together, and so I thought it would handle his comment too.

And patch:

[PATCH V6 07/10] io_uring: switch to user_worker

should better explain the reason for the switch.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-08 22:13     ` michael.christie
  0 siblings, 0 replies; 57+ messages in thread
From: michael.christie @ 2021-12-08 22:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: axboe, hdanton, linux-kernel, virtualization, hch, vverma, geert,
	stefanha, christian.brauner

On 12/8/21 2:34 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 29, 2021 at 01:46:57PM -0600, Mike Christie wrote:
>> The following patches made over Linus's tree, allow the vhost layer to do
>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>> io_uring does a copy_process against its userspace app. This allows the
>> vhost layer's worker threads to inherit cgroups, namespaces, address
>> space, etc and this worker thread will also be accounted for against that
>> owner/parent process's RLIMIT_NPROC limit.
>>
>> If you are not familiar with qemu and vhost here is more detailed
>> problem description:
>>
>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>> etc IO and management operations from worker threads created by the
>> kthread API. Because the kthread API does a copy_process on the kthreadd
>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>> thread's cgroups.
>>
>> The problem with this approach is that we then have to add new functions/
>> args/functionality for every thing we want to inherit. I started doing
>> that here:
>>
>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!ceUHd4m4MTJFOGccB9N5r7WonxVoYYT2XPiYwWt2-Vt1Y-DmQirRN8OqKozFLN1h73N6$ 
>>
>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>> inherit everything from the beginning, becuase I'd need to do something
>> like that patch several times.
> 
> 
> So who's merging this? Me? Did all patches get acked by appropriate
> maintainers?
> 

Not yet.

Jens, The last open review comment was from you for the name change
and additional patch description info.

In this posting, I changed the name from:

kernel_worker/kernel_worker_start

to

user_worker_create/user_worker_start

I didn't do the start/create_user_worker format originally discussed
because a while back Christoph had given me a review comment about trying
to tie everything together into an API. Everything having the user_worker
prefix felt nicer in that it was easy to tell the functions and flags went
together, and so I thought it would handle his comment too.

And patch:

[PATCH V6 07/10] io_uring: switch to user_worker

should better explain the reason for the switch.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-08 22:13     ` michael.christie
  (?)
@ 2021-12-09  9:32     ` Christian Brauner
  -1 siblings, 0 replies; 57+ messages in thread
From: Christian Brauner @ 2021-12-09  9:32 UTC (permalink / raw)
  To: michael.christie
  Cc: Michael S. Tsirkin, geert, vverma, hdanton, hch, stefanha,
	jasowang, sgarzare, virtualization, axboe, linux-kernel

On Wed, Dec 08, 2021 at 04:13:27PM -0600, michael.christie@oracle.com wrote:
> On 12/8/21 2:34 PM, Michael S. Tsirkin wrote:
> > On Mon, Nov 29, 2021 at 01:46:57PM -0600, Mike Christie wrote:
> >> The following patches made over Linus's tree, allow the vhost layer to do
> >> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> >> io_uring does a copy_process against its userspace app. This allows the
> >> vhost layer's worker threads to inherit cgroups, namespaces, address
> >> space, etc and this worker thread will also be accounted for against that
> >> owner/parent process's RLIMIT_NPROC limit.
> >>
> >> If you are not familiar with qemu and vhost here is more detailed
> >> problem description:
> >>
> >> Qemu will create vhost devices in the kernel which perform network, SCSI,
> >> etc IO and management operations from worker threads created by the
> >> kthread API. Because the kthread API does a copy_process on the kthreadd
> >> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> >> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> >> thread's cgroups.
> >>
> >> The problem with this approach is that we then have to add new functions/
> >> args/functionality for every thing we want to inherit. I started doing
> >> that here:
> >>
> >> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!ceUHd4m4MTJFOGccB9N5r7WonxVoYYT2XPiYwWt2-Vt1Y-DmQirRN8OqKozFLN1h73N6$ 
> >>
> >> for the RLIMIT_NPROC check, but it seems it might be easier to just
> >> inherit everything from the beginning, becuase I'd need to do something
> >> like that patch several times.
> > 
> > 
> > So who's merging this? Me? Did all patches get acked by appropriate
> > maintainers?
> > 
> 
> Not yet.
> 
> Jens, The last open review comment was from you for the name change
> and additional patch description info.
> 
> In this posting, I changed the name from:
> 
> kernel_worker/kernel_worker_start
> 
> to
> 
> user_worker_create/user_worker_start
> 
> I didn't do the start/create_user_worker format originally discussed
> because a while back Christoph had given me a review comment about trying
> to tie everything together into an API. Everything having the user_worker
> prefix felt nicer in that it was easy to tell the functions and flags went
> together, and so I thought it would handle his comment too.
> 
> And patch:
> 
> [PATCH V6 07/10] io_uring: switch to user_worker
> 
> should better explain the reason for the switch.

I was waiting on Jens to give his review since io_uring is currently the
biggest users of this before the vhost switch.

Christian

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 05/10] signal: Perfom autoreap for PF_USER_WORKER
  2021-11-29 19:47   ` Mike Christie
@ 2021-12-17 18:42     ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 18:42 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

Mike Christie <michael.christie@oracle.com> writes:

> Userspace doesn't know about PF_USER_WORKER threads, so it can't do wait
> to clean them up. For cases like where qemu will do dynamic/hot add/remove
> of vhost devices, then we need to auto reap the thread like was done for
> the kthread case, because qemu does not know what API the kernel/vhost
> layer is using.
>
> This has us do autoreaping for these threads similar to when the parent
> ignores SIGCHLD and for kthreads.

There is a lot wrong with this change.
1) you can just set "task->signal = SIGCHLD" to get this
   behavior so it is unnecessary.

2) This is not the autoreaping you want.  This autoreaping just kicks
   in when the parents signal handler is SIG_IGN.  Since I presume
   you are not controlling the parent this is just plain nonsense.

The autoreap you want is the autoreap in exit_notify, and you don't
want to call do_notify_parent at all.

Eric

> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  kernel/signal.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..4ce2cc195269 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2071,9 +2071,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  
>  	psig = tsk->parent->sighand;
>  	spin_lock_irqsave(&psig->siglock, flags);
> -	if (!tsk->ptrace && sig == SIGCHLD &&
> +	if (!tsk->ptrace && (tsk->flags & PF_USER_WORKER || (sig == SIGCHLD &&
>  	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
> -	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> +	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))))) {
>  		/*
>  		 * We are exiting and our parent doesn't care.  POSIX.1
>  		 * defines special semantics for setting SIGCHLD to SIG_IGN

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 05/10] signal: Perfom autoreap for PF_USER_WORKER
@ 2021-12-17 18:42     ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 18:42 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

Mike Christie <michael.christie@oracle.com> writes:

> Userspace doesn't know about PF_USER_WORKER threads, so it can't do wait
> to clean them up. For cases like where qemu will do dynamic/hot add/remove
> of vhost devices, then we need to auto reap the thread like was done for
> the kthread case, because qemu does not know what API the kernel/vhost
> layer is using.
>
> This has us do autoreaping for these threads similar to when the parent
> ignores SIGCHLD and for kthreads.

There is a lot wrong with this change.
1) you can just set "task->signal = SIGCHLD" to get this
   behavior so it is unnecessary.

2) This is not the autoreaping you want.  This autoreaping just kicks
   in when the parents signal handler is SIG_IGN.  Since I presume
   you are not controlling the parent this is just plain nonsense.

The autoreap you want is the autoreap in exit_notify, and you don't
want to call do_notify_parent at all.

Eric

> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  kernel/signal.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a629b11bf3e0..4ce2cc195269 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2071,9 +2071,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  
>  	psig = tsk->parent->sighand;
>  	spin_lock_irqsave(&psig->siglock, flags);
> -	if (!tsk->ptrace && sig == SIGCHLD &&
> +	if (!tsk->ptrace && (tsk->flags & PF_USER_WORKER || (sig == SIGCHLD &&
>  	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
> -	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> +	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))))) {
>  		/*
>  		 * We are exiting and our parent doesn't care.  POSIX.1
>  		 * defines special semantics for setting SIGCHLD to SIG_IGN
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 06/10] fork: add helpers to clone a process for kernel use
  2021-11-29 19:47   ` Mike Christie
@ 2021-12-17 18:53     ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 18:53 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel,
	Christoph Hellwig

Mike Christie <michael.christie@oracle.com> writes:

> The vhost layer is creating kthreads to execute IO and management
> operations. These threads need to share a mm with a userspace thread,
> inherit cgroups, and we would like to have the thread accounted for
> under the userspace thread's rlimit nproc value so a user can't overwhelm
> the system with threads when creating VMs.
>
> We have helpers for cgroups and mm but not for the rlimit nproc and in
> the future we will probably want helpers for things like namespaces. For
> those two items and to allow future sharing/inheritance, this patch adds
> two helpers, user_worker_create and user_worker_start that allow callers
> to create threads that copy or inherit the caller's attributes like mm,
> cgroups, namespaces, etc, and are accounted for under the callers rlimits
> nproc value similar to if the caller did a clone() in userspace. However,
> instead of returning to userspace the thread is usable in the kernel for
> modules like vhost or layers like io_uring.

If you are making this a general API it would be good to wrap the called
function the way kthread_create does so that the code in the function
can just return and let the wrapper call do_exit for it, especially if
you are going to have modular users.

There is a lot of deep magic in what happens if a thread created with
kernel_thread returns.  It makes sense to expose that magic to the 1 or
2 callers that use kernel_thread directly.  It does not make sense to
expose to anything higher up and in creating a nice API you are doing
that.

Currently I have just removed all of the modular users of do_exit
and in the process of removing do_exit itself so I am a little more
sensitive to this than I would ordinarily be.  But I think my comment
stands even without my changes you conflict with.

Eric


> [added flag validation code from Christian Brauner's SIG_IGN patch]
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/sched/task.h |  5 +++
>  kernel/fork.c              | 72 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 77 insertions(+)
>
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index f8a658700075..ecb21c0d95ce 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -95,6 +95,11 @@ struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>  extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
>  int kernel_wait(pid_t pid, int *stat);
> +struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
> +				       unsigned long clone_flags,
> +				       u32 worker_flags);
> +__printf(2, 3)
> +void user_worker_start(struct task_struct *tsk, const char namefmt[], ...);
>  
>  extern void free_task(struct task_struct *tsk);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c9152596a285..e72239ae1e08 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2543,6 +2543,78 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  	return copy_process(NULL, 0, node, &args);
>  }
>  
> +static bool user_worker_flags_valid(struct kernel_clone_args *kargs)
> +{
> +	/* Verify that no unknown flags are passed along. */
> +	if (kargs->worker_flags & ~(USER_WORKER_IO | USER_WORKER |
> +				    USER_WORKER_NO_FILES | USER_WORKER_SIG_IGN))
> +		return false;
> +
> +	/*
> +	 * If we're ignoring all signals don't allow sharing struct sighand and
> +	 * don't bother clearing signal handlers.
> +	 */
> +	if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
> +	    (kargs->worker_flags & USER_WORKER_SIG_IGN))
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * user_worker_create - create a copy of a process to be used by the kernel
> + * @fn: thread stack
> + * @arg: data to be passed to fn
> + * @node: numa node to allocate task from
> + * @clone_flags: CLONE flags
> + * @worker_flags: USER_WORKER flags
> + *
> + * This returns a created task, or an error pointer. The returned task is
> + * inactive, and the caller must fire it up through user_worker_start(). If
> + * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
> + */
> +struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
> +				       unsigned long clone_flags,
> +				       u32 worker_flags)
> +{
> +	struct kernel_clone_args args = {
> +		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
> +				   CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
> +		.stack		= (unsigned long)fn,
> +		.stack_size	= (unsigned long)arg,
> +		.worker_flags	= USER_WORKER | worker_flags,
> +	};
> +
> +	if (!user_worker_flags_valid(&args))
> +		return ERR_PTR(-EINVAL);
> +
> +	return copy_process(NULL, 0, node, &args);
> +}
> +EXPORT_SYMBOL_GPL(user_worker_create);
> +
> +/**
> + * user_worker_start - Start a task created with user_worker_create
> + * @tsk: task to wake up
> + * @namefmt: printf-style format string for the thread name
> + * @arg: arguments for @namefmt
> + */
> +void user_worker_start(struct task_struct *tsk, const char namefmt[], ...)
> +{
> +	char name[TASK_COMM_LEN];
> +	va_list args;
> +
> +	WARN_ON(!(tsk->flags & PF_USER_WORKER));
> +
> +	va_start(args, namefmt);
> +	vsnprintf(name, sizeof(name), namefmt, args);
> +	set_task_comm(tsk, name);
> +	va_end(args);
> +
> +	wake_up_new_task(tsk);
> +}
> +EXPORT_SYMBOL_GPL(user_worker_start);
> +
>  /*
>   *  Ok, this is the main fork-routine.
>   *

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 06/10] fork: add helpers to clone a process for kernel use
@ 2021-12-17 18:53     ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 18:53 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner, Christoph Hellwig

Mike Christie <michael.christie@oracle.com> writes:

> The vhost layer is creating kthreads to execute IO and management
> operations. These threads need to share a mm with a userspace thread,
> inherit cgroups, and we would like to have the thread accounted for
> under the userspace thread's rlimit nproc value so a user can't overwhelm
> the system with threads when creating VMs.
>
> We have helpers for cgroups and mm but not for the rlimit nproc and in
> the future we will probably want helpers for things like namespaces. For
> those two items and to allow future sharing/inheritance, this patch adds
> two helpers, user_worker_create and user_worker_start that allow callers
> to create threads that copy or inherit the caller's attributes like mm,
> cgroups, namespaces, etc, and are accounted for under the callers rlimits
> nproc value similar to if the caller did a clone() in userspace. However,
> instead of returning to userspace the thread is usable in the kernel for
> modules like vhost or layers like io_uring.

If you are making this a general API it would be good to wrap the called
function the way kthread_create does so that the code in the function
can just return and let the wrapper call do_exit for it, especially if
you are going to have modular users.

There is a lot of deep magic in what happens if a thread created with
kernel_thread returns.  It makes sense to expose that magic to the 1 or
2 callers that use kernel_thread directly.  It does not make sense to
expose to anything higher up and in creating a nice API you are doing
that.

Currently I have just removed all of the modular users of do_exit
and in the process of removing do_exit itself so I am a little more
sensitive to this than I would ordinarily be.  But I think my comment
stands even without my changes you conflict with.

Eric


> [added flag validation code from Christian Brauner's SIG_IGN patch]
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/sched/task.h |  5 +++
>  kernel/fork.c              | 72 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 77 insertions(+)
>
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index f8a658700075..ecb21c0d95ce 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -95,6 +95,11 @@ struct mm_struct *copy_init_mm(void);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>  extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
>  int kernel_wait(pid_t pid, int *stat);
> +struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
> +				       unsigned long clone_flags,
> +				       u32 worker_flags);
> +__printf(2, 3)
> +void user_worker_start(struct task_struct *tsk, const char namefmt[], ...);
>  
>  extern void free_task(struct task_struct *tsk);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c9152596a285..e72239ae1e08 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2543,6 +2543,78 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  	return copy_process(NULL, 0, node, &args);
>  }
>  
> +static bool user_worker_flags_valid(struct kernel_clone_args *kargs)
> +{
> +	/* Verify that no unknown flags are passed along. */
> +	if (kargs->worker_flags & ~(USER_WORKER_IO | USER_WORKER |
> +				    USER_WORKER_NO_FILES | USER_WORKER_SIG_IGN))
> +		return false;
> +
> +	/*
> +	 * If we're ignoring all signals don't allow sharing struct sighand and
> +	 * don't bother clearing signal handlers.
> +	 */
> +	if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
> +	    (kargs->worker_flags & USER_WORKER_SIG_IGN))
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * user_worker_create - create a copy of a process to be used by the kernel
> + * @fn: thread stack
> + * @arg: data to be passed to fn
> + * @node: numa node to allocate task from
> + * @clone_flags: CLONE flags
> + * @worker_flags: USER_WORKER flags
> + *
> + * This returns a created task, or an error pointer. The returned task is
> + * inactive, and the caller must fire it up through user_worker_start(). If
> + * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
> + */
> +struct task_struct *user_worker_create(int (*fn)(void *), void *arg, int node,
> +				       unsigned long clone_flags,
> +				       u32 worker_flags)
> +{
> +	struct kernel_clone_args args = {
> +		.flags		= ((lower_32_bits(clone_flags) | CLONE_VM |
> +				   CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
> +		.stack		= (unsigned long)fn,
> +		.stack_size	= (unsigned long)arg,
> +		.worker_flags	= USER_WORKER | worker_flags,
> +	};
> +
> +	if (!user_worker_flags_valid(&args))
> +		return ERR_PTR(-EINVAL);
> +
> +	return copy_process(NULL, 0, node, &args);
> +}
> +EXPORT_SYMBOL_GPL(user_worker_create);
> +
> +/**
> + * user_worker_start - Start a task created with user_worker_create
> + * @tsk: task to wake up
> + * @namefmt: printf-style format string for the thread name
> + * @arg: arguments for @namefmt
> + */
> +void user_worker_start(struct task_struct *tsk, const char namefmt[], ...)
> +{
> +	char name[TASK_COMM_LEN];
> +	va_list args;
> +
> +	WARN_ON(!(tsk->flags & PF_USER_WORKER));
> +
> +	va_start(args, namefmt);
> +	vsnprintf(name, sizeof(name), namefmt, args);
> +	set_task_comm(tsk, name);
> +	va_end(args);
> +
> +	wake_up_new_task(tsk);
> +}
> +EXPORT_SYMBOL_GPL(user_worker_start);
> +
>  /*
>   *  Ok, this is the main fork-routine.
>   *
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 10/10] vhost: use user_worker to check RLIMITs
  2021-11-29 19:47   ` Mike Christie
@ 2021-12-17 19:01     ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 19:01 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel,
	Christoph Hellwig

Mike Christie <michael.christie@oracle.com> writes:

> For vhost workers we use the kthread API which inherit's its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked. This patch has us use the user_worker helpers which will
> inherit its values/checks from the thread that owns the device similar to
> if we did a clone in userspace.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/vhost/vhost.c | 65 +++++++++++++++----------------------------
>  drivers/vhost/vhost.h |  7 ++++-
>  2 files changed, 28 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c9a1f706989c..8cf259d798c0 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -22,7 +22,6 @@
>  #include <linux/slab.h>
>  #include <linux/vmalloc.h>
>  #include <linux/kthread.h>
> -#include <linux/cgroup.h>
>  #include <linux/module.h>
>  #include <linux/sort.h>
>  #include <linux/sched/mm.h>
> @@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  static int vhost_worker(void *data)
>  {
>  	struct vhost_worker *worker = data;
> -	struct vhost_dev *dev = worker->dev;
>  	struct vhost_work *work, *work_next;
>  	struct llist_node *node;
>  
> -	kthread_use_mm(dev->mm);
> -
>  	for (;;) {
>  		/* mb paired w/ kthread_stop */
>  		set_current_state(TASK_INTERRUPTIBLE);
>  
> -		if (kthread_should_stop()) {
> +		if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
>  			__set_current_state(TASK_RUNNING);
>  			break;
>  		}
> @@ -376,8 +372,9 @@ static int vhost_worker(void *data)
>  				schedule();
>  		}
>  	}
> -	kthread_unuse_mm(dev->mm);
> -	return 0;
> +
> +	complete(worker->exit_done);
> +	do_exit(0);

This code worries me.

It has the potential for a caller to do:

	vhost_worker_stop()
        module_put();

Then the exiting work thread tries to do:
	do_exit()

Except the code that calls do_exit has already been removed from the
kernel.  Maybe the vhost code can never be removed from the kernel
but otherwise I expect that is possible.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 10/10] vhost: use user_worker to check RLIMITs
@ 2021-12-17 19:01     ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 19:01 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner, Christoph Hellwig

Mike Christie <michael.christie@oracle.com> writes:

> For vhost workers we use the kthread API which inherit's its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked. This patch has us use the user_worker helpers which will
> inherit its values/checks from the thread that owns the device similar to
> if we did a clone in userspace.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/vhost/vhost.c | 65 +++++++++++++++----------------------------
>  drivers/vhost/vhost.h |  7 ++++-
>  2 files changed, 28 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c9a1f706989c..8cf259d798c0 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -22,7 +22,6 @@
>  #include <linux/slab.h>
>  #include <linux/vmalloc.h>
>  #include <linux/kthread.h>
> -#include <linux/cgroup.h>
>  #include <linux/module.h>
>  #include <linux/sort.h>
>  #include <linux/sched/mm.h>
> @@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  static int vhost_worker(void *data)
>  {
>  	struct vhost_worker *worker = data;
> -	struct vhost_dev *dev = worker->dev;
>  	struct vhost_work *work, *work_next;
>  	struct llist_node *node;
>  
> -	kthread_use_mm(dev->mm);
> -
>  	for (;;) {
>  		/* mb paired w/ kthread_stop */
>  		set_current_state(TASK_INTERRUPTIBLE);
>  
> -		if (kthread_should_stop()) {
> +		if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
>  			__set_current_state(TASK_RUNNING);
>  			break;
>  		}
> @@ -376,8 +372,9 @@ static int vhost_worker(void *data)
>  				schedule();
>  		}
>  	}
> -	kthread_unuse_mm(dev->mm);
> -	return 0;
> +
> +	complete(worker->exit_done);
> +	do_exit(0);

This code worries me.

It has the potential for a caller to do:

	vhost_worker_stop()
        module_put();

Then the exiting work thread tries to do:
	do_exit()

Except the code that calls do_exit has already been removed from the
kernel.  Maybe the vhost code can never be removed from the kernel
but otherwise I expect that is possible.

Eric
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-11-29 19:46 ` Mike Christie
@ 2021-12-17 19:26   ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 19:26 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

Mike Christie <michael.christie@oracle.com> writes:

> The following patches made over Linus's tree, allow the vhost layer to do
> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> io_uring does a copy_process against its userspace app. This allows the
> vhost layer's worker threads to inherit cgroups, namespaces, address
> space, etc and this worker thread will also be accounted for against that
> owner/parent process's RLIMIT_NPROC limit.
>
> If you are not familiar with qemu and vhost here is more detailed
> problem description:
>
> Qemu will create vhost devices in the kernel which perform network, SCSI,
> etc IO and management operations from worker threads created by the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
>
> The problem with this approach is that we then have to add new functions/
> args/functionality for every thing we want to inherit. I started doing
> that here:
>
> https://lkml.org/lkml/2021/6/23/1233
>
> for the RLIMIT_NPROC check, but it seems it might be easier to just
> inherit everything from the beginning, becuase I'd need to do something
> like that patch several times.

I read through the code and I don't see why you want to make these
almost threads of a process not actually threads of that process
(like the io_uring threads are).

As a separate process there are many things that will continue to be
disjoint.  If the thread changes cgroups for example your new process
won't follow.

If you want them to be actually processes with an lifetime independent
of the creating process I expect you want to reparent them to the local
init process.  Just so they don't confuse the process tree.  Plus init
processes know how to handle unexpected children.

What are the semantics you are aiming for?

I can see sense in generalizing some of the pieces of create_io_thread
but I think generalizing create_io_thread itself is premature.  The code
lives in kernel/fork.c because it is a very special thing that we want
to keep our eyes on.

Some of your generalization makes it much more difficult to tell what
your code is going to use because you remove hard codes that are there
to simplify the analysis of the situation.

My gut says adding a new create_vhost_worker and putting that in
kernel/fork.c is a lot safer and will allow much better code analysis.

If there a really are commonalities between creating a userspace process
that runs completely in the kernel and creating an additional userspace
thread we refactor the code and simplify things.

I am especially nervous about generalizing the io_uring code as it's
signal handling just barely works, and any generalization will cause it
to break.  So you are in the process of generalizing code that simply
can not handle the general case.  That scares me.

Eric

>
> V6:
> - Rename kernel_worker to user_worker and fix prefixes.
> - Add better patch descriptions.
> V5:
> - Handle kbuild errors by building patchset against current kernel that
>   has all deps merged. Also add patch to remove create_io_thread code as
>   it's not used anymore.
> - Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
>   case added in 5.16-rc1.
> - Add PF_USER_WORKER flag so we can check it later after the initial
>   thread creation for the wake up, vm and singal cses.
> - Added patch to auto reap the worker thread.
> V4:
> - Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
> - Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
>   added the new kernel worker functions.
> - Fixed extra "i" issue.
> - Added PF_USER_WORKER flag and added check that kernel_worker_start users
>   had that flag set. Also dropped patches that passed worker flags to
>   copy_thread and replaced with PF_USER_WORKER check.
> V3:
> - Add parentheses in p->flag and work_flags check in copy_thread.
> - Fix check in arm/arm64 which was doing the reverse of other archs
>   where it did likely(!flags) instead of unlikely(flags).
> V2:
> - Rename kernel_copy_process to kernel_worker.
> - Instead of exporting functions, make kernel_worker() a proper
>   function/API that does common work for the caller.
> - Instead of adding new fields to kernel_clone_args for each option
>   make it flag based similar to CLONE_*.
> - Drop unused completion struct in vhost.
> - Fix compile warnings by merging vhost cgroup cleanup patch and
>   vhost conversion patch.
> ~                              
>
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-17 19:26   ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-17 19:26 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

Mike Christie <michael.christie@oracle.com> writes:

> The following patches made over Linus's tree, allow the vhost layer to do
> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> io_uring does a copy_process against its userspace app. This allows the
> vhost layer's worker threads to inherit cgroups, namespaces, address
> space, etc and this worker thread will also be accounted for against that
> owner/parent process's RLIMIT_NPROC limit.
>
> If you are not familiar with qemu and vhost here is more detailed
> problem description:
>
> Qemu will create vhost devices in the kernel which perform network, SCSI,
> etc IO and management operations from worker threads created by the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
>
> The problem with this approach is that we then have to add new functions/
> args/functionality for every thing we want to inherit. I started doing
> that here:
>
> https://lkml.org/lkml/2021/6/23/1233
>
> for the RLIMIT_NPROC check, but it seems it might be easier to just
> inherit everything from the beginning, becuase I'd need to do something
> like that patch several times.

I read through the code and I don't see why you want to make these
almost threads of a process not actually threads of that process
(like the io_uring threads are).

As a separate process there are many things that will continue to be
disjoint.  If the thread changes cgroups for example your new process
won't follow.

If you want them to be actually processes with an lifetime independent
of the creating process I expect you want to reparent them to the local
init process.  Just so they don't confuse the process tree.  Plus init
processes know how to handle unexpected children.

What are the semantics you are aiming for?

I can see sense in generalizing some of the pieces of create_io_thread
but I think generalizing create_io_thread itself is premature.  The code
lives in kernel/fork.c because it is a very special thing that we want
to keep our eyes on.

Some of your generalization makes it much more difficult to tell what
your code is going to use because you remove hard codes that are there
to simplify the analysis of the situation.

My gut says adding a new create_vhost_worker and putting that in
kernel/fork.c is a lot safer and will allow much better code analysis.

If there a really are commonalities between creating a userspace process
that runs completely in the kernel and creating an additional userspace
thread we refactor the code and simplify things.

I am especially nervous about generalizing the io_uring code as it's
signal handling just barely works, and any generalization will cause it
to break.  So you are in the process of generalizing code that simply
can not handle the general case.  That scares me.

Eric

>
> V6:
> - Rename kernel_worker to user_worker and fix prefixes.
> - Add better patch descriptions.
> V5:
> - Handle kbuild errors by building patchset against current kernel that
>   has all deps merged. Also add patch to remove create_io_thread code as
>   it's not used anymore.
> - Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
>   case added in 5.16-rc1.
> - Add PF_USER_WORKER flag so we can check it later after the initial
>   thread creation for the wake up, vm and singal cses.
> - Added patch to auto reap the worker thread.
> V4:
> - Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
> - Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
>   added the new kernel worker functions.
> - Fixed extra "i" issue.
> - Added PF_USER_WORKER flag and added check that kernel_worker_start users
>   had that flag set. Also dropped patches that passed worker flags to
>   copy_thread and replaced with PF_USER_WORKER check.
> V3:
> - Add parentheses in p->flag and work_flags check in copy_thread.
> - Fix check in arm/arm64 which was doing the reverse of other archs
>   where it did likely(!flags) instead of unlikely(flags).
> V2:
> - Rename kernel_copy_process to kernel_worker.
> - Instead of exporting functions, make kernel_worker() a proper
>   function/API that does common work for the caller.
> - Instead of adding new fields to kernel_clone_args for each option
>   make it flag based similar to CLONE_*.
> - Drop unused completion struct in vhost.
> - Fix compile warnings by merging vhost cgroup cleanup patch and
>   vhost conversion patch.
> ~                              
>
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-17 19:26   ` Eric W. Biederman
@ 2021-12-17 22:08     ` michael.christie
  -1 siblings, 0 replies; 57+ messages in thread
From: michael.christie @ 2021-12-17 22:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 12/17/21 1:26 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> The following patches made over Linus's tree, allow the vhost layer to do
>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>> io_uring does a copy_process against its userspace app. This allows the
>> vhost layer's worker threads to inherit cgroups, namespaces, address
>> space, etc and this worker thread will also be accounted for against that
>> owner/parent process's RLIMIT_NPROC limit.
>>
>> If you are not familiar with qemu and vhost here is more detailed
>> problem description:
>>
>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>> etc IO and management operations from worker threads created by the
>> kthread API. Because the kthread API does a copy_process on the kthreadd
>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>> thread's cgroups.
>>
>> The problem with this approach is that we then have to add new functions/
>> args/functionality for every thing we want to inherit. I started doing
>> that here:
>>
>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>
>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>> inherit everything from the beginning, becuase I'd need to do something
>> like that patch several times.
> 
> I read through the code and I don't see why you want to make these
> almost threads of a process not actually threads of that process
> (like the io_uring threads are).
> 
> As a separate process there are many things that will continue to be
> disjoint.  If the thread changes cgroups for example your new process
> won't follow.
> 
> If you want them to be actually processes with an lifetime independent
> of the creating process I expect you want to reparent them to the local
> init process.  Just so they don't confuse the process tree.  Plus init
> processes know how to handle unexpected children.
> 
> What are the semantics you are aiming for?
> 

Hi Eric,

Right now, for vhost we need the thread being created:

1. added to the caller's cgroup.
2. to share the mm struct with the caller.
3. to be accounted for under the caller's nproc rlimit value.

For 1 and 2, we have cgroup_attach_task_all and get_task_mm
already.

This patchset started with me just trying to handle #3 by modifying kthreads
like here:

https://lkml.org/lkml/2021/6/23/1234

So we can use kthreads and the existing helpers and add:

A. a ucounts version of the above patches in the link

or

B. a helper that does something like copy_process's use of
is_ucounts_overlimit and vhost can call that.

instead of this patchset.


Before we even get to the next section below, do you consider items 1 - 3
something we need an API based on copy_process for?

Do you think I should just do A or B above, or do you have another idea? If
so can we get agreement on that from everyone?

I thought my patches in that link were a little hacky in how they passed
around the user/creds info. I thought maybe it shouldn't be passed around like
that, so switched to the copy_process based approach which did everything for
me. And I thought io_uring needed something similar as us so I made it generic.

I don't have a preference. You and Christian are the experts, so I'll leave it
to you guys.


> I can see sense in generalizing some of the pieces of create_io_thread
> but I think generalizing create_io_thread itself is premature.  The code
> lives in kernel/fork.c because it is a very special thing that we want
> to keep our eyes on.
> 
> Some of your generalization makes it much more difficult to tell what
> your code is going to use because you remove hard codes that are there
> to simplify the analysis of the situation.
> 
> My gut says adding a new create_vhost_worker and putting that in
> kernel/fork.c is a lot safer and will allow much better code analysis.
> 
> If there a really are commonalities between creating a userspace process
> that runs completely in the kernel and creating an additional userspace
> thread we refactor the code and simplify things.
> 
> I am especially nervous about generalizing the io_uring code as it's
> signal handling just barely works, and any generalization will cause it
> to break.  So you are in the process of generalizing code that simply
> can not handle the general case.  That scares me

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-17 22:08     ` michael.christie
  0 siblings, 0 replies; 57+ messages in thread
From: michael.christie @ 2021-12-17 22:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

On 12/17/21 1:26 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> The following patches made over Linus's tree, allow the vhost layer to do
>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>> io_uring does a copy_process against its userspace app. This allows the
>> vhost layer's worker threads to inherit cgroups, namespaces, address
>> space, etc and this worker thread will also be accounted for against that
>> owner/parent process's RLIMIT_NPROC limit.
>>
>> If you are not familiar with qemu and vhost here is more detailed
>> problem description:
>>
>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>> etc IO and management operations from worker threads created by the
>> kthread API. Because the kthread API does a copy_process on the kthreadd
>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>> thread's cgroups.
>>
>> The problem with this approach is that we then have to add new functions/
>> args/functionality for every thing we want to inherit. I started doing
>> that here:
>>
>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>
>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>> inherit everything from the beginning, becuase I'd need to do something
>> like that patch several times.
> 
> I read through the code and I don't see why you want to make these
> almost threads of a process not actually threads of that process
> (like the io_uring threads are).
> 
> As a separate process there are many things that will continue to be
> disjoint.  If the thread changes cgroups for example your new process
> won't follow.
> 
> If you want them to be actually processes with an lifetime independent
> of the creating process I expect you want to reparent them to the local
> init process.  Just so they don't confuse the process tree.  Plus init
> processes know how to handle unexpected children.
> 
> What are the semantics you are aiming for?
> 

Hi Eric,

Right now, for vhost we need the thread being created:

1. added to the caller's cgroup.
2. to share the mm struct with the caller.
3. to be accounted for under the caller's nproc rlimit value.

For 1 and 2, we have cgroup_attach_task_all and get_task_mm
already.

This patchset started with me just trying to handle #3 by modifying kthreads
like here:

https://lkml.org/lkml/2021/6/23/1234

So we can use kthreads and the existing helpers and add:

A. a ucounts version of the above patches in the link

or

B. a helper that does something like copy_process's use of
is_ucounts_overlimit and vhost can call that.

instead of this patchset.


Before we even get to the next section below, do you consider items 1 - 3
something we need an API based on copy_process for?

Do you think I should just do A or B above, or do you have another idea? If
so can we get agreement on that from everyone?

I thought my patches in that link were a little hacky in how they passed
around the user/creds info. I thought maybe it shouldn't be passed around like
that, so switched to the copy_process based approach which did everything for
me. And I thought io_uring needed something similar as us so I made it generic.

I don't have a preference. You and Christian are the experts, so I'll leave it
to you guys.


> I can see sense in generalizing some of the pieces of create_io_thread
> but I think generalizing create_io_thread itself is premature.  The code
> lives in kernel/fork.c because it is a very special thing that we want
> to keep our eyes on.
> 
> Some of your generalization makes it much more difficult to tell what
> your code is going to use because you remove hard codes that are there
> to simplify the analysis of the situation.
> 
> My gut says adding a new create_vhost_worker and putting that in
> kernel/fork.c is a lot safer and will allow much better code analysis.
> 
> If there a really are commonalities between creating a userspace process
> that runs completely in the kernel and creating an additional userspace
> thread we refactor the code and simplify things.
> 
> I am especially nervous about generalizing the io_uring code as it's
> signal handling just barely works, and any generalization will cause it
> to break.  So you are in the process of generalizing code that simply
> can not handle the general case.  That scares me
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-17 22:08     ` michael.christie
@ 2021-12-22  0:20       ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-22  0:20 UTC (permalink / raw)
  To: michael.christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

michael.christie@oracle.com writes:

> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>> Mike Christie <michael.christie@oracle.com> writes:
>> 
>>> The following patches made over Linus's tree, allow the vhost layer to do
>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>> io_uring does a copy_process against its userspace app. This allows the
>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>> space, etc and this worker thread will also be accounted for against that
>>> owner/parent process's RLIMIT_NPROC limit.
>>>
>>> If you are not familiar with qemu and vhost here is more detailed
>>> problem description:
>>>
>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>> etc IO and management operations from worker threads created by the
>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>> thread's cgroups.
>>>
>>> The problem with this approach is that we then have to add new functions/
>>> args/functionality for every thing we want to inherit. I started doing
>>> that here:
>>>
>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>>
>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>> inherit everything from the beginning, becuase I'd need to do something
>>> like that patch several times.
>> 
>> I read through the code and I don't see why you want to make these
>> almost threads of a process not actually threads of that process
>> (like the io_uring threads are).
>> 
>> As a separate process there are many things that will continue to be
>> disjoint.  If the thread changes cgroups for example your new process
>> won't follow.
>> 
>> If you want them to be actually processes with an lifetime independent
>> of the creating process I expect you want to reparent them to the local
>> init process.  Just so they don't confuse the process tree.  Plus init
>> processes know how to handle unexpected children.
>> 
>> What are the semantics you are aiming for?
>> 
>
> Hi Eric,
>
> Right now, for vhost we need the thread being created:
>
> 1. added to the caller's cgroup.
> 2. to share the mm struct with the caller.
> 3. to be accounted for under the caller's nproc rlimit value.
>
> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
> already.
>
> This patchset started with me just trying to handle #3 by modifying kthreads
> like here:
>
> https://lkml.org/lkml/2021/6/23/1234
>
> So we can use kthreads and the existing helpers and add:
>
> A. a ucounts version of the above patches in the link
>
> or
>
> B. a helper that does something like copy_process's use of
> is_ucounts_overlimit and vhost can call that.
>
> instead of this patchset.

I don't fundamentally hate the patchset.  I do have concerns about
the completely broken patch.

With respect this patchset my gut says decide.  Are you a thread of the
process (just use create_io_worker) are you a separate process forked
from the caller (use a cousin of create_io_worker but don't touch
create_io_worker).  I think being a process vs being a thread is such a
fundamental difference we don't want to mix the helpers.

> Before we even get to the next section below, do you consider items 1 - 3
> something we need an API based on copy_process for?

I think 3 staying in the callers nproc strongly suggests you want to
reuse copy_process.  Which gets back to my question do you want
a thread or do you want a process.


For me a key detail is what is the lifetime of the vhost device?

Does the vhost go away when the caller goes away?

  If so you can create a thread in the caller's process that only performs
  work in kernel space.  At which point you are essentially
  create_io_thread.

If the vhost device can live after the caller goes away how is that managed?
  Especially when you are tied to the caller's mm.

  If your device lives on after the caller then you should be a separate
  process forked from the caller.


> Do you think I should just do A or B above, or do you have another idea? If
> so can we get agreement on that from everyone?

Like I said.  I don't hate this patchset.  I think you just tried to be
a little too general.


With kthreads you are fighting very hard to get something that is not
part of the process tree to act like it is part of the process tree.
Which strongly suggests you should be part of the process tree.

As long as you don't have important state you will need to fight to get
separate from the process.

If you do have important state that needs to not come from the process
figuring out how to make a kthread work may be safer.

I don't currently know vhost devices well enough to answer that question.

> I thought my patches in that link were a little hacky in how they passed
> around the user/creds info. I thought maybe it shouldn't be passed around like
> that, so switched to the copy_process based approach which did everything for
> me. And I thought io_uring needed something similar as us so I made it generic.
>
> I don't have a preference. You and Christian are the experts, so I'll leave it
> to you guys.

I hope this gives you some useful direction.

Eric


>> I can see sense in generalizing some of the pieces of create_io_thread
>> but I think generalizing create_io_thread itself is premature.  The code
>> lives in kernel/fork.c because it is a very special thing that we want
>> to keep our eyes on.
>> 
>> Some of your generalization makes it much more difficult to tell what
>> your code is going to use because you remove hard codes that are there
>> to simplify the analysis of the situation.
>> 
>> My gut says adding a new create_vhost_worker and putting that in
>> kernel/fork.c is a lot safer and will allow much better code analysis.
>> 
>> If there a really are commonalities between creating a userspace process
>> that runs completely in the kernel and creating an additional userspace
>> thread we refactor the code and simplify things.
>> 
>> I am especially nervous about generalizing the io_uring code as it's
>> signal handling just barely works, and any generalization will cause it
>> to break.  So you are in the process of generalizing code that simply
>> can not handle the general case.  That scares me

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-22  0:20       ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-22  0:20 UTC (permalink / raw)
  To: michael.christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

michael.christie@oracle.com writes:

> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>> Mike Christie <michael.christie@oracle.com> writes:
>> 
>>> The following patches made over Linus's tree, allow the vhost layer to do
>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>> io_uring does a copy_process against its userspace app. This allows the
>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>> space, etc and this worker thread will also be accounted for against that
>>> owner/parent process's RLIMIT_NPROC limit.
>>>
>>> If you are not familiar with qemu and vhost here is more detailed
>>> problem description:
>>>
>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>> etc IO and management operations from worker threads created by the
>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>> thread's cgroups.
>>>
>>> The problem with this approach is that we then have to add new functions/
>>> args/functionality for every thing we want to inherit. I started doing
>>> that here:
>>>
>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>>
>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>> inherit everything from the beginning, becuase I'd need to do something
>>> like that patch several times.
>> 
>> I read through the code and I don't see why you want to make these
>> almost threads of a process not actually threads of that process
>> (like the io_uring threads are).
>> 
>> As a separate process there are many things that will continue to be
>> disjoint.  If the thread changes cgroups for example your new process
>> won't follow.
>> 
>> If you want them to be actually processes with an lifetime independent
>> of the creating process I expect you want to reparent them to the local
>> init process.  Just so they don't confuse the process tree.  Plus init
>> processes know how to handle unexpected children.
>> 
>> What are the semantics you are aiming for?
>> 
>
> Hi Eric,
>
> Right now, for vhost we need the thread being created:
>
> 1. added to the caller's cgroup.
> 2. to share the mm struct with the caller.
> 3. to be accounted for under the caller's nproc rlimit value.
>
> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
> already.
>
> This patchset started with me just trying to handle #3 by modifying kthreads
> like here:
>
> https://lkml.org/lkml/2021/6/23/1234
>
> So we can use kthreads and the existing helpers and add:
>
> A. a ucounts version of the above patches in the link
>
> or
>
> B. a helper that does something like copy_process's use of
> is_ucounts_overlimit and vhost can call that.
>
> instead of this patchset.

I don't fundamentally hate the patchset.  I do have concerns about
the completely broken patch.

With respect this patchset my gut says decide.  Are you a thread of the
process (just use create_io_worker) are you a separate process forked
from the caller (use a cousin of create_io_worker but don't touch
create_io_worker).  I think being a process vs being a thread is such a
fundamental difference we don't want to mix the helpers.

> Before we even get to the next section below, do you consider items 1 - 3
> something we need an API based on copy_process for?

I think 3 staying in the callers nproc strongly suggests you want to
reuse copy_process.  Which gets back to my question do you want
a thread or do you want a process.


For me a key detail is what is the lifetime of the vhost device?

Does the vhost go away when the caller goes away?

  If so you can create a thread in the caller's process that only performs
  work in kernel space.  At which point you are essentially
  create_io_thread.

If the vhost device can live after the caller goes away how is that managed?
  Especially when you are tied to the caller's mm.

  If your device lives on after the caller then you should be a separate
  process forked from the caller.


> Do you think I should just do A or B above, or do you have another idea? If
> so can we get agreement on that from everyone?

Like I said.  I don't hate this patchset.  I think you just tried to be
a little too general.


With kthreads you are fighting very hard to get something that is not
part of the process tree to act like it is part of the process tree.
Which strongly suggests you should be part of the process tree.

As long as you don't have important state you will need to fight to get
separate from the process.

If you do have important state that needs to not come from the process
figuring out how to make a kthread work may be safer.

I don't currently know vhost devices well enough to answer that question.

> I thought my patches in that link were a little hacky in how they passed
> around the user/creds info. I thought maybe it shouldn't be passed around like
> that, so switched to the copy_process based approach which did everything for
> me. And I thought io_uring needed something similar as us so I made it generic.
>
> I don't have a preference. You and Christian are the experts, so I'll leave it
> to you guys.

I hope this gives you some useful direction.

Eric


>> I can see sense in generalizing some of the pieces of create_io_thread
>> but I think generalizing create_io_thread itself is premature.  The code
>> lives in kernel/fork.c because it is a very special thing that we want
>> to keep our eyes on.
>> 
>> Some of your generalization makes it much more difficult to tell what
>> your code is going to use because you remove hard codes that are there
>> to simplify the analysis of the situation.
>> 
>> My gut says adding a new create_vhost_worker and putting that in
>> kernel/fork.c is a lot safer and will allow much better code analysis.
>> 
>> If there a really are commonalities between creating a userspace process
>> that runs completely in the kernel and creating an additional userspace
>> thread we refactor the code and simplify things.
>> 
>> I am especially nervous about generalizing the io_uring code as it's
>> signal handling just barely works, and any generalization will cause it
>> to break.  So you are in the process of generalizing code that simply
>> can not handle the general case.  That scares me
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-22  0:20       ` Eric W. Biederman
@ 2021-12-22 17:32         ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-12-22 17:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 12/21/21 6:20 PM, Eric W. Biederman wrote:
> michael.christie@oracle.com writes:
> 
>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>>> Mike Christie <michael.christie@oracle.com> writes:
>>>
>>>> The following patches made over Linus's tree, allow the vhost layer to do
>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>>> io_uring does a copy_process against its userspace app. This allows the
>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>>> space, etc and this worker thread will also be accounted for against that
>>>> owner/parent process's RLIMIT_NPROC limit.
>>>>
>>>> If you are not familiar with qemu and vhost here is more detailed
>>>> problem description:
>>>>
>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>>> etc IO and management operations from worker threads created by the
>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>>> thread's cgroups.
>>>>
>>>> The problem with this approach is that we then have to add new functions/
>>>> args/functionality for every thing we want to inherit. I started doing
>>>> that here:
>>>>
>>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>>>
>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>>> inherit everything from the beginning, becuase I'd need to do something
>>>> like that patch several times.
>>>
>>> I read through the code and I don't see why you want to make these
>>> almost threads of a process not actually threads of that process
>>> (like the io_uring threads are).
>>>
>>> As a separate process there are many things that will continue to be
>>> disjoint.  If the thread changes cgroups for example your new process
>>> won't follow.
>>>
>>> If you want them to be actually processes with an lifetime independent
>>> of the creating process I expect you want to reparent them to the local
>>> init process.  Just so they don't confuse the process tree.  Plus init
>>> processes know how to handle unexpected children.
>>>
>>> What are the semantics you are aiming for?
>>>
>>
>> Hi Eric,
>>
>> Right now, for vhost we need the thread being created:
>>
>> 1. added to the caller's cgroup.
>> 2. to share the mm struct with the caller.
>> 3. to be accounted for under the caller's nproc rlimit value.
>>
>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
>> already.
>>
>> This patchset started with me just trying to handle #3 by modifying kthreads
>> like here:
>>
>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$ 
>>
>> So we can use kthreads and the existing helpers and add:
>>
>> A. a ucounts version of the above patches in the link
>>
>> or
>>
>> B. a helper that does something like copy_process's use of
>> is_ucounts_overlimit and vhost can call that.
>>
>> instead of this patchset.
> 
> I don't fundamentally hate the patchset.  I do have concerns about
> the completely broken patch.
> 
> With respect this patchset my gut says decide.  Are you a thread of the
> process (just use create_io_worker) are you a separate process forked
> from the caller (use a cousin of create_io_worker but don't touch
> create_io_worker).  I think being a process vs being a thread is such a
> fundamental difference we don't want to mix the helpers.
> 
>> Before we even get to the next section below, do you consider items 1 - 3
>> something we need an API based on copy_process for?
> 
> I think 3 staying in the callers nproc strongly suggests you want to
> reuse copy_process.  Which gets back to my question do you want
> a thread or do you want a process.
> 
> 
> For me a key detail is what is the lifetime of the vhost device?
> 
> Does the vhost go away when the caller goes away?

Yes. When the caller, normally qemu in our case, that created the worker
thread exits, then we free the vhost devices and stop and free the worker
threads we are creating in this patchset.

However, I'm not sure if it makes a difference to you, but we also have second
way to free a vhost device and its worker thread. The user can run a command
that instructs the the qemu process to free the vhost device and its worker
thread.

> 
>   If so you can create a thread in the caller's process that only performs
>   work in kernel space.  At which point you are essentially
>   create_io_thread.
> 
> If the vhost device can live after the caller goes away how is that managed?

When the caller goes away we free the devices and their worker threads.

Either before the caller exists it does an explicit close to release the device
which frees the device and its worker thread, or when the process exits and the
kernel does a put on its open devices that will trigger the vhost device's release
function and we free device and its thread at that time.


>   Especially when you are tied to the caller's mm.
> 
>   If your device lives on after the caller then you should be a separate
>   process forked from the caller.
> 
> 
>> Do you think I should just do A or B above, or do you have another idea? If
>> so can we get agreement on that from everyone?
> 
> Like I said.  I don't hate this patchset.  I think you just tried to be
> a little too general.
> > 
> With kthreads you are fighting very hard to get something that is not
> part of the process tree to act like it is part of the process tree.
> Which strongly suggests you should be part of the process tree.
> 
> As long as you don't have important state you will need to fight to get
> separate from the process.
> 
> If you do have important state that needs to not come from the process
> figuring out how to make a kthread work may be safer.
> 
> I don't currently know vhost devices well enough to answer that question.
> 
>> I thought my patches in that link were a little hacky in how they passed
>> around the user/creds info. I thought maybe it shouldn't be passed around like
>> that, so switched to the copy_process based approach which did everything for
>> me. And I thought io_uring needed something similar as us so I made it generic.
>>
>> I don't have a preference. You and Christian are the experts, so I'll leave it
>> to you guys.
> 
> I hope this gives you some useful direction.
> 
> Eric
> 
> 
>>> I can see sense in generalizing some of the pieces of create_io_thread
>>> but I think generalizing create_io_thread itself is premature.  The code
>>> lives in kernel/fork.c because it is a very special thing that we want
>>> to keep our eyes on.
>>>
>>> Some of your generalization makes it much more difficult to tell what
>>> your code is going to use because you remove hard codes that are there
>>> to simplify the analysis of the situation.
>>>
>>> My gut says adding a new create_vhost_worker and putting that in
>>> kernel/fork.c is a lot safer and will allow much better code analysis.
>>>
>>> If there a really are commonalities between creating a userspace process
>>> that runs completely in the kernel and creating an additional userspace
>>> thread we refactor the code and simplify things.
>>>
>>> I am especially nervous about generalizing the io_uring code as it's
>>> signal handling just barely works, and any generalization will cause it
>>> to break.  So you are in the process of generalizing code that simply
>>> can not handle the general case.  That scares me


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-22 17:32         ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2021-12-22 17:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

On 12/21/21 6:20 PM, Eric W. Biederman wrote:
> michael.christie@oracle.com writes:
> 
>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>>> Mike Christie <michael.christie@oracle.com> writes:
>>>
>>>> The following patches made over Linus's tree, allow the vhost layer to do
>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>>> io_uring does a copy_process against its userspace app. This allows the
>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>>> space, etc and this worker thread will also be accounted for against that
>>>> owner/parent process's RLIMIT_NPROC limit.
>>>>
>>>> If you are not familiar with qemu and vhost here is more detailed
>>>> problem description:
>>>>
>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>>> etc IO and management operations from worker threads created by the
>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>>> thread's cgroups.
>>>>
>>>> The problem with this approach is that we then have to add new functions/
>>>> args/functionality for every thing we want to inherit. I started doing
>>>> that here:
>>>>
>>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>>>
>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>>> inherit everything from the beginning, becuase I'd need to do something
>>>> like that patch several times.
>>>
>>> I read through the code and I don't see why you want to make these
>>> almost threads of a process not actually threads of that process
>>> (like the io_uring threads are).
>>>
>>> As a separate process there are many things that will continue to be
>>> disjoint.  If the thread changes cgroups for example your new process
>>> won't follow.
>>>
>>> If you want them to be actually processes with an lifetime independent
>>> of the creating process I expect you want to reparent them to the local
>>> init process.  Just so they don't confuse the process tree.  Plus init
>>> processes know how to handle unexpected children.
>>>
>>> What are the semantics you are aiming for?
>>>
>>
>> Hi Eric,
>>
>> Right now, for vhost we need the thread being created:
>>
>> 1. added to the caller's cgroup.
>> 2. to share the mm struct with the caller.
>> 3. to be accounted for under the caller's nproc rlimit value.
>>
>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
>> already.
>>
>> This patchset started with me just trying to handle #3 by modifying kthreads
>> like here:
>>
>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$ 
>>
>> So we can use kthreads and the existing helpers and add:
>>
>> A. a ucounts version of the above patches in the link
>>
>> or
>>
>> B. a helper that does something like copy_process's use of
>> is_ucounts_overlimit and vhost can call that.
>>
>> instead of this patchset.
> 
> I don't fundamentally hate the patchset.  I do have concerns about
> the completely broken patch.
> 
> With respect this patchset my gut says decide.  Are you a thread of the
> process (just use create_io_worker) are you a separate process forked
> from the caller (use a cousin of create_io_worker but don't touch
> create_io_worker).  I think being a process vs being a thread is such a
> fundamental difference we don't want to mix the helpers.
> 
>> Before we even get to the next section below, do you consider items 1 - 3
>> something we need an API based on copy_process for?
> 
> I think 3 staying in the callers nproc strongly suggests you want to
> reuse copy_process.  Which gets back to my question do you want
> a thread or do you want a process.
> 
> 
> For me a key detail is what is the lifetime of the vhost device?
> 
> Does the vhost go away when the caller goes away?

Yes. When the caller, normally qemu in our case, that created the worker
thread exits, then we free the vhost devices and stop and free the worker
threads we are creating in this patchset.

However, I'm not sure if it makes a difference to you, but we also have second
way to free a vhost device and its worker thread. The user can run a command
that instructs the the qemu process to free the vhost device and its worker
thread.

> 
>   If so you can create a thread in the caller's process that only performs
>   work in kernel space.  At which point you are essentially
>   create_io_thread.
> 
> If the vhost device can live after the caller goes away how is that managed?

When the caller goes away we free the devices and their worker threads.

Either before the caller exists it does an explicit close to release the device
which frees the device and its worker thread, or when the process exits and the
kernel does a put on its open devices that will trigger the vhost device's release
function and we free device and its thread at that time.


>   Especially when you are tied to the caller's mm.
> 
>   If your device lives on after the caller then you should be a separate
>   process forked from the caller.
> 
> 
>> Do you think I should just do A or B above, or do you have another idea? If
>> so can we get agreement on that from everyone?
> 
> Like I said.  I don't hate this patchset.  I think you just tried to be
> a little too general.
> > 
> With kthreads you are fighting very hard to get something that is not
> part of the process tree to act like it is part of the process tree.
> Which strongly suggests you should be part of the process tree.
> 
> As long as you don't have important state you will need to fight to get
> separate from the process.
> 
> If you do have important state that needs to not come from the process
> figuring out how to make a kthread work may be safer.
> 
> I don't currently know vhost devices well enough to answer that question.
> 
>> I thought my patches in that link were a little hacky in how they passed
>> around the user/creds info. I thought maybe it shouldn't be passed around like
>> that, so switched to the copy_process based approach which did everything for
>> me. And I thought io_uring needed something similar as us so I made it generic.
>>
>> I don't have a preference. You and Christian are the experts, so I'll leave it
>> to you guys.
> 
> I hope this gives you some useful direction.
> 
> Eric
> 
> 
>>> I can see sense in generalizing some of the pieces of create_io_thread
>>> but I think generalizing create_io_thread itself is premature.  The code
>>> lives in kernel/fork.c because it is a very special thing that we want
>>> to keep our eyes on.
>>>
>>> Some of your generalization makes it much more difficult to tell what
>>> your code is going to use because you remove hard codes that are there
>>> to simplify the analysis of the situation.
>>>
>>> My gut says adding a new create_vhost_worker and putting that in
>>> kernel/fork.c is a lot safer and will allow much better code analysis.
>>>
>>> If there a really are commonalities between creating a userspace process
>>> that runs completely in the kernel and creating an additional userspace
>>> thread we refactor the code and simplify things.
>>>
>>> I am especially nervous about generalizing the io_uring code as it's
>>> signal handling just barely works, and any generalization will cause it
>>> to break.  So you are in the process of generalizing code that simply
>>> can not handle the general case.  That scares me

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-22 17:32         ` Mike Christie
@ 2021-12-22 18:24           ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-22 18:24 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

Mike Christie <michael.christie@oracle.com> writes:

> On 12/21/21 6:20 PM, Eric W. Biederman wrote:
>> michael.christie@oracle.com writes:
>> 
>>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>>>> Mike Christie <michael.christie@oracle.com> writes:
>>>>
>>>>> The following patches made over Linus's tree, allow the vhost layer to do
>>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>>>> io_uring does a copy_process against its userspace app. This allows the
>>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>>>> space, etc and this worker thread will also be accounted for against that
>>>>> owner/parent process's RLIMIT_NPROC limit.
>>>>>
>>>>> If you are not familiar with qemu and vhost here is more detailed
>>>>> problem description:
>>>>>
>>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>>>> etc IO and management operations from worker threads created by the
>>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>>>> thread's cgroups.
>>>>>
>>>>> The problem with this approach is that we then have to add new functions/
>>>>> args/functionality for every thing we want to inherit. I started doing
>>>>> that here:
>>>>>
>>>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>>>>
>>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>>>> inherit everything from the beginning, becuase I'd need to do something
>>>>> like that patch several times.
>>>>
>>>> I read through the code and I don't see why you want to make these
>>>> almost threads of a process not actually threads of that process
>>>> (like the io_uring threads are).
>>>>
>>>> As a separate process there are many things that will continue to be
>>>> disjoint.  If the thread changes cgroups for example your new process
>>>> won't follow.
>>>>
>>>> If you want them to be actually processes with an lifetime independent
>>>> of the creating process I expect you want to reparent them to the local
>>>> init process.  Just so they don't confuse the process tree.  Plus init
>>>> processes know how to handle unexpected children.
>>>>
>>>> What are the semantics you are aiming for?
>>>>
>>>
>>> Hi Eric,
>>>
>>> Right now, for vhost we need the thread being created:
>>>
>>> 1. added to the caller's cgroup.
>>> 2. to share the mm struct with the caller.
>>> 3. to be accounted for under the caller's nproc rlimit value.
>>>
>>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
>>> already.
>>>
>>> This patchset started with me just trying to handle #3 by modifying kthreads
>>> like here:
>>>
>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$ 
>>>
>>> So we can use kthreads and the existing helpers and add:
>>>
>>> A. a ucounts version of the above patches in the link
>>>
>>> or
>>>
>>> B. a helper that does something like copy_process's use of
>>> is_ucounts_overlimit and vhost can call that.
>>>
>>> instead of this patchset.
>> 
>> I don't fundamentally hate the patchset.  I do have concerns about
>> the completely broken patch.
>> 
>> With respect this patchset my gut says decide.  Are you a thread of the
>> process (just use create_io_worker) are you a separate process forked
>> from the caller (use a cousin of create_io_worker but don't touch
>> create_io_worker).  I think being a process vs being a thread is such a
>> fundamental difference we don't want to mix the helpers.
>> 
>>> Before we even get to the next section below, do you consider items 1 - 3
>>> something we need an API based on copy_process for?
>> 
>> I think 3 staying in the callers nproc strongly suggests you want to
>> reuse copy_process.  Which gets back to my question do you want
>> a thread or do you want a process.
>> 
>> 
>> For me a key detail is what is the lifetime of the vhost device?
>> 
>> Does the vhost go away when the caller goes away?
>
> Yes. When the caller, normally qemu in our case, that created the worker
> thread exits, then we free the vhost devices and stop and free the worker
> threads we are creating in this patchset.
>
> However, I'm not sure if it makes a difference to you, but we also have second
> way to free a vhost device and its worker thread. The user can run a command
> that instructs the the qemu process to free the vhost device and its worker
> thread.

I dug a little deeper to understand how this works, and it appears to be
a standard file descriptor based API.  The last close of the file
descriptor is what causes the vhost_dev_cleanup to be called which shuts
down the thread.

This means that in rare cases the file descriptor can be passed to
another process and be held open there, even after the main process
exits.

This says to me that much as it might be handy your thread does not
strictly share the same lifetime as your qemu process.


>>   If so you can create a thread in the caller's process that only performs
>>   work in kernel space.  At which point you are essentially
>>   create_io_thread.
>> 
>> If the vhost device can live after the caller goes away how is that managed?
>
> When the caller goes away we free the devices and their worker threads.
>
> Either before the caller exists it does an explicit close to release the device
> which frees the device and its worker thread, or when the process exits and the
> kernel does a put on its open devices that will trigger the vhost device's release
> function and we free device and its thread at that time.

All of which says to me that the vhost devices semantically work well as
separate processes (that never run userspace code) not as threads of the
creating userspace process.

So I would recommend creating a minimal version of the kthread api,
using create_process targeted only at the vhost case.  Essentially what
you have done with this patchset, but without any configuration knobs
from the callers perspective.

Which means that you can hard code calling ignore_signals, and the
like, instead of needing to have a separate configuration knob for each
place io_workers are different from vhost_workers.

In the future I can see io_workers evolving into a general user space
thread that only runs code in the kernel abstraction, and I can see
vhost_workers evolving into a general user space process that only runs
code in the kernel abstraction.

For now we don't need that generality so please just create a
create_vhost_process helper akin to create_io_thread that does just what
you need.

I don't know if it is better to base it on kernel_clone or on
copy_process.  All I am certain of is that you need to set
"args->exit_signal = -1;".  This prevents having to play games with
do_notify_parent.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-22 18:24           ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2021-12-22 18:24 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

Mike Christie <michael.christie@oracle.com> writes:

> On 12/21/21 6:20 PM, Eric W. Biederman wrote:
>> michael.christie@oracle.com writes:
>> 
>>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
>>>> Mike Christie <michael.christie@oracle.com> writes:
>>>>
>>>>> The following patches made over Linus's tree, allow the vhost layer to do
>>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>>>>> io_uring does a copy_process against its userspace app. This allows the
>>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
>>>>> space, etc and this worker thread will also be accounted for against that
>>>>> owner/parent process's RLIMIT_NPROC limit.
>>>>>
>>>>> If you are not familiar with qemu and vhost here is more detailed
>>>>> problem description:
>>>>>
>>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>>>>> etc IO and management operations from worker threads created by the
>>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
>>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>>>>> thread's cgroups.
>>>>>
>>>>> The problem with this approach is that we then have to add new functions/
>>>>> args/functionality for every thing we want to inherit. I started doing
>>>>> that here:
>>>>>
>>>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
>>>>>
>>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>>>>> inherit everything from the beginning, becuase I'd need to do something
>>>>> like that patch several times.
>>>>
>>>> I read through the code and I don't see why you want to make these
>>>> almost threads of a process not actually threads of that process
>>>> (like the io_uring threads are).
>>>>
>>>> As a separate process there are many things that will continue to be
>>>> disjoint.  If the thread changes cgroups for example your new process
>>>> won't follow.
>>>>
>>>> If you want them to be actually processes with an lifetime independent
>>>> of the creating process I expect you want to reparent them to the local
>>>> init process.  Just so they don't confuse the process tree.  Plus init
>>>> processes know how to handle unexpected children.
>>>>
>>>> What are the semantics you are aiming for?
>>>>
>>>
>>> Hi Eric,
>>>
>>> Right now, for vhost we need the thread being created:
>>>
>>> 1. added to the caller's cgroup.
>>> 2. to share the mm struct with the caller.
>>> 3. to be accounted for under the caller's nproc rlimit value.
>>>
>>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
>>> already.
>>>
>>> This patchset started with me just trying to handle #3 by modifying kthreads
>>> like here:
>>>
>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$ 
>>>
>>> So we can use kthreads and the existing helpers and add:
>>>
>>> A. a ucounts version of the above patches in the link
>>>
>>> or
>>>
>>> B. a helper that does something like copy_process's use of
>>> is_ucounts_overlimit and vhost can call that.
>>>
>>> instead of this patchset.
>> 
>> I don't fundamentally hate the patchset.  I do have concerns about
>> the completely broken patch.
>> 
>> With respect this patchset my gut says decide.  Are you a thread of the
>> process (just use create_io_worker) are you a separate process forked
>> from the caller (use a cousin of create_io_worker but don't touch
>> create_io_worker).  I think being a process vs being a thread is such a
>> fundamental difference we don't want to mix the helpers.
>> 
>>> Before we even get to the next section below, do you consider items 1 - 3
>>> something we need an API based on copy_process for?
>> 
>> I think 3 staying in the callers nproc strongly suggests you want to
>> reuse copy_process.  Which gets back to my question do you want
>> a thread or do you want a process.
>> 
>> 
>> For me a key detail is what is the lifetime of the vhost device?
>> 
>> Does the vhost go away when the caller goes away?
>
> Yes. When the caller, normally qemu in our case, that created the worker
> thread exits, then we free the vhost devices and stop and free the worker
> threads we are creating in this patchset.
>
> However, I'm not sure if it makes a difference to you, but we also have second
> way to free a vhost device and its worker thread. The user can run a command
> that instructs the the qemu process to free the vhost device and its worker
> thread.

I dug a little deeper to understand how this works, and it appears to be
a standard file descriptor based API.  The last close of the file
descriptor is what causes the vhost_dev_cleanup to be called which shuts
down the thread.

This means that in rare cases the file descriptor can be passed to
another process and be held open there, even after the main process
exits.

This says to me that much as it might be handy your thread does not
strictly share the same lifetime as your qemu process.


>>   If so you can create a thread in the caller's process that only performs
>>   work in kernel space.  At which point you are essentially
>>   create_io_thread.
>> 
>> If the vhost device can live after the caller goes away how is that managed?
>
> When the caller goes away we free the devices and their worker threads.
>
> Either before the caller exists it does an explicit close to release the device
> which frees the device and its worker thread, or when the process exits and the
> kernel does a put on its open devices that will trigger the vhost device's release
> function and we free device and its thread at that time.

All of which says to me that the vhost devices semantically work well as
separate processes (that never run userspace code) not as threads of the
creating userspace process.

So I would recommend creating a minimal version of the kthread api,
using create_process targeted only at the vhost case.  Essentially what
you have done with this patchset, but without any configuration knobs
from the callers perspective.

Which means that you can hard code calling ignore_signals, and the
like, instead of needing to have a separate configuration knob for each
place io_workers are different from vhost_workers.

In the future I can see io_workers evolving into a general user space
thread that only runs code in the kernel abstraction, and I can see
vhost_workers evolving into a general user space process that only runs
code in the kernel abstraction.

For now we don't need that generality so please just create a
create_vhost_process helper akin to create_io_thread that does just what
you need.

I don't know if it is better to base it on kernel_clone or on
copy_process.  All I am certain of is that you need to set
"args->exit_signal = -1;".  This prevents having to play games with
do_notify_parent.

Eric
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-22 18:24           ` Eric W. Biederman
@ 2021-12-22 20:25             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2021-12-22 20:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Mike Christie, geert, vverma, hdanton, hch, stefanha, jasowang,
	sgarzare, virtualization, christian.brauner, axboe, linux-kernel

On Wed, Dec 22, 2021 at 12:24:08PM -0600, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
> > On 12/21/21 6:20 PM, Eric W. Biederman wrote:
> >> michael.christie@oracle.com writes:
> >> 
> >>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
> >>>> Mike Christie <michael.christie@oracle.com> writes:
> >>>>
> >>>>> The following patches made over Linus's tree, allow the vhost layer to do
> >>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> >>>>> io_uring does a copy_process against its userspace app. This allows the
> >>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
> >>>>> space, etc and this worker thread will also be accounted for against that
> >>>>> owner/parent process's RLIMIT_NPROC limit.
> >>>>>
> >>>>> If you are not familiar with qemu and vhost here is more detailed
> >>>>> problem description:
> >>>>>
> >>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
> >>>>> etc IO and management operations from worker threads created by the
> >>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
> >>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> >>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> >>>>> thread's cgroups.
> >>>>>
> >>>>> The problem with this approach is that we then have to add new functions/
> >>>>> args/functionality for every thing we want to inherit. I started doing
> >>>>> that here:
> >>>>>
> >>>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
> >>>>>
> >>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
> >>>>> inherit everything from the beginning, becuase I'd need to do something
> >>>>> like that patch several times.
> >>>>
> >>>> I read through the code and I don't see why you want to make these
> >>>> almost threads of a process not actually threads of that process
> >>>> (like the io_uring threads are).
> >>>>
> >>>> As a separate process there are many things that will continue to be
> >>>> disjoint.  If the thread changes cgroups for example your new process
> >>>> won't follow.
> >>>>
> >>>> If you want them to be actually processes with an lifetime independent
> >>>> of the creating process I expect you want to reparent them to the local
> >>>> init process.  Just so they don't confuse the process tree.  Plus init
> >>>> processes know how to handle unexpected children.
> >>>>
> >>>> What are the semantics you are aiming for?
> >>>>
> >>>
> >>> Hi Eric,
> >>>
> >>> Right now, for vhost we need the thread being created:
> >>>
> >>> 1. added to the caller's cgroup.
> >>> 2. to share the mm struct with the caller.
> >>> 3. to be accounted for under the caller's nproc rlimit value.
> >>>
> >>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
> >>> already.
> >>>
> >>> This patchset started with me just trying to handle #3 by modifying kthreads
> >>> like here:
> >>>
> >>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$ 
> >>>
> >>> So we can use kthreads and the existing helpers and add:
> >>>
> >>> A. a ucounts version of the above patches in the link
> >>>
> >>> or
> >>>
> >>> B. a helper that does something like copy_process's use of
> >>> is_ucounts_overlimit and vhost can call that.
> >>>
> >>> instead of this patchset.
> >> 
> >> I don't fundamentally hate the patchset.  I do have concerns about
> >> the completely broken patch.
> >> 
> >> With respect this patchset my gut says decide.  Are you a thread of the
> >> process (just use create_io_worker) are you a separate process forked
> >> from the caller (use a cousin of create_io_worker but don't touch
> >> create_io_worker).  I think being a process vs being a thread is such a
> >> fundamental difference we don't want to mix the helpers.
> >> 
> >>> Before we even get to the next section below, do you consider items 1 - 3
> >>> something we need an API based on copy_process for?
> >> 
> >> I think 3 staying in the callers nproc strongly suggests you want to
> >> reuse copy_process.  Which gets back to my question do you want
> >> a thread or do you want a process.
> >> 
> >> 
> >> For me a key detail is what is the lifetime of the vhost device?
> >> 
> >> Does the vhost go away when the caller goes away?
> >
> > Yes. When the caller, normally qemu in our case, that created the worker
> > thread exits, then we free the vhost devices and stop and free the worker
> > threads we are creating in this patchset.
> >
> > However, I'm not sure if it makes a difference to you, but we also have second
> > way to free a vhost device and its worker thread. The user can run a command
> > that instructs the the qemu process to free the vhost device and its worker
> > thread.
> 
> I dug a little deeper to understand how this works, and it appears to be
> a standard file descriptor based API.  The last close of the file
> descriptor is what causes the vhost_dev_cleanup to be called which shuts
> down the thread.
> 
> This means that in rare cases the file descriptor can be passed to
> another process and be held open there, even after the main process
> exits.

It's true enough, however it will keep a reference to the mm
of the original process and keep accessing that.
Which in turn makes this kind of trick useless enough that
I'm pretty sure no one does it in practice, if we want to
change this aspect I think we can without worrying about
breaking some usespace.



> This says to me that much as it might be handy your thread does not
> strictly share the same lifetime as your qemu process.
> 
> 
> >>   If so you can create a thread in the caller's process that only performs
> >>   work in kernel space.  At which point you are essentially
> >>   create_io_thread.
> >> 
> >> If the vhost device can live after the caller goes away how is that managed?
> >
> > When the caller goes away we free the devices and their worker threads.
> >
> > Either before the caller exists it does an explicit close to release the device
> > which frees the device and its worker thread, or when the process exits and the
> > kernel does a put on its open devices that will trigger the vhost device's release
> > function and we free device and its thread at that time.
> 
> All of which says to me that the vhost devices semantically work well as
> separate processes (that never run userspace code) not as threads of the
> creating userspace process.
> 
> So I would recommend creating a minimal version of the kthread api,
> using create_process targeted only at the vhost case.  Essentially what
> you have done with this patchset, but without any configuration knobs
> from the callers perspective.
> 
> Which means that you can hard code calling ignore_signals, and the
> like, instead of needing to have a separate configuration knob for each
> place io_workers are different from vhost_workers.
> 
> In the future I can see io_workers evolving into a general user space
> thread that only runs code in the kernel abstraction, and I can see
> vhost_workers evolving into a general user space process that only runs
> code in the kernel abstraction.
> 
> For now we don't need that generality so please just create a
> create_vhost_process helper akin to create_io_thread that does just what
> you need.
> 
> I don't know if it is better to base it on kernel_clone or on
> copy_process.  All I am certain of is that you need to set
> "args->exit_signal = -1;".  This prevents having to play games with
> do_notify_parent.
> 
> Eric


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2021-12-22 20:25             ` Michael S. Tsirkin
  0 siblings, 0 replies; 57+ messages in thread
From: Michael S. Tsirkin @ 2021-12-22 20:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, linux-kernel, virtualization, hch, vverma, geert,
	stefanha, christian.brauner

On Wed, Dec 22, 2021 at 12:24:08PM -0600, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
> > On 12/21/21 6:20 PM, Eric W. Biederman wrote:
> >> michael.christie@oracle.com writes:
> >> 
> >>> On 12/17/21 1:26 PM, Eric W. Biederman wrote:
> >>>> Mike Christie <michael.christie@oracle.com> writes:
> >>>>
> >>>>> The following patches made over Linus's tree, allow the vhost layer to do
> >>>>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
> >>>>> io_uring does a copy_process against its userspace app. This allows the
> >>>>> vhost layer's worker threads to inherit cgroups, namespaces, address
> >>>>> space, etc and this worker thread will also be accounted for against that
> >>>>> owner/parent process's RLIMIT_NPROC limit.
> >>>>>
> >>>>> If you are not familiar with qemu and vhost here is more detailed
> >>>>> problem description:
> >>>>>
> >>>>> Qemu will create vhost devices in the kernel which perform network, SCSI,
> >>>>> etc IO and management operations from worker threads created by the
> >>>>> kthread API. Because the kthread API does a copy_process on the kthreadd
> >>>>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> >>>>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> >>>>> thread's cgroups.
> >>>>>
> >>>>> The problem with this approach is that we then have to add new functions/
> >>>>> args/functionality for every thing we want to inherit. I started doing
> >>>>> that here:
> >>>>>
> >>>>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1233__;!!ACWV5N9M2RV99hQ!eIaEe9V8mCgGU6vyvaWTKGi3Zlnz0rgk5Y-0nsBXRbsuVZsM8lYfHr8I8GRuoLYPYrOB$ 
> >>>>>
> >>>>> for the RLIMIT_NPROC check, but it seems it might be easier to just
> >>>>> inherit everything from the beginning, becuase I'd need to do something
> >>>>> like that patch several times.
> >>>>
> >>>> I read through the code and I don't see why you want to make these
> >>>> almost threads of a process not actually threads of that process
> >>>> (like the io_uring threads are).
> >>>>
> >>>> As a separate process there are many things that will continue to be
> >>>> disjoint.  If the thread changes cgroups for example your new process
> >>>> won't follow.
> >>>>
> >>>> If you want them to be actually processes with an lifetime independent
> >>>> of the creating process I expect you want to reparent them to the local
> >>>> init process.  Just so they don't confuse the process tree.  Plus init
> >>>> processes know how to handle unexpected children.
> >>>>
> >>>> What are the semantics you are aiming for?
> >>>>
> >>>
> >>> Hi Eric,
> >>>
> >>> Right now, for vhost we need the thread being created:
> >>>
> >>> 1. added to the caller's cgroup.
> >>> 2. to share the mm struct with the caller.
> >>> 3. to be accounted for under the caller's nproc rlimit value.
> >>>
> >>> For 1 and 2, we have cgroup_attach_task_all and get_task_mm
> >>> already.
> >>>
> >>> This patchset started with me just trying to handle #3 by modifying kthreads
> >>> like here:
> >>>
> >>> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/23/1234__;!!ACWV5N9M2RV99hQ!bvqZOWy7TxQyq18L4I_a5MxP2OX0V2imOYEJrWsc-LkyVTI_zpFzxyV2pM_dgYywwH2y$ 
> >>>
> >>> So we can use kthreads and the existing helpers and add:
> >>>
> >>> A. a ucounts version of the above patches in the link
> >>>
> >>> or
> >>>
> >>> B. a helper that does something like copy_process's use of
> >>> is_ucounts_overlimit and vhost can call that.
> >>>
> >>> instead of this patchset.
> >> 
> >> I don't fundamentally hate the patchset.  I do have concerns about
> >> the completely broken patch.
> >> 
> >> With respect this patchset my gut says decide.  Are you a thread of the
> >> process (just use create_io_worker) are you a separate process forked
> >> from the caller (use a cousin of create_io_worker but don't touch
> >> create_io_worker).  I think being a process vs being a thread is such a
> >> fundamental difference we don't want to mix the helpers.
> >> 
> >>> Before we even get to the next section below, do you consider items 1 - 3
> >>> something we need an API based on copy_process for?
> >> 
> >> I think 3 staying in the callers nproc strongly suggests you want to
> >> reuse copy_process.  Which gets back to my question do you want
> >> a thread or do you want a process.
> >> 
> >> 
> >> For me a key detail is what is the lifetime of the vhost device?
> >> 
> >> Does the vhost go away when the caller goes away?
> >
> > Yes. When the caller, normally qemu in our case, that created the worker
> > thread exits, then we free the vhost devices and stop and free the worker
> > threads we are creating in this patchset.
> >
> > However, I'm not sure if it makes a difference to you, but we also have second
> > way to free a vhost device and its worker thread. The user can run a command
> > that instructs the the qemu process to free the vhost device and its worker
> > thread.
> 
> I dug a little deeper to understand how this works, and it appears to be
> a standard file descriptor based API.  The last close of the file
> descriptor is what causes the vhost_dev_cleanup to be called which shuts
> down the thread.
> 
> This means that in rare cases the file descriptor can be passed to
> another process and be held open there, even after the main process
> exits.

It's true enough, however it will keep a reference to the mm
of the original process and keep accessing that.
Which in turn makes this kind of trick useless enough that
I'm pretty sure no one does it in practice, if we want to
change this aspect I think we can without worrying about
breaking some usespace.



> This says to me that much as it might be handy your thread does not
> strictly share the same lifetime as your qemu process.
> 
> 
> >>   If so you can create a thread in the caller's process that only performs
> >>   work in kernel space.  At which point you are essentially
> >>   create_io_thread.
> >> 
> >> If the vhost device can live after the caller goes away how is that managed?
> >
> > When the caller goes away we free the devices and their worker threads.
> >
> > Either before the caller exists it does an explicit close to release the device
> > which frees the device and its worker thread, or when the process exits and the
> > kernel does a put on its open devices that will trigger the vhost device's release
> > function and we free device and its thread at that time.
> 
> All of which says to me that the vhost devices semantically work well as
> separate processes (that never run userspace code) not as threads of the
> creating userspace process.
> 
> So I would recommend creating a minimal version of the kthread api,
> using create_process targeted only at the vhost case.  Essentially what
> you have done with this patchset, but without any configuration knobs
> from the callers perspective.
> 
> Which means that you can hard code calling ignore_signals, and the
> like, instead of needing to have a separate configuration knob for each
> place io_workers are different from vhost_workers.
> 
> In the future I can see io_workers evolving into a general user space
> thread that only runs code in the kernel abstraction, and I can see
> vhost_workers evolving into a general user space process that only runs
> code in the kernel abstraction.
> 
> For now we don't need that generality so please just create a
> create_vhost_process helper akin to create_io_thread that does just what
> you need.
> 
> I don't know if it is better to base it on kernel_clone or on
> copy_process.  All I am certain of is that you need to set
> "args->exit_signal = -1;".  This prevents having to play games with
> do_notify_parent.
> 
> Eric

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2021-12-22 18:24           ` Eric W. Biederman
@ 2022-01-17 16:41             ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-01-17 16:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 12/22/21 12:24 PM, Eric W. Biederman wrote:
> All I am certain of is that you need to set
> "args->exit_signal = -1;".  This prevents having to play games with
> do_notify_parent.

Hi Eric,

I have all your review comments handled except this one. It's looking like it's
more difficult than just setting the exit_signal=-1, so I wanted to check that
I understood you.

Here is what I'm currently looking at:

1. I can't just set args->exit_signal to -1, because we end up with a task_struct
that's partially setup like a CLONE_THREAD task. What happens is copy_process
will set the task's exit_signal to -1 and then thread_group_leader() will return
false. When code like the thread_group_leader check in copy_process runs, we will
then go down the CLONE_THREAD paths which are not setup and hit crashes.

We would need changes like the following which does not crash anymore but is not
correct for many reasons. I am just posting this code as an example of the issue
I am hitting.

@@ -1637,11 +1637,13 @@ static void posix_cpu_timers_init_group(struct signal_struct *sig)
 	posix_cputimers_group_init(pct, cpu_limit);
 }
 
-static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_signal(unsigned long clone_flags, struct task_struct *tsk,
+		       struct kernel_clone_args *args)
 {
 	struct signal_struct *sig;
 
-	if (clone_flags & CLONE_THREAD)
+	if (clone_flags & CLONE_THREAD || args->exit_signal == -1)
 		return 0;
 
 	sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
@@ -2194,7 +2244,7 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_sighand(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_fs;
-	retval = copy_signal(clone_flags, p);
+	retval = copy_signal(clone_flags, p, args);
 	if (retval)
 		goto bad_fork_cleanup_sighand;
 	retval = copy_mm(clone_flags, p);
@@ -2277,6 +2327,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (clone_flags & CLONE_THREAD) {
 		p->group_leader = current->group_leader;
 		p->tgid = current->tgid;
+	} else if (args->exit_signal == -1) {
+		p->group_leader = current->group_leader;
+		p->tgid = p->pid;
 	} else {
 		p->group_leader = p;
 		p->tgid = p->pid;


2. Instead of #1, I could add some code where we just set
task_struct->exit_signal to -1. We could do this twords the end of copy_process
or after it has returned, but before we do do_exit. However, hat will have similar
issues as #1 during the exit handling.

For example, __exit_signal will call thread_group_leader which would return false.
__unhash_process would then not detach the pid and we would later hit crashes due
to the task_struct being freed already. I could add code like above to the exit related
code paths, but it gets messy like above.

3. I thought I could separate the leader detection from the exit signal by adding
a flag/field to kernel_clone_args and task_struct. But then I get to the point
where I just need a check for USER/VHOST_WORKER tasks in exit_notify which is
similar to the patch you didn't like where I added the check in do_notify_parent.
So I thought you might not like this approach.

Note:
We can't set our task's exit_signal to SIGCHLD and get autoreaped like suggested in
another mail. The original idea for the do_notify_parent was we wanted the behavior
that kthreads have where they get autoreaped on exit. kthreads get autoreaped there
because the threadd task that is the parent ignores all signals and so we hit the
parent SIG_IGN check:

        psig = tsk->parent->sighand;
        spin_lock_irqsave(&psig->siglock, flags);
        if (!tsk->ptrace && sig == SIGCHLD &&
            (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
             (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {

Our parent, the qemu task, does not ignore SIGCHLD and so will not hit the code above.

4. Maybe I am going in the wrong direction and we need kthreads. I could add a:

if (!is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)))
	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);

to vhost.c or to kthread.c when some new arg is passed in.


What do you think?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2022-01-17 16:41             ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-01-17 16:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

On 12/22/21 12:24 PM, Eric W. Biederman wrote:
> All I am certain of is that you need to set
> "args->exit_signal = -1;".  This prevents having to play games with
> do_notify_parent.

Hi Eric,

I have all your review comments handled except this one. It's looking like it's
more difficult than just setting the exit_signal=-1, so I wanted to check that
I understood you.

Here is what I'm currently looking at:

1. I can't just set args->exit_signal to -1, because we end up with a task_struct
that's partially setup like a CLONE_THREAD task. What happens is copy_process
will set the task's exit_signal to -1 and then thread_group_leader() will return
false. When code like the thread_group_leader check in copy_process runs, we will
then go down the CLONE_THREAD paths which are not setup and hit crashes.

We would need changes like the following which does not crash anymore but is not
correct for many reasons. I am just posting this code as an example of the issue
I am hitting.

@@ -1637,11 +1637,13 @@ static void posix_cpu_timers_init_group(struct signal_struct *sig)
 	posix_cputimers_group_init(pct, cpu_limit);
 }
 
-static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_signal(unsigned long clone_flags, struct task_struct *tsk,
+		       struct kernel_clone_args *args)
 {
 	struct signal_struct *sig;
 
-	if (clone_flags & CLONE_THREAD)
+	if (clone_flags & CLONE_THREAD || args->exit_signal == -1)
 		return 0;
 
 	sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
@@ -2194,7 +2244,7 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_sighand(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_fs;
-	retval = copy_signal(clone_flags, p);
+	retval = copy_signal(clone_flags, p, args);
 	if (retval)
 		goto bad_fork_cleanup_sighand;
 	retval = copy_mm(clone_flags, p);
@@ -2277,6 +2327,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (clone_flags & CLONE_THREAD) {
 		p->group_leader = current->group_leader;
 		p->tgid = current->tgid;
+	} else if (args->exit_signal == -1) {
+		p->group_leader = current->group_leader;
+		p->tgid = p->pid;
 	} else {
 		p->group_leader = p;
 		p->tgid = p->pid;


2. Instead of #1, I could add some code where we just set
task_struct->exit_signal to -1. We could do this twords the end of copy_process
or after it has returned, but before we do do_exit. However, hat will have similar
issues as #1 during the exit handling.

For example, __exit_signal will call thread_group_leader which would return false.
__unhash_process would then not detach the pid and we would later hit crashes due
to the task_struct being freed already. I could add code like above to the exit related
code paths, but it gets messy like above.

3. I thought I could separate the leader detection from the exit signal by adding
a flag/field to kernel_clone_args and task_struct. But then I get to the point
where I just need a check for USER/VHOST_WORKER tasks in exit_notify which is
similar to the patch you didn't like where I added the check in do_notify_parent.
So I thought you might not like this approach.

Note:
We can't set our task's exit_signal to SIGCHLD and get autoreaped like suggested in
another mail. The original idea for the do_notify_parent was we wanted the behavior
that kthreads have where they get autoreaped on exit. kthreads get autoreaped there
because the threadd task that is the parent ignores all signals and so we hit the
parent SIG_IGN check:

        psig = tsk->parent->sighand;
        spin_lock_irqsave(&psig->siglock, flags);
        if (!tsk->ptrace && sig == SIGCHLD &&
            (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
             (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {

Our parent, the qemu task, does not ignore SIGCHLD and so will not hit the code above.

4. Maybe I am going in the wrong direction and we need kthreads. I could add a:

if (!is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)))
	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);

to vhost.c or to kthread.c when some new arg is passed in.


What do you think?
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2022-01-17 16:41             ` Mike Christie
@ 2022-01-17 17:31               ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2022-01-17 17:31 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

Mike Christie <michael.christie@oracle.com> writes:

> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>> All I am certain of is that you need to set
>> "args->exit_signal = -1;".  This prevents having to play games with
>> do_notify_parent.
>
> Hi Eric,
>
> I have all your review comments handled except this one. It's looking like it's
> more difficult than just setting the exit_signal=-1, so I wanted to check that
> I understood you.

[snip problems with exit_signal = -1]

>
> What do you think?

I was wrong.  I appear to have confused the thread and the non-thread
cases.

Perhaps I meant "args->exit_signal = 0".  That looks like
do_notify_parent won't send it, and thread_group_leader continues to do
the right thing.

Baring any additional confusion on my part that cleanly solves the
problem of how not to send a signal from a child process cleanly.

My apologies for sending you on a wild goose chase.

Eric



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2022-01-17 17:31               ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2022-01-17 17:31 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

Mike Christie <michael.christie@oracle.com> writes:

> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>> All I am certain of is that you need to set
>> "args->exit_signal = -1;".  This prevents having to play games with
>> do_notify_parent.
>
> Hi Eric,
>
> I have all your review comments handled except this one. It's looking like it's
> more difficult than just setting the exit_signal=-1, so I wanted to check that
> I understood you.

[snip problems with exit_signal = -1]

>
> What do you think?

I was wrong.  I appear to have confused the thread and the non-thread
cases.

Perhaps I meant "args->exit_signal = 0".  That looks like
do_notify_parent won't send it, and thread_group_leader continues to do
the right thing.

Baring any additional confusion on my part that cleanly solves the
problem of how not to send a signal from a child process cleanly.

My apologies for sending you on a wild goose chase.

Eric


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2022-01-17 17:31               ` Eric W. Biederman
@ 2022-01-18 18:51                 ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-01-18 18:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

On 1/17/22 11:31 AM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>> All I am certain of is that you need to set
>>> "args->exit_signal = -1;".  This prevents having to play games with
>>> do_notify_parent.
>>
>> Hi Eric,
>>
>> I have all your review comments handled except this one. It's looking like it's
>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>> I understood you.
> 
> [snip problems with exit_signal = -1]
> 
>>
>> What do you think?
> 
> I was wrong.  I appear to have confused the thread and the non-thread
> cases.
> 
> Perhaps I meant "args->exit_signal = 0".  That looks like
> do_notify_parent won't send it, and thread_group_leader continues to do
> the right thing.

That doesn't work too. exit_notify will call do_notify_parent but 
our parent, qemu, does not ignore SIGCHILD so we will not drop
down in into this chunk:

        psig = tsk->parent->sighand;
        spin_lock_irqsave(&psig->siglock, flags);
        if (!tsk->ptrace && sig == SIGCHLD &&
            (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
             (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {

do_notify_parent will return false and so autoreap in exit_notify will
be false.



> 
> Baring any additional confusion on my part that cleanly solves the
> problem of how not to send a signal from a child process cleanly.
> 
> My apologies for sending you on a wild goose chase.
> 

It's no problem. I learned a lot about signal handling :)
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2022-01-18 18:51                 ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-01-18 18:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 1/17/22 11:31 AM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>> All I am certain of is that you need to set
>>> "args->exit_signal = -1;".  This prevents having to play games with
>>> do_notify_parent.
>>
>> Hi Eric,
>>
>> I have all your review comments handled except this one. It's looking like it's
>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>> I understood you.
> 
> [snip problems with exit_signal = -1]
> 
>>
>> What do you think?
> 
> I was wrong.  I appear to have confused the thread and the non-thread
> cases.
> 
> Perhaps I meant "args->exit_signal = 0".  That looks like
> do_notify_parent won't send it, and thread_group_leader continues to do
> the right thing.

That doesn't work too. exit_notify will call do_notify_parent but 
our parent, qemu, does not ignore SIGCHILD so we will not drop
down in into this chunk:

        psig = tsk->parent->sighand;
        spin_lock_irqsave(&psig->siglock, flags);
        if (!tsk->ptrace && sig == SIGCHLD &&
            (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
             (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {

do_notify_parent will return false and so autoreap in exit_notify will
be false.



> 
> Baring any additional confusion on my part that cleanly solves the
> problem of how not to send a signal from a child process cleanly.
> 
> My apologies for sending you on a wild goose chase.
> 

It's no problem. I learned a lot about signal handling :)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2022-01-18 18:51                 ` Mike Christie
@ 2022-01-18 19:00                   ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-01-18 19:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 1/18/22 12:51 PM, Mike Christie wrote:
> On 1/17/22 11:31 AM, Eric W. Biederman wrote:
>> Mike Christie <michael.christie@oracle.com> writes:
>>
>>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>>> All I am certain of is that you need to set
>>>> "args->exit_signal = -1;".  This prevents having to play games with
>>>> do_notify_parent.
>>>
>>> Hi Eric,
>>>
>>> I have all your review comments handled except this one. It's looking like it's
>>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>>> I understood you.
>>
>> [snip problems with exit_signal = -1]
>>
>>>
>>> What do you think?
>>
>> I was wrong.  I appear to have confused the thread and the non-thread
>> cases.
>>
>> Perhaps I meant "args->exit_signal = 0".  That looks like
>> do_notify_parent won't send it, and thread_group_leader continues to do
>> the right thing.
> 
> That doesn't work too. exit_notify will call do_notify_parent but 
> our parent, qemu, does not ignore SIGCHILD so we will not drop
> down in into this chunk:
> 
>         psig = tsk->parent->sighand;
>         spin_lock_irqsave(&psig->siglock, flags);
>         if (!tsk->ptrace && sig == SIGCHLD &&
>             (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
>              (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> 
> do_notify_parent will return false and so autoreap in exit_notify will
> be false.
> 
> 
> 
>>
>> Baring any additional confusion on my part that cleanly solves the
>> problem of how not to send a signal from a child process cleanly.
>>


Oh yeah, maybe we are thinking about different things.

The issue I am hitting is that the parent, qemu, does not know about
these task_structs. And, userspace can add/delete these vhost devices
dynamically. So qemu could continue to run, but the vhost device could
be deleted. In this case I want to free up the task_struct resources
since we are no longer using it.

With kthreads do_notify_parent returns true and exit_notify calls
release_task, so it's handled for me. I had stuck in the USER/VHOST
check in the patch you didn't like to get that behavior.

If you prefer not adding the VHOST/USER task check in exit_notify then
instead of auto reaping, would another possible alternative be to add a
modified kernel_wait like function in the vhost layer to reap the task?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2022-01-18 19:00                   ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-01-18 19:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

On 1/18/22 12:51 PM, Mike Christie wrote:
> On 1/17/22 11:31 AM, Eric W. Biederman wrote:
>> Mike Christie <michael.christie@oracle.com> writes:
>>
>>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>>> All I am certain of is that you need to set
>>>> "args->exit_signal = -1;".  This prevents having to play games with
>>>> do_notify_parent.
>>>
>>> Hi Eric,
>>>
>>> I have all your review comments handled except this one. It's looking like it's
>>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>>> I understood you.
>>
>> [snip problems with exit_signal = -1]
>>
>>>
>>> What do you think?
>>
>> I was wrong.  I appear to have confused the thread and the non-thread
>> cases.
>>
>> Perhaps I meant "args->exit_signal = 0".  That looks like
>> do_notify_parent won't send it, and thread_group_leader continues to do
>> the right thing.
> 
> That doesn't work too. exit_notify will call do_notify_parent but 
> our parent, qemu, does not ignore SIGCHILD so we will not drop
> down in into this chunk:
> 
>         psig = tsk->parent->sighand;
>         spin_lock_irqsave(&psig->siglock, flags);
>         if (!tsk->ptrace && sig == SIGCHLD &&
>             (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
>              (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> 
> do_notify_parent will return false and so autoreap in exit_notify will
> be false.
> 
> 
> 
>>
>> Baring any additional confusion on my part that cleanly solves the
>> problem of how not to send a signal from a child process cleanly.
>>


Oh yeah, maybe we are thinking about different things.

The issue I am hitting is that the parent, qemu, does not know about
these task_structs. And, userspace can add/delete these vhost devices
dynamically. So qemu could continue to run, but the vhost device could
be deleted. In this case I want to free up the task_struct resources
since we are no longer using it.

With kthreads do_notify_parent returns true and exit_notify calls
release_task, so it's handled for me. I had stuck in the USER/VHOST
check in the patch you didn't like to get that behavior.

If you prefer not adding the VHOST/USER task check in exit_notify then
instead of auto reaping, would another possible alternative be to add a
modified kernel_wait like function in the vhost layer to reap the task?
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2022-01-18 18:51                 ` Mike Christie
@ 2022-01-18 19:12                   ` Eric W. Biederman
  -1 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2022-01-18 19:12 UTC (permalink / raw)
  To: Mike Christie
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

Mike Christie <michael.christie@oracle.com> writes:

> On 1/17/22 11:31 AM, Eric W. Biederman wrote:
>> Mike Christie <michael.christie@oracle.com> writes:
>> 
>>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>>> All I am certain of is that you need to set
>>>> "args->exit_signal = -1;".  This prevents having to play games with
>>>> do_notify_parent.
>>>
>>> Hi Eric,
>>>
>>> I have all your review comments handled except this one. It's looking like it's
>>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>>> I understood you.
>> 
>> [snip problems with exit_signal = -1]
>> 
>>>
>>> What do you think?
>> 
>> I was wrong.  I appear to have confused the thread and the non-thread
>> cases.
>> 
>> Perhaps I meant "args->exit_signal = 0".  That looks like
>> do_notify_parent won't send it, and thread_group_leader continues to do
>> the right thing.
>
> That doesn't work too. exit_notify will call do_notify_parent but 
> our parent, qemu, does not ignore SIGCHILD so we will not drop
> down in into this chunk:
>
>         psig = tsk->parent->sighand;
>         spin_lock_irqsave(&psig->siglock, flags);
>         if (!tsk->ptrace && sig == SIGCHLD &&
>             (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
>              (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
>
> do_notify_parent will return false and so autoreap in exit_notify will
> be false.

Bah good point.  We won't send the signal but you won't autoreap either.

I think we could legitimately change this bit:

	/*
	 * Send with __send_signal as si_pid and si_uid are in the
	 * parent's namespaces.
	 */
	if (valid_signal(sig) && sig)
		__send_signal(sig, &info, tsk->parent, PIDTYPE_TGID, false);

To add:
	else
        	/* We don't notify the parent so just autoreap */
        	autoreap = true;


I expect we could make that change all on it's own, it sort of breaks my
head that requesting not signaling a parent does not also trigger
autoreap behavior.

We certainly need some mechanism to trigger autoreap behavior or the
technical details of being an extra thread of the process need to be
solved.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2022-01-18 19:12                   ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2022-01-18 19:12 UTC (permalink / raw)
  To: Mike Christie
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

Mike Christie <michael.christie@oracle.com> writes:

> On 1/17/22 11:31 AM, Eric W. Biederman wrote:
>> Mike Christie <michael.christie@oracle.com> writes:
>> 
>>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>>> All I am certain of is that you need to set
>>>> "args->exit_signal = -1;".  This prevents having to play games with
>>>> do_notify_parent.
>>>
>>> Hi Eric,
>>>
>>> I have all your review comments handled except this one. It's looking like it's
>>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>>> I understood you.
>> 
>> [snip problems with exit_signal = -1]
>> 
>>>
>>> What do you think?
>> 
>> I was wrong.  I appear to have confused the thread and the non-thread
>> cases.
>> 
>> Perhaps I meant "args->exit_signal = 0".  That looks like
>> do_notify_parent won't send it, and thread_group_leader continues to do
>> the right thing.
>
> That doesn't work too. exit_notify will call do_notify_parent but 
> our parent, qemu, does not ignore SIGCHILD so we will not drop
> down in into this chunk:
>
>         psig = tsk->parent->sighand;
>         spin_lock_irqsave(&psig->siglock, flags);
>         if (!tsk->ptrace && sig == SIGCHLD &&
>             (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
>              (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
>
> do_notify_parent will return false and so autoreap in exit_notify will
> be false.

Bah good point.  We won't send the signal but you won't autoreap either.

I think we could legitimately change this bit:

	/*
	 * Send with __send_signal as si_pid and si_uid are in the
	 * parent's namespaces.
	 */
	if (valid_signal(sig) && sig)
		__send_signal(sig, &info, tsk->parent, PIDTYPE_TGID, false);

To add:
	else
        	/* We don't notify the parent so just autoreap */
        	autoreap = true;


I expect we could make that change all on it's own, it sort of breaks my
head that requesting not signaling a parent does not also trigger
autoreap behavior.

We certainly need some mechanism to trigger autoreap behavior or the
technical details of being an extra thread of the process need to be
solved.

Eric
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
  2022-01-18 19:12                   ` Eric W. Biederman
@ 2022-02-02 21:02                     ` Mike Christie
  -1 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-02-02 21:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: geert, vverma, hdanton, hch, stefanha, jasowang, mst, sgarzare,
	virtualization, christian.brauner, axboe, linux-kernel

On 1/18/22 1:12 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> On 1/17/22 11:31 AM, Eric W. Biederman wrote:
>>> Mike Christie <michael.christie@oracle.com> writes:
>>>
>>>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>>>> All I am certain of is that you need to set
>>>>> "args->exit_signal = -1;".  This prevents having to play games with
>>>>> do_notify_parent.
>>>>
>>>> Hi Eric,
>>>>
>>>> I have all your review comments handled except this one. It's looking like it's
>>>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>>>> I understood you.
>>>
>>> [snip problems with exit_signal = -1]
>>>
>>>>
>>>> What do you think?
>>>
>>> I was wrong.  I appear to have confused the thread and the non-thread
>>> cases.
>>>
>>> Perhaps I meant "args->exit_signal = 0".  That looks like
>>> do_notify_parent won't send it, and thread_group_leader continues to do
>>> the right thing.
>>
>> That doesn't work too. exit_notify will call do_notify_parent but 
>> our parent, qemu, does not ignore SIGCHILD so we will not drop
>> down in into this chunk:
>>
>>         psig = tsk->parent->sighand;
>>         spin_lock_irqsave(&psig->siglock, flags);
>>         if (!tsk->ptrace && sig == SIGCHLD &&
>>             (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
>>              (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
>>
>> do_notify_parent will return false and so autoreap in exit_notify will
>> be false.
> 
> Bah good point.  We won't send the signal but you won't autoreap either.
> 
> I think we could legitimately change this bit:
> 
> 	/*
> 	 * Send with __send_signal as si_pid and si_uid are in the
> 	 * parent's namespaces.
> 	 */
> 	if (valid_signal(sig) && sig)
> 		__send_signal(sig, &info, tsk->parent, PIDTYPE_TGID, false);
> 
> To add:
> 	else
>         	/* We don't notify the parent so just autoreap */
>         	autoreap = true;
> 

Hey,

This works for me, but I think we might have issues where threads now get
reaped too soon when they are being ptraced.

I think I found a simple solution by just using kernel_wait in the vhost
task code since I want to wait for the thread to exit when I'm removing
a device already. I posted a patchset so you can check it out.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH V6 01/10] Use copy_process in vhost layer
@ 2022-02-02 21:02                     ` Mike Christie
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Christie @ 2022-02-02 21:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: axboe, hdanton, mst, linux-kernel, virtualization, hch, vverma,
	geert, stefanha, christian.brauner

On 1/18/22 1:12 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
> 
>> On 1/17/22 11:31 AM, Eric W. Biederman wrote:
>>> Mike Christie <michael.christie@oracle.com> writes:
>>>
>>>> On 12/22/21 12:24 PM, Eric W. Biederman wrote:
>>>>> All I am certain of is that you need to set
>>>>> "args->exit_signal = -1;".  This prevents having to play games with
>>>>> do_notify_parent.
>>>>
>>>> Hi Eric,
>>>>
>>>> I have all your review comments handled except this one. It's looking like it's
>>>> more difficult than just setting the exit_signal=-1, so I wanted to check that
>>>> I understood you.
>>>
>>> [snip problems with exit_signal = -1]
>>>
>>>>
>>>> What do you think?
>>>
>>> I was wrong.  I appear to have confused the thread and the non-thread
>>> cases.
>>>
>>> Perhaps I meant "args->exit_signal = 0".  That looks like
>>> do_notify_parent won't send it, and thread_group_leader continues to do
>>> the right thing.
>>
>> That doesn't work too. exit_notify will call do_notify_parent but 
>> our parent, qemu, does not ignore SIGCHILD so we will not drop
>> down in into this chunk:
>>
>>         psig = tsk->parent->sighand;
>>         spin_lock_irqsave(&psig->siglock, flags);
>>         if (!tsk->ptrace && sig == SIGCHLD &&
>>             (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
>>              (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
>>
>> do_notify_parent will return false and so autoreap in exit_notify will
>> be false.
> 
> Bah good point.  We won't send the signal but you won't autoreap either.
> 
> I think we could legitimately change this bit:
> 
> 	/*
> 	 * Send with __send_signal as si_pid and si_uid are in the
> 	 * parent's namespaces.
> 	 */
> 	if (valid_signal(sig) && sig)
> 		__send_signal(sig, &info, tsk->parent, PIDTYPE_TGID, false);
> 
> To add:
> 	else
>         	/* We don't notify the parent so just autoreap */
>         	autoreap = true;
> 

Hey,

This works for me, but I think we might have issues where threads now get
reaped too soon when they are being ptraced.

I think I found a simple solution by just using kernel_wait in the vhost
task code since I want to wait for the thread to exit when I'm removing
a device already. I posted a patchset so you can check it out.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-02-02 21:03 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-29 19:46 [PATCH V6 01/10] Use copy_process in vhost layer Mike Christie
2021-11-29 19:46 ` Mike Christie
2021-11-29 19:46 ` [PATCH V6 01/10] fork: Make IO worker options flag based Mike Christie
2021-11-29 19:46   ` Mike Christie
2021-11-29 19:46 ` [PATCH V6 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag Mike Christie
2021-11-29 19:46   ` Mike Christie
2021-11-29 19:47 ` [PATCH V6 03/10] fork: add USER_WORKER flag to not dup/clone files Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-11-29 19:47 ` [PATCH V6 04/10] fork: Add USER_WORKER flag to ignore signals Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-11-29 19:47 ` [PATCH V6 05/10] signal: Perfom autoreap for PF_USER_WORKER Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-12-17 18:42   ` Eric W. Biederman
2021-12-17 18:42     ` Eric W. Biederman
2021-11-29 19:47 ` [PATCH V6 06/10] fork: add helpers to clone a process for kernel use Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-12-17 18:53   ` Eric W. Biederman
2021-12-17 18:53     ` Eric W. Biederman
2021-11-29 19:47 ` [PATCH V6 07/10] io_uring: switch to user_worker Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-11-29 19:47 ` [PATCH V6 08/10] fork: remove create_io_thread Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-11-29 19:47 ` [PATCH V6 09/10] vhost: move worker thread fields to new struct Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-11-29 19:47 ` [PATCH V6 10/10] vhost: use user_worker to check RLIMITs Mike Christie
2021-11-29 19:47   ` Mike Christie
2021-12-17 19:01   ` Eric W. Biederman
2021-12-17 19:01     ` Eric W. Biederman
2021-12-08 20:34 ` [PATCH V6 01/10] Use copy_process in vhost layer Michael S. Tsirkin
2021-12-08 20:34   ` Michael S. Tsirkin
2021-12-08 22:13   ` michael.christie
2021-12-08 22:13     ` michael.christie
2021-12-09  9:32     ` Christian Brauner
2021-12-17 19:26 ` Eric W. Biederman
2021-12-17 19:26   ` Eric W. Biederman
2021-12-17 22:08   ` michael.christie
2021-12-17 22:08     ` michael.christie
2021-12-22  0:20     ` Eric W. Biederman
2021-12-22  0:20       ` Eric W. Biederman
2021-12-22 17:32       ` Mike Christie
2021-12-22 17:32         ` Mike Christie
2021-12-22 18:24         ` Eric W. Biederman
2021-12-22 18:24           ` Eric W. Biederman
2021-12-22 20:25           ` Michael S. Tsirkin
2021-12-22 20:25             ` Michael S. Tsirkin
2022-01-17 16:41           ` Mike Christie
2022-01-17 16:41             ` Mike Christie
2022-01-17 17:31             ` Eric W. Biederman
2022-01-17 17:31               ` Eric W. Biederman
2022-01-18 18:51               ` Mike Christie
2022-01-18 18:51                 ` Mike Christie
2022-01-18 19:00                 ` Mike Christie
2022-01-18 19:00                   ` Mike Christie
2022-01-18 19:12                 ` Eric W. Biederman
2022-01-18 19:12                   ` Eric W. Biederman
2022-02-02 21:02                   ` Mike Christie
2022-02-02 21:02                     ` Mike Christie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.