* [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
@ 2024-03-16  0:46 Mike Christie
  2024-03-16  0:46 ` [PATCH 1/9] vhost-scsi: Handle vhost_vq_work_queue failures for events Mike Christie
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:46 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner

The following patches were made over Linus's tree and also apply over
mst's vhost branch. The patches add the ability for vhost_tasks to
handle SIGKILL by flushing queued works, stopping new works from being
queued, and preparing the task for an early exit.

This removes the need for the signal/coredump hacks added in:

Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")

when the vhost_task patches were initially merged, and fixes the issue
reported in this thread:

https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/

Long Background:

The original vhost worker code didn't support any signals. If the
userspace application that owned the worker got a SIGKILL, the
app/process would exit, dropping all references to the device, and then
the file operation's release function would be called. From there we
would wait on running IO and then clean up the device's memory.

When we switched to vhost_tasks being a thread in the owner's process we
added some hacks to the signal/coredump code so we could continue to
wait on running IO and process it from the vhost_task. The idea was that
we would eventually remove the hacks. We recently hit this bug:

https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/

It turns out only vhost-scsi had an issue where it would send a command
to the block/LIO layer, wait for a response, and then process it in the
vhost_task. So patches 1-5 prepare vhost-scsi to handle the case where
the vhost_task is killed while we still have commands outstanding. The
next patches then prepare and convert the vhost and vhost_task layers to
handle SIGKILL by flushing running works, marking the vhost_task as dead
so there are no future uses, and then exiting.
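
To make the new flow concrete, here is a rough user-space model of the
pattern the series implements: a killed worker is marked dead under its
lock, new work submissions fail, and whatever was already queued is
flushed before the thread exits. The names and the pthread-based
structure below are illustrative only; the real code is in the vhost.c
and vhost_task.c changes later in the series and uses the vhost worker's
llist/mutex infrastructure.

/* Illustrative user-space sketch only; not the kernel code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct work {
	void (*fn)(void *arg);
	void *arg;
	struct work *next;
};

struct worker {
	pthread_mutex_t mutex;
	struct work *head;	/* already-queued works */
	bool killed;		/* set once on SIGKILL */
};

/* Models vhost_vq_work_queue: fails once the worker has been killed. */
static bool worker_queue(struct worker *w, struct work *work)
{
	bool queued = false;

	pthread_mutex_lock(&w->mutex);
	if (!w->killed) {
		work->next = w->head;
		w->head = work;
		queued = true;
	}
	pthread_mutex_unlock(&w->mutex);
	return queued;		/* caller must drop/free the work on failure */
}

/* Models the SIGKILL handling: stop new works, then flush existing ones. */
static void worker_kill(struct worker *w)
{
	struct work *work;

	pthread_mutex_lock(&w->mutex);
	w->killed = true;
	while ((work = w->head) != NULL) {
		w->head = work->next;
		work->fn(work->arg);
	}
	pthread_mutex_unlock(&w->mutex);
}

static void say(void *arg) { printf("%s\n", (const char *)arg); }

int main(void)
{
	struct worker w = { .mutex = PTHREAD_MUTEX_INITIALIZER };
	struct work a = { .fn = say, .arg = "flushed queued work" };
	struct work b = { .fn = say, .arg = "never runs" };

	worker_queue(&w, &a);	/* queued before the "signal" arrives */
	worker_kill(&w);	/* flush what is queued and mark dead */
	if (!worker_queue(&w, &b))
		printf("new work rejected after kill\n");
	return 0;
}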




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/9] vhost-scsi: Handle vhost_vq_work_queue failures for events
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
@ 2024-03-16  0:46 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 2/9] vhost-scsi: Handle vhost_vq_work_queue failures for cmds Mike Christie
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:46 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

Currently, we can try to queue an event's work before the vhost_task is
created. When this happens we just drop it in vhost_scsi_do_plug before
even calling vhost_vq_work_queue. During a device shutdown we do the
same thing after vhost_scsi_clear_endpoint has cleared the backends.

In the next patches we will be able to kill the vhost_task before we
have cleared the endpoint. In that case, vhost_vq_work_queue can fail
and we will leak the event's memory. This patch handles the failure by
just freeing the event. This is safe to do because vhost_vq_work_queue
will only return failure for us when the vhost_task is killed, and in
that case userspace will not be able to handle events even if we sent
them.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 282aac45c690..f34f9895b898 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -497,10 +497,8 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt)
 		vq_err(vq, "Faulted on vhost_scsi_send_event\n");
 }
 
-static void vhost_scsi_evt_work(struct vhost_work *work)
+static void vhost_scsi_complete_events(struct vhost_scsi *vs, bool drop)
 {
-	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-					vs_event_work);
 	struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
 	struct vhost_scsi_evt *evt, *t;
 	struct llist_node *llnode;
@@ -508,12 +506,20 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
 	mutex_lock(&vq->mutex);
 	llnode = llist_del_all(&vs->vs_event_list);
 	llist_for_each_entry_safe(evt, t, llnode, list) {
-		vhost_scsi_do_evt_work(vs, evt);
+		if (!drop)
+			vhost_scsi_do_evt_work(vs, evt);
 		vhost_scsi_free_evt(vs, evt);
 	}
 	mutex_unlock(&vq->mutex);
 }
 
+static void vhost_scsi_evt_work(struct vhost_work *work)
+{
+	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
+					     vs_event_work);
+	vhost_scsi_complete_events(vs, false);
+}
+
 static int vhost_scsi_copy_sgl_to_iov(struct vhost_scsi_cmd *cmd)
 {
 	struct iov_iter *iter = &cmd->saved_iter;
@@ -1509,7 +1515,8 @@ vhost_scsi_send_evt(struct vhost_scsi *vs, struct vhost_virtqueue *vq,
 	}
 
 	llist_add(&evt->list, &vs->vs_event_list);
-	vhost_vq_work_queue(vq, &vs->vs_event_work);
+	if (!vhost_vq_work_queue(vq, &vs->vs_event_work))
+		vhost_scsi_complete_events(vs, true);
 }
 
 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/9] vhost-scsi: Handle vhost_vq_work_queue failures for cmds
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
  2024-03-16  0:46 ` [PATCH 1/9] vhost-scsi: Handle vhost_vq_work_queue failures for events Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 3/9] vhost-scsi: Use system wq to flush dev for TMFs Mike Christie
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

In the next patches we will support the vhost_task being killed while in
use. The problem for vhost-scsi is that we can't free some structs until
we get responses for commands we have submitted to the target layer and
we currently process the responses from the vhost_task.

This patch has us just drop the responses and free the command's
resources. When all commands have completed, operations like flush will
be woken up and we can complete device release and endpoint cleanup.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index f34f9895b898..6ec1abe7364f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -358,6 +358,16 @@ static void vhost_scsi_release_tmf_res(struct vhost_scsi_tmf *tmf)
 	vhost_scsi_put_inflight(inflight);
 }
 
+static void vhost_scsi_drop_cmds(struct vhost_scsi_virtqueue *svq)
+{
+	struct vhost_scsi_cmd *cmd, *t;
+	struct llist_node *llnode;
+
+	llnode = llist_del_all(&svq->completion_list);
+	llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list)
+		vhost_scsi_release_cmd_res(&cmd->tvc_se_cmd);
+}
+
 static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 {
 	if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) {
@@ -373,7 +383,8 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 					struct vhost_scsi_virtqueue, vq);
 
 		llist_add(&cmd->tvc_completion_list, &svq->completion_list);
-		vhost_vq_work_queue(&svq->vq, &svq->completion_work);
+		if (!vhost_vq_work_queue(&svq->vq, &svq->completion_work))
+			vhost_scsi_drop_cmds(svq);
 	}
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/9] vhost-scsi: Use system wq to flush dev for TMFs
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
  2024-03-16  0:46 ` [PATCH 1/9] vhost-scsi: Handle vhost_vq_work_queue failures for events Mike Christie
  2024-03-16  0:47 ` [PATCH 2/9] vhost-scsi: Handle vhost_vq_work_queue failures for cmds Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 4/9] vhost: Remove vhost_vq_flush Mike Christie
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

We flush all the workers that are not also used by the ctl vq to make
sure that responses queued by LIO before the TMF response are sent
before the TMF response. This requires a special vhost_vq_flush
function which, in the next patches where we handle SIGKILL killing
workers while in use, will require extra locking/complexity. To avoid
that, this patch has us flush the entire device from the system work
queue, then queue up sending the response from there.

This is a little less optimal since we now flush all workers, but this
will be OK since the commands have already timed out and perf is not a
concern.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 39 +++++++++++++++++++--------------------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6ec1abe7364f..04e0d3f1bd77 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -210,6 +210,7 @@ struct vhost_scsi {
 
 struct vhost_scsi_tmf {
 	struct vhost_work vwork;
+	struct work_struct flush_work;
 	struct vhost_scsi *vhost;
 	struct vhost_scsi_virtqueue *svq;
 
@@ -373,9 +374,8 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
 	if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) {
 		struct vhost_scsi_tmf *tmf = container_of(se_cmd,
 					struct vhost_scsi_tmf, se_cmd);
-		struct vhost_virtqueue *vq = &tmf->svq->vq;
 
-		vhost_vq_work_queue(vq, &tmf->vwork);
+		schedule_work(&tmf->flush_work);
 	} else {
 		struct vhost_scsi_cmd *cmd = container_of(se_cmd,
 					struct vhost_scsi_cmd, tvc_se_cmd);
@@ -1287,33 +1287,31 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work)
 {
 	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
 						  vwork);
-	struct vhost_virtqueue *ctl_vq, *vq;
-	int resp_code, i;
-
-	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
-		/*
-		 * Flush IO vqs that don't share a worker with the ctl to make
-		 * sure they have sent their responses before us.
-		 */
-		ctl_vq = &tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq;
-		for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) {
-			vq = &tmf->vhost->vqs[i].vq;
-
-			if (vhost_vq_is_setup(vq) &&
-			    vq->worker != ctl_vq->worker)
-				vhost_vq_flush(vq);
-		}
+	int resp_code;
 
+	if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
 		resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-	} else {
+	else
 		resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
-	}
 
 	vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 				 tmf->vq_desc, &tmf->resp_iov, resp_code);
 	vhost_scsi_release_tmf_res(tmf);
 }
 
+static void vhost_scsi_tmf_flush_work(struct work_struct *work)
+{
+	struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
+						 flush_work);
+	struct vhost_virtqueue *vq = &tmf->svq->vq;
+	/*
+	 * Make sure we have sent responses for other commands before we
+	 * send our response.
+	 */
+	vhost_dev_flush(vq->dev);
+	vhost_vq_work_queue(vq, &tmf->vwork);
+}
+
 static void
 vhost_scsi_handle_tmf(struct vhost_scsi *vs, struct vhost_scsi_tpg *tpg,
 		      struct vhost_virtqueue *vq,
@@ -1337,6 +1335,7 @@ vhost_scsi_handle_tmf(struct vhost_scsi *vs, struct vhost_scsi_tpg *tpg,
 	if (!tmf)
 		goto send_reject;
 
+	INIT_WORK(&tmf->flush_work, vhost_scsi_tmf_flush_work);
 	vhost_work_init(&tmf->vwork, vhost_scsi_tmf_resp_work);
 	tmf->vhost = vs;
 	tmf->svq = svq;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/9] vhost: Remove vhost_vq_flush
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (2 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 3/9] vhost-scsi: Use system wq to flush dev for TMFs Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 5/9] vhost_scsi: Handle vhost_vq_work_queue failures for TMFs Mike Christie
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

vhost_vq_flush is no longer used so remove it.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 12 ------------
 drivers/vhost/vhost.h |  1 -
 2 files changed, 13 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..cd79075da294 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,18 +263,6 @@ bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
-void vhost_vq_flush(struct vhost_virtqueue *vq)
-{
-	struct vhost_flush_struct flush;
-
-	init_completion(&flush.wait_event);
-	vhost_work_init(&flush.work, vhost_flush_work);
-
-	if (vhost_vq_work_queue(vq, &flush.work))
-		wait_for_completion(&flush.wait_event);
-}
-EXPORT_SYMBOL_GPL(vhost_vq_flush);
-
 /**
  * vhost_worker_flush - flush a worker
  * @worker: worker to flush
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 9e942fcda5c3..91ade037f08e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -205,7 +205,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
-void vhost_vq_flush(struct vhost_virtqueue *vq);
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/9] vhost_scsi: Handle vhost_vq_work_queue failures for TMFs
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (3 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 4/9] vhost: Remove vhost_vq_flush Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 6/9] vhost: Use virtqueue mutex for swapping worker Mike Christie
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

vhost_vq_work_queue will never fail when queueing the TMF's response
handling, because a guest can only send us TMFs when the device is fully
set up, so there is always a worker at that time. In the next patches we
will modify the worker code so it handles SIGKILL by exiting before
outstanding commands/TMFs have sent their responses. In that case
vhost_vq_work_queue can fail when we try to send a response.

This has us just free the TMF's resources since at this time the guest
won't be able to get a response even if we could send it.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 04e0d3f1bd77..006ffacf1c56 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1309,7 +1309,8 @@ static void vhost_scsi_tmf_flush_work(struct work_struct *work)
 	 * send our response.
 	 */
 	vhost_dev_flush(vq->dev);
-	vhost_vq_work_queue(vq, &tmf->vwork);
+	if (!vhost_vq_work_queue(vq, &tmf->vwork))
+		vhost_scsi_release_tmf_res(tmf);
 }
 
 static void
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/9] vhost: Use virtqueue mutex for swapping worker
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (4 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 5/9] vhost_scsi: Handle vhost_vq_work_queue failures for TMFs Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 7/9] vhost: Release worker mutex during flushes Mike Christie
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

__vhost_vq_attach_worker uses the vhost_dev mutex to serialize the
swapping of a virtqueue's worker. This was done for simplicity because
we are already holding that mutex.

In the next patches where the worker can be killed while in use, we need
finer-grained locking because some drivers will hold the vhost_dev mutex
while flushing. However, in the SIGKILL handler we will need to be able
to swap workers (set the current one to NULL), kill queued works, and
stop new flushes while flushes are in progress.

To prepare for that, this patch uses the virtqueue mutex for swapping
workers instead of the vhost_dev one.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cd79075da294..4252c3b827ca 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -652,16 +652,22 @@ static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 {
 	struct vhost_worker *old_worker;
 
-	old_worker = rcu_dereference_check(vq->worker,
-					   lockdep_is_held(&vq->dev->mutex));
-
 	mutex_lock(&worker->mutex);
-	worker->attachment_cnt++;
-	mutex_unlock(&worker->mutex);
+	mutex_lock(&vq->mutex);
+
+	old_worker = rcu_dereference_check(vq->worker,
+					   lockdep_is_held(&vq->mutex));
 	rcu_assign_pointer(vq->worker, worker);
+	worker->attachment_cnt++;
 
-	if (!old_worker)
+	if (!old_worker) {
+		mutex_unlock(&vq->mutex);
+		mutex_unlock(&worker->mutex);
 		return;
+	}
+	mutex_unlock(&vq->mutex);
+	mutex_unlock(&worker->mutex);
+
 	/*
 	 * Take the worker mutex to make sure we see the work queued from
 	 * device wide flushes which doesn't use RCU for execution.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 7/9] vhost: Release worker mutex during flushes
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (5 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 6/9] vhost: Use virtqueue mutex for swapping worker Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 8/9] vhost_task: Handle SIGKILL by flushing work and exiting Mike Christie
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

In the next patches where the worker can be killed while in use, we
need to be able to take the worker mutex and kill queued works for
new IO and flushes, and set some new flags to prevent new
__vhost_vq_attach_worker calls from swapping in/out killed workers.

If we are holding the worker mutex during a flush and the flush's work
is still in the queue, the worker code that will handle the SIGKILL
cleanup won't be able to take the mutex and perform its cleanup. So
this patch has us drop the worker mutex while waiting for the flush
to complete.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4252c3b827ca..b2012993e9fa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -264,21 +264,36 @@ bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
 /**
- * vhost_worker_flush - flush a worker
+ * __vhost_worker_flush - flush a worker
  * @worker: worker to flush
  *
- * This does not use RCU to protect the worker, so the device or worker
- * mutex must be held.
+ * The worker's flush_mutex must be held.
  */
-static void vhost_worker_flush(struct vhost_worker *worker)
+static void __vhost_worker_flush(struct vhost_worker *worker)
 {
 	struct vhost_flush_struct flush;
 
+	if (!worker->attachment_cnt)
+		return;
+
 	init_completion(&flush.wait_event);
 	vhost_work_init(&flush.work, vhost_flush_work);
 
 	vhost_worker_queue(worker, &flush.work);
+	/*
+	 * Drop mutex in case our worker is killed and it needs to take the
+	 * mutex to force cleanup.
+	 */
+	mutex_unlock(&worker->mutex);
 	wait_for_completion(&flush.wait_event);
+	mutex_lock(&worker->mutex);
+}
+
+static void vhost_worker_flush(struct vhost_worker *worker)
+{
+	mutex_lock(&worker->mutex);
+	__vhost_worker_flush(worker);
+	mutex_unlock(&worker->mutex);
 }
 
 void vhost_dev_flush(struct vhost_dev *dev)
@@ -286,15 +301,8 @@ void vhost_dev_flush(struct vhost_dev *dev)
 	struct vhost_worker *worker;
 	unsigned long i;
 
-	xa_for_each(&dev->worker_xa, i, worker) {
-		mutex_lock(&worker->mutex);
-		if (!worker->attachment_cnt) {
-			mutex_unlock(&worker->mutex);
-			continue;
-		}
+	xa_for_each(&dev->worker_xa, i, worker)
 		vhost_worker_flush(worker);
-		mutex_unlock(&worker->mutex);
-	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
@@ -673,7 +681,6 @@ static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 	 * device wide flushes which doesn't use RCU for execution.
 	 */
 	mutex_lock(&old_worker->mutex);
-	old_worker->attachment_cnt--;
 	/*
 	 * We don't want to call synchronize_rcu for every vq during setup
 	 * because it will slow down VM startup. If we haven't done
@@ -684,6 +691,8 @@ static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 	mutex_lock(&vq->mutex);
 	if (!vhost_vq_get_backend(vq) && !vq->kick) {
 		mutex_unlock(&vq->mutex);
+
+		old_worker->attachment_cnt--;
 		mutex_unlock(&old_worker->mutex);
 		/*
 		 * vsock can queue anytime after VHOST_VSOCK_SET_GUEST_CID.
@@ -699,7 +708,8 @@ static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 	/* Make sure new vq queue/flush/poll calls see the new worker */
 	synchronize_rcu();
 	/* Make sure whatever was queued gets run */
-	vhost_worker_flush(old_worker);
+	__vhost_worker_flush(old_worker);
+	old_worker->attachment_cnt--;
 	mutex_unlock(&old_worker->mutex);
 }
 
@@ -752,6 +762,12 @@ static int vhost_free_worker(struct vhost_dev *dev,
 		mutex_unlock(&worker->mutex);
 		return -EBUSY;
 	}
+	/*
+	 * A flush might have raced and snuck in before attachment_cnt was set
+	 * to zero. Make sure flushes are flushed from the queue before
+	 * freeing.
+	 */
+	__vhost_worker_flush(worker);
 	mutex_unlock(&worker->mutex);
 
 	vhost_worker_destroy(dev, worker);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 8/9] vhost_task: Handle SIGKILL by flushing work and exiting
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (6 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 7/9] vhost: Release worker mutex during flushes Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-03-16  0:47 ` [PATCH 9/9] kernel: Remove signal hacks for vhost_tasks Mike Christie
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

Instead of lingering until the device is closed, this has us handle
SIGKILL by:

1. marking the worker as killed so we no longer try to use it with
new virtqueues and new flush operations.
2. clearing the virtqueue-to-worker mapping so no new works are queued.
3. running all the existing works.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c            | 54 +++++++++++++++++++++++++++++---
 drivers/vhost/vhost.h            |  2 ++
 include/linux/sched/vhost_task.h |  3 +-
 kernel/vhost_task.c              | 53 ++++++++++++++++++++-----------
 4 files changed, 88 insertions(+), 24 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b2012993e9fa..d64f2526c6ac 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -273,7 +273,7 @@ static void __vhost_worker_flush(struct vhost_worker *worker)
 {
 	struct vhost_flush_struct flush;
 
-	if (!worker->attachment_cnt)
+	if (!worker->attachment_cnt || worker->killed)
 		return;
 
 	init_completion(&flush.wait_event);
@@ -388,7 +388,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	__vhost_vq_meta_reset(vq);
 }
 
-static bool vhost_worker(void *data)
+static bool vhost_run_work_list(void *data)
 {
 	struct vhost_worker *worker = data;
 	struct vhost_work *work, *work_next;
@@ -413,6 +413,40 @@ static bool vhost_worker(void *data)
 	return !!node;
 }
 
+static void vhost_worker_killed(void *data)
+{
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
+	struct vhost_virtqueue *vq;
+	int i, attach_cnt = 0;
+
+	mutex_lock(&worker->mutex);
+	worker->killed = true;
+
+	for (i = 0; i < dev->nvqs; i++) {
+		vq = dev->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		if (worker ==
+		    rcu_dereference_check(vq->worker,
+					  lockdep_is_held(&vq->mutex))) {
+			rcu_assign_pointer(vq->worker, NULL);
+			attach_cnt++;
+		}
+		mutex_unlock(&vq->mutex);
+	}
+
+	worker->attachment_cnt -= attach_cnt;
+	if (attach_cnt)
+		synchronize_rcu();
+	/*
+	 * Finish vhost_worker_flush calls and any other works that snuck in
+	 * before the synchronize_rcu.
+	 */
+	vhost_run_work_list(worker);
+	mutex_unlock(&worker->mutex);
+}
+
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 {
 	kfree(vq->indirect);
@@ -627,9 +661,11 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 	if (!worker)
 		return NULL;
 
+	worker->dev = dev;
 	snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-	vtsk = vhost_task_create(vhost_worker, worker, name);
+	vtsk = vhost_task_create(vhost_run_work_list, vhost_worker_killed,
+				 worker, name);
 	if (!vtsk)
 		goto free_worker;
 
@@ -661,6 +697,11 @@ static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 	struct vhost_worker *old_worker;
 
 	mutex_lock(&worker->mutex);
+	if (worker->killed) {
+		mutex_unlock(&worker->mutex);
+		return;
+	}
+
 	mutex_lock(&vq->mutex);
 
 	old_worker = rcu_dereference_check(vq->worker,
@@ -681,6 +722,11 @@ static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 	 * device wide flushes which doesn't use RCU for execution.
 	 */
 	mutex_lock(&old_worker->mutex);
+	if (old_worker->killed) {
+		mutex_unlock(&old_worker->mutex);
+		return;
+	}
+
 	/*
 	 * We don't want to call synchronize_rcu for every vq during setup
 	 * because it will slow down VM startup. If we haven't done
@@ -758,7 +804,7 @@ static int vhost_free_worker(struct vhost_dev *dev,
 		return -ENODEV;
 
 	mutex_lock(&worker->mutex);
-	if (worker->attachment_cnt) {
+	if (worker->attachment_cnt || worker->killed) {
 		mutex_unlock(&worker->mutex);
 		return -EBUSY;
 	}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 91ade037f08e..bb75a292d50c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -28,12 +28,14 @@ struct vhost_work {
 
 struct vhost_worker {
 	struct vhost_task	*vtsk;
+	struct vhost_dev	*dev;
 	/* Used to serialize device wide flushing with worker swapping. */
 	struct mutex		mutex;
 	struct llist_head	work_list;
 	u64			kcov_handle;
 	u32			id;
 	int			attachment_cnt;
+	bool			killed;
 };
 
 /* Poll a file (eventfd or socket) */
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index bc60243d43b3..25446c5d3508 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -4,7 +4,8 @@
 
 struct vhost_task;
 
-struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
+struct vhost_task *vhost_task_create(bool (*fn)(void *),
+				     void (*handle_kill)(void *), void *arg,
 				     const char *name);
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index da35e5b7f047..48c289947b99 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -10,38 +10,32 @@
 
 enum vhost_task_flags {
 	VHOST_TASK_FLAGS_STOP,
+	VHOST_TASK_FLAGS_KILLED,
 };
 
 struct vhost_task {
 	bool (*fn)(void *data);
+	void (*handle_sigkill)(void *data);
 	void *data;
 	struct completion exited;
 	unsigned long flags;
 	struct task_struct *task;
+	/* serialize SIGKILL and vhost_task_stop calls */
+	struct mutex exit_mutex;
 };
 
 static int vhost_task_fn(void *data)
 {
 	struct vhost_task *vtsk = data;
-	bool dead = false;
 
 	for (;;) {
 		bool did_work;
 
-		if (!dead && signal_pending(current)) {
+		if (signal_pending(current)) {
 			struct ksignal ksig;
-			/*
-			 * Calling get_signal will block in SIGSTOP,
-			 * or clear fatal_signal_pending, but remember
-			 * what was set.
-			 *
-			 * This thread won't actually exit until all
-			 * of the file descriptors are closed, and
-			 * the release function is called.
-			 */
-			dead = get_signal(&ksig);
-			if (dead)
-				clear_thread_flag(TIF_SIGPENDING);
+
+			if (get_signal(&ksig))
+				break;
 		}
 
 		/* mb paired w/ vhost_task_stop */
@@ -57,7 +51,19 @@ static int vhost_task_fn(void *data)
 			schedule();
 	}
 
+	mutex_lock(&vtsk->exit_mutex);
+	/*
+	 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
+	 * When the vhost layer has called vhost_task_stop it's already stopped
+	 * new work and flushed.
+	 */
+	if (!test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
+		set_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags);
+		vtsk->handle_sigkill(vtsk->data);
+	}
 	complete(&vtsk->exited);
+	mutex_unlock(&vtsk->exit_mutex);
+
 	do_exit(0);
 }
 
@@ -78,12 +84,17 @@ EXPORT_SYMBOL_GPL(vhost_task_wake);
  * @vtsk: vhost_task to stop
  *
  * vhost_task_fn ensures the worker thread exits after
- * VHOST_TASK_FLAGS_SOP becomes true.
+ * VHOST_TASK_FLAGS_STOP becomes true.
  */
 void vhost_task_stop(struct vhost_task *vtsk)
 {
-	set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
-	vhost_task_wake(vtsk);
+	mutex_lock(&vtsk->exit_mutex);
+	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) {
+		set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+		vhost_task_wake(vtsk);
+	}
+	mutex_unlock(&vtsk->exit_mutex);
+
 	/*
 	 * Make sure vhost_task_fn is no longer accessing the vhost_task before
 	 * freeing it below.
@@ -96,14 +107,16 @@ EXPORT_SYMBOL_GPL(vhost_task_stop);
 /**
  * vhost_task_create - create a copy of a task to be used by the kernel
  * @fn: vhost worker function
- * @arg: data to be passed to fn
+ * @handle_sigkill: vhost function to handle when we are killed
+ * @arg: data to be passed to fn and handle_sigkill
  * @name: the thread's name
  *
  * This returns a specialized task for use by the vhost layer or NULL on
  * failure. The returned task is inactive, and the caller must fire it up
  * through vhost_task_start().
  */
-struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
+struct vhost_task *vhost_task_create(bool (*fn)(void *),
+				     void (*handle_sigkill)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
@@ -122,8 +135,10 @@ struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
 	if (!vtsk)
 		return NULL;
 	init_completion(&vtsk->exited);
+	mutex_init(&vtsk->exit_mutex);
 	vtsk->data = arg;
 	vtsk->fn = fn;
+	vtsk->handle_sigkill = handle_sigkill;
 
 	args.fn_arg = vtsk;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 9/9] kernel: Remove signal hacks for vhost_tasks
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (7 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 8/9] vhost_task: Handle SIGKILL by flushing work and exiting Mike Christie
@ 2024-03-16  0:47 ` Mike Christie
  2024-04-09  4:16 ` [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Jason Wang
  2024-04-11  8:39 ` Jason Wang
  10 siblings, 0 replies; 32+ messages in thread
From: Mike Christie @ 2024-03-16  0:47 UTC (permalink / raw)
  To: oleg, ebiederm, virtualization, mst, sgarzare, jasowang,
	stefanha, brauner
  Cc: Mike Christie

This removes the signal/coredump hacks added for vhost_tasks in:

Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")

When that patch was added, vhost_tasks did not handle SIGKILL and would
try to ignore/clear the signal and continue on until the device's close
function was called. In the previous patches, vhost_tasks and the vhost
drivers were converted to support SIGKILL by cleaning themselves up and
exiting. The hacks are no longer needed, so this patch removes them.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 fs/coredump.c   | 4 +---
 kernel/exit.c   | 5 +----
 kernel/signal.c | 4 +---
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index be6403b4b14b..8eae24afb3cb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -371,9 +371,7 @@ static int zap_process(struct task_struct *start, int exit_code)
 		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
-			/* The vhost_worker does not particpate in coredumps */
-			if ((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)
-				nr++;
+			nr++;
 		}
 	}
 
diff --git a/kernel/exit.c b/kernel/exit.c
index dfb963d2f862..a8098b29a722 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -414,10 +414,7 @@ static void coredump_task_exit(struct task_struct *tsk)
 	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = tsk->signal->core_state;
 	spin_unlock_irq(&tsk->sighand->siglock);
-
-	/* The vhost_worker does not particpate in coredumps */
-	if (core_state &&
-	    ((tsk->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)) {
+	if (core_state) {
 		struct core_thread self;
 
 		self.task = current;
diff --git a/kernel/signal.c b/kernel/signal.c
index c9c57d053ce4..0fdca333207b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1374,9 +1374,7 @@ int zap_other_threads(struct task_struct *p)
 
 	for_other_threads(p, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		/* Don't require de_thread to wait for the vhost_worker */
-		if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
-			count++;
+		count++;
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (8 preceding siblings ...)
  2024-03-16  0:47 ` [PATCH 9/9] kernel: Remove signal hacks for vhost_tasks Mike Christie
@ 2024-04-09  4:16 ` Jason Wang
  2024-04-09 14:57   ` Mike Christie
  2024-04-18  7:10   ` Michael S. Tsirkin
  2024-04-11  8:39 ` Jason Wang
  10 siblings, 2 replies; 32+ messages in thread
From: Jason Wang @ 2024-04-09  4:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner,
	Laurent Vivier

On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
<michael.christie@oracle.com> wrote:
>
> The following patches were made over Linus's tree and also apply over
> mst's vhost branch. The patches add the ability for vhost_tasks to
> handle SIGKILL by flushing queued works, stop new works from being
> queued, and prepare the task for an early exit.
>
> This removes the need for the signal/coredump hacks added in:
>
> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
>
> when the vhost_task patches were initially merged and fix the issue
> in this thread:
>
> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>
> Long Background:
>
> The original vhost worker code didn't support any signals. If the
> userspace application that owned the worker got a SIGKILL, the app/
> process would exit dropping all references to the device and then the
> file operation's release function would be called. From there we would
> wait on running IO then cleanup the device's memory.
>
> When we switched to vhost_tasks being a thread in the owner's process we
> added some hacks to the signal/coredump code so we could continue to
> wait on running IO and process it from the vhost_task. The idea was that
> we would eventually remove the hacks. We recently hit this bug:
>
> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>
> It turns out only vhost-scsi had an issue where it would send a command
> to the block/LIO layer, wait for a response and then process in the vhost
> task.

Vhost-net TX zerocopy code did the same:

It sends zerocopy packets to the underlying layer and waits for it.
When the DMA is completed, vhost_zerocopy_callback will be called to
schedule vq work to update the used ring.

> So patches 1-5 prepares vhost-scsi to handle when the vhost_task
> is killed while we still have commands outstanding. The next patches then
> prepare and convert the vhost and vhost_task layers to handle SIGKILL
> by flushing running works, marking the vhost_task as dead so there's
> no future uses, then exiting.

Thanks

>
>
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-09  4:16 ` [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Jason Wang
@ 2024-04-09 14:57   ` Mike Christie
  2024-04-09 16:40     ` Michael S. Tsirkin
  2024-04-18  7:10   ` Michael S. Tsirkin
  1 sibling, 1 reply; 32+ messages in thread
From: Mike Christie @ 2024-04-09 14:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner,
	Laurent Vivier

On 4/8/24 11:16 PM, Jason Wang wrote:
>> It turns out only vhost-scsi had an issue where it would send a command
>> to the block/LIO layer, wait for a response and then process in the vhost
>> task.
> Vhost-net TX zerocopy code did the same:
> 
> It sends zerocopy packets to the under layer and waits for the
> underlayer. When the DMA is completed, vhost_zerocopy_callback will be
> called to schedule vq work for used ring updating.

I think it might be slightly different and vhost-net is ok.

Above I meant, vhost-scsi would do the equivalent of:

vhost_zerocopy_callback ->  vhost_net_ubuf_put

from vhost_task/thread.

For vhost-net then, if we get an early exit via a signal, the patches
will flush queued works and prevent new ones from being queued.
vhost_net_release will run due to the process/thread exit code releasing
its refcounts on the device.

vhost_net_release will then do:

vhost_net_flush -> vhost_net_ubuf_put_and_wait

and wait for vhost_zerocopy_callback -> vhost_net_ubuf_put calls.

For vhost-scsi we would hang. We would do our equivalent of vhost_net_ubuf_put
from the vhost task/thread. So, when our release function, vhost_scsi_release,
is called we will also do a similar wait. However, because the task/thread is
killed, we will hang there since the "put" will never be done.
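
To spell out the difference: what matters is which context performs the
final "put" that the release function's wait depends on. Below is a toy
user-space model of the two cases; the names are purely illustrative and
do not correspond to real vhost symbols.

/* Toy model only; a sketch of the hang described above. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int inflight = 1;	/* one outstanding command/buffer */

/* The final "put": drops the last reference and wakes the waiter. */
static void put_inflight(void)
{
	pthread_mutex_lock(&lock);
	if (--inflight == 0)
		pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
}

/*
 * What release() does: wait for inflight to drain.
 *
 * vhost-net: put_inflight() is driven by the lower layer's completion
 * (vhost_zerocopy_callback context), so it still happens after the
 * vhost_task is gone and this wait finishes.
 *
 * vhost-scsi before this series: put_inflight() only ran from a work
 * executed by the vhost_task. If that task was killed, the work never
 * runs, inflight never hits zero, and this wait hangs forever.
 */
static void release_wait(void)
{
	pthread_mutex_lock(&lock);
	while (inflight)
		pthread_cond_wait(&done, &lock);
	pthread_mutex_unlock(&lock);
}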

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-09 14:57   ` Mike Christie
@ 2024-04-09 16:40     ` Michael S. Tsirkin
  2024-04-09 21:55       ` michael.christie
  0 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2024-04-09 16:40 UTC (permalink / raw)
  To: Mike Christie
  Cc: Jason Wang, oleg, ebiederm, virtualization, sgarzare, stefanha,
	brauner, Laurent Vivier

On Tue, Apr 09, 2024 at 09:57:00AM -0500, Mike Christie wrote:
> On 4/8/24 11:16 PM, Jason Wang wrote:
> >> It turns out only vhost-scsi had an issue where it would send a command
> >> to the block/LIO layer, wait for a response and then process in the vhost
> >> task.
> > Vhost-net TX zerocopy code did the same:
> > 
> > It sends zerocopy packets to the under layer and waits for the
> > underlayer. When the DMA is completed, vhost_zerocopy_callback will be
> > called to schedule vq work for used ring updating.
> 
> I think it might be slightly different and vhost-net is ok.
> 
> Above I meant, vhost-scsi would do the equivalent of:
> 
> vhost_zerocopy_callback ->  vhost_net_ubuf_put
> 
> from vhost_task/thread.
> 
> For vhost-net then, if we get an early exit via a signal the patches will flush
> queued works and prevent new ones from being queued. vhost_net_release will run
> due to the process/thread exit code releasing it's refcounts on the device.
> 
> vhost_net_release will then do:
> 
> vhost_net_flush -> vhost_net_ubuf_put_and_wait
> 
> and wait for vhost_zerocopy_callback -> vhost_net_ubuf_put calls.
> 
> For vhost-scsi we would hang. We would do our equivalent of vhost_net_ubuf_put
> from the vhost task/thread. So, when our release function, vhost_scsi_release,
> is called we will also do a similar wait. However, because the task/thread is
> killed, we will hang there since the "put" will never be done.

If you can find a way to cancel them instead of flushing out,
I think it would be better for net, too.

-- 
MST


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-09 16:40     ` Michael S. Tsirkin
@ 2024-04-09 21:55       ` michael.christie
  2024-04-10  4:21         ` Michael S. Tsirkin
  0 siblings, 1 reply; 32+ messages in thread
From: michael.christie @ 2024-04-09 21:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, oleg, ebiederm, virtualization, sgarzare, stefanha,
	brauner, Laurent Vivier

On 4/9/24 11:40 AM, Michael S. Tsirkin wrote:
> On Tue, Apr 09, 2024 at 09:57:00AM -0500, Mike Christie wrote:
>> On 4/8/24 11:16 PM, Jason Wang wrote:
>>>> It turns out only vhost-scsi had an issue where it would send a command
>>>> to the block/LIO layer, wait for a response and then process in the vhost
>>>> task.
>>> Vhost-net TX zerocopy code did the same:
>>>
>>> It sends zerocopy packets to the under layer and waits for the
>>> underlayer. When the DMA is completed, vhost_zerocopy_callback will be
>>> called to schedule vq work for used ring updating.
>>
>> I think it might be slightly different and vhost-net is ok.
>>
>> Above I meant, vhost-scsi would do the equivalent of:
>>
>> vhost_zerocopy_callback ->  vhost_net_ubuf_put
>>
>> from vhost_task/thread.
>>
>> For vhost-net then, if we get an early exit via a signal the patches will flush
>> queued works and prevent new ones from being queued. vhost_net_release will run
>> due to the process/thread exit code releasing it's refcounts on the device.
>>
>> vhost_net_release will then do:
>>
>> vhost_net_flush -> vhost_net_ubuf_put_and_wait
>>
>> and wait for vhost_zerocopy_callback -> vhost_net_ubuf_put calls.
>>
>> For vhost-scsi we would hang. We would do our equivalent of vhost_net_ubuf_put
>> from the vhost task/thread. So, when our release function, vhost_scsi_release,
>> is called we will also do a similar wait. However, because the task/thread is
>> killed, we will hang there since the "put" will never be done.
> 
> If you can find a way to cancel them instead of flushing out,
> I think it would be better for net, too.

I don't think canceling block requests will be possible any time soon.
It's one of those things that people have wanted for decades, but it gets
messy when you are dealing with multiple layers (fs, md/dm, request
queue, device driver, etc) where some don't even use the same struct
(bios vs requests), and you don't want to hurt performance.

So for vhost-scsi we can't cancel running block layer IOs because there
is no interface for it. We have to sit around and wait for them to
complete or time out.

If you are saying cancel at just the vhost/vhost-net/vhost-scsi layers,
then I can do that. However, I think it might be messier. We would clean
up some of the vhost-related resources, then return from the
file_operations->release functions. Then we would have to have some
workqueue that just waits for driver-specific responses (block layer
responses for vhost-scsi). When it gets them, it does "puts" on
its structs and eventually they are freed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-09 21:55       ` michael.christie
@ 2024-04-10  4:21         ` Michael S. Tsirkin
  0 siblings, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2024-04-10  4:21 UTC (permalink / raw)
  To: michael.christie
  Cc: Jason Wang, oleg, ebiederm, virtualization, sgarzare, stefanha,
	brauner, Laurent Vivier

On Tue, Apr 09, 2024 at 04:55:25PM -0500, michael.christie@oracle.com wrote:
> On 4/9/24 11:40 AM, Michael S. Tsirkin wrote:
> > On Tue, Apr 09, 2024 at 09:57:00AM -0500, Mike Christie wrote:
> >> On 4/8/24 11:16 PM, Jason Wang wrote:
> >>>> It turns out only vhost-scsi had an issue where it would send a command
> >>>> to the block/LIO layer, wait for a response and then process in the vhost
> >>>> task.
> >>> Vhost-net TX zerocopy code did the same:
> >>>
> >>> It sends zerocopy packets to the under layer and waits for the
> >>> underlayer. When the DMA is completed, vhost_zerocopy_callback will be
> >>> called to schedule vq work for used ring updating.
> >>
> >> I think it might be slightly different and vhost-net is ok.
> >>
> >> Above I meant, vhost-scsi would do the equivalent of:
> >>
> >> vhost_zerocopy_callback ->  vhost_net_ubuf_put
> >>
> >> from vhost_task/thread.
> >>
> >> For vhost-net then, if we get an early exit via a signal the patches will flush
> >> queued works and prevent new ones from being queued. vhost_net_release will run
> >> due to the process/thread exit code releasing it's refcounts on the device.
> >>
> >> vhost_net_release will then do:
> >>
> >> vhost_net_flush -> vhost_net_ubuf_put_and_wait
> >>
> >> and wait for vhost_zerocopy_callback -> vhost_net_ubuf_put calls.
> >>
> >> For vhost-scsi we would hang. We would do our equivalent of vhost_net_ubuf_put
> >> from the vhost task/thread. So, when our release function, vhost_scsi_release,
> >> is called we will also do a similar wait. However, because the task/thread is
> >> killed, we will hang there since the "put" will never be done.
> > 
> > If you can find a way to cancel them instead of flushing out,
> > I think it would be better for net, too.
> 
> I don't think canceling block requests will be possible any time soon.
> It's one of those things that people have wanted for decades but it gets
> messy when you are dealing multiple layers (fs, md/md, request queue,
> device driver, etc) that some don't even use the same struct sometimes
> (bios vs requests) and you don't want to hurt performance.
> 
> So for vhost-scsi we can't cancel running block layer IOs because there
> is no interface for it. We have to sit around and wait for them to
> complete or timeout.
> 
> If you are saying cancel at just the vhost/vhost-net/vhost-scsi layers
> then I can do that. However, I think it might be messier.

I think that'd be kind of pointless ...

> We would clean
> up some of the vhost related resource, then return from the
> file_operations->release functions. Then we would have to have some
> workqueue that just waits for driver specific responses (block layer
> responses for vhost-scsi). When it gets them, then it does "puts" on
> it's structs and eventually they are freed.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
                   ` (9 preceding siblings ...)
  2024-04-09  4:16 ` [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Jason Wang
@ 2024-04-11  8:39 ` Jason Wang
  2024-04-11 16:19   ` Mike Christie
  10 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-11  8:39 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner

On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
<michael.christie@oracle.com> wrote:
>
> The following patches were made over Linus's tree and also apply over
> mst's vhost branch. The patches add the ability for vhost_tasks to
> handle SIGKILL by flushing queued works, stop new works from being
> queued, and prepare the task for an early exit.
>
> This removes the need for the signal/coredump hacks added in:
>
> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
>
> when the vhost_task patches were initially merged and fix the issue
> in this thread:
>
> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>
> Long Background:
>
> The original vhost worker code didn't support any signals. If the
> userspace application that owned the worker got a SIGKILL, the app/
> process would exit dropping all references to the device and then the
> file operation's release function would be called. From there we would
> wait on running IO then cleanup the device's memory.

A dumb question.

Is this a userspace-noticeable change? For example, with this series
a SIGKILL may shut down the datapath ...

Thanks

>
> When we switched to vhost_tasks being a thread in the owner's process we
> added some hacks to the signal/coredump code so we could continue to
> wait on running IO and process it from the vhost_task. The idea was that
> we would eventually remove the hacks. We recently hit this bug:
>
> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>
> It turns out only vhost-scsi had an issue where it would send a command
> to the block/LIO layer, wait for a response and then process in the vhost
> task. So patches 1-5 prepares vhost-scsi to handle when the vhost_task
> is killed while we still have commands outstanding. The next patches then
> prepare and convert the vhost and vhost_task layers to handle SIGKILL
> by flushing running works, marking the vhost_task as dead so there's
> no future uses, then exiting.
>
>
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-11  8:39 ` Jason Wang
@ 2024-04-11 16:19   ` Mike Christie
  2024-04-12  3:28     ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Christie @ 2024-04-11 16:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner

On 4/11/24 3:39 AM, Jason Wang wrote:
> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> The following patches were made over Linus's tree and also apply over
>> mst's vhost branch. The patches add the ability for vhost_tasks to
>> handle SIGKILL by flushing queued works, stop new works from being
>> queued, and prepare the task for an early exit.
>>
>> This removes the need for the signal/coredump hacks added in:
>>
>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
>>
>> when the vhost_task patches were initially merged and fix the issue
>> in this thread:
>>
>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>>
>> Long Background:
>>
>> The original vhost worker code didn't support any signals. If the
>> userspace application that owned the worker got a SIGKILL, the app/
>> process would exit dropping all references to the device and then the
>> file operation's release function would be called. From there we would
>> wait on running IO then cleanup the device's memory.
> 
> A dumb question.
> 
> Is this a user space noticeable change? For example, with this series
> a SIGKILL may shutdown the datapath ...

It already changed in 6.4. We basically added a new interface to shut
down everything (userspace and vhost kernel parts). So we won't just
shut down the data path while userspace is still running. We will shut
down everything now if you send a SIGKILL to a vhost worker's thread.

Here are lots of details:

- Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
to a vhost worker, we ignore it. Nothing happens. kthreads are special and
can ignore all signals.

You could think of it as the worker being a completely different process
than qemu/userspace, so they have completely different signal handlers.
The vhost worker signal handler ignores all signals, even SIGKILL.

If you send a SIGKILL to a qemu thread, then it just exits right away. We
don't get to do an explicit close() on the vhost device and we don't get
to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
code runs and releases refcounts on the device/file, then the vhost device's
file_operations->release function is called. vhost_dev_cleanup then stops
the vhost worker.

- In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
can be thought of as a thread within the userspace process. With that
change we have the same signal handler as the userspace process.

If you send a SIGKILL to a qemu thread then it works like above.

If you send a SIGKILL to a vhost worker, the vhost worker still sort of
ignores it (that is the hack that I mentioned at the beginning of this
thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
then just continue to process works until file_operations->release is
called and stops the worker.

However, the change in behavior is that because the worker is just a
thread within qemu, qemu is going to exit since they share a signal
handler and userspace can't ignore SIGKILL. We then perform the
steps above like in the pre-6.4 kernel description as if you sent a
SIGKILL directly to a userspace thread.

- With the patches in this thread there is no major difference in behavior
with 6.4 and newer kernels. We might exit a little faster. Instead of the
vhost thread trying to do its hacked-up version of ignoring the signal
and waiting for userspace to exit and call file_operations->release, the
vhost worker thread will exit after it flushes works and stops new ones.
So now, you can have the vhost thread exiting in parallel with the
userspace thread.
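
(For concreteness, the case being discussed is a signal aimed at the
worker thread itself rather than at one of qemu's own threads, e.g.
something like the following sketch, where the pid/tid values are
hypothetical and would come from ps or /proc:)

/* Hypothetical helper: target only the vhost worker thread with SIGKILL. */
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

static int kill_vhost_worker(pid_t owner_tgid, pid_t worker_tid)
{
	/*
	 * Since 6.4 the worker is a thread in the owner's thread group, so
	 * this brings down the whole process (qemu included); with this
	 * series the worker itself also flushes its works and exits instead
	 * of lingering until release().
	 */
	return syscall(SYS_tgkill, owner_tgid, worker_tid, SIGKILL);
}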

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-11 16:19   ` Mike Christie
@ 2024-04-12  3:28     ` Jason Wang
  2024-04-12 16:52       ` michael.christie
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-12  3:28 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner

On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
<michael.christie@oracle.com> wrote:
>
> On 4/11/24 3:39 AM, Jason Wang wrote:
> > On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > <michael.christie@oracle.com> wrote:
> >>
> >> The following patches were made over Linus's tree and also apply over
> >> mst's vhost branch. The patches add the ability for vhost_tasks to
> >> handle SIGKILL by flushing queued works, stop new works from being
> >> queued, and prepare the task for an early exit.
> >>
> >> This removes the need for the signal/coredump hacks added in:
> >>
> >> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >>
> >> when the vhost_task patches were initially merged and fix the issue
> >> in this thread:
> >>
> >> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >>
> >> Long Background:
> >>
> >> The original vhost worker code didn't support any signals. If the
> >> userspace application that owned the worker got a SIGKILL, the app/
> >> process would exit dropping all references to the device and then the
> >> file operation's release function would be called. From there we would
> >> wait on running IO then cleanup the device's memory.
> >
> > A dumb question.
> >
> > Is this a user space noticeable change? For example, with this series
> > a SIGKILL may shutdown the datapath ...
>
> It already changed in 6.4. We basically added a new interface to shutdown
> everything (userspace and vhost kernel parts). So we won't just shutdown
> the data path while userspace is still running. We will shutdown everything
> now if you send a SIGKILL to a vhost worker's thread.

If I understand correctly, Qemu, for example, can still live if SIGKILL
is just sent to the vhost thread.

If this is correct, guests may detect this (for example virtio-net has
a watchdog).

>
> Here are a lots of details:
>
> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> can ignore all signals.
>
> You could think of it as the worker is a completely different process than
> qemu/userspace so they have completely different signal handlers. The
> vhost worker signal handler ignores all signals even SIGKILL.

Yes.

>
> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> don't get to do an explicit close() on the vhost device and we don't get
> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> code runs and releases refcounts on the device/file, then the vhost device's
> file_operations->release function is called. vhost_dev_cleanup then stops
> the vhost worker.

Right.

>
> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> can be thought of as a thread within the userspace process. With that
> change we have the same signal handler as the userspace process.
>
> If you send a SIGKILL to a qemu thread then it works like above.
>
> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> ignores it (that is the hack that I mentioned at the beginning of this
> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> then just continue to process works until file_operations->release
> calls

Yes, so this sticks to the behaviour before vhost_tasks.

>
> However, the change in behavior is that because the worker is just a
> thread within qemu, qemu is going to exit since they share a signal
> handler and userspace can't ignore SIGKILL.
> We then run perform the
> steps above like in the pre-6.4 kernel description as if you sent a
> SIGKILL directly to a userspace thread.
>
> - With the patches in this thread there is no major difference in behavior
> with 6.4 and newer kernels. We might exit a little faster. Instead of the
> vhost thread trying to do it's hacked up version of ignoring the signal
> and waiting for userspace to exit and call file_operations->release, the
> vhost worker thread will exit after it flushes works and stops new ones.
> So now, you can have the vhost thread exiting in parallel with the
> userspace thread.

Everything would be fine if Qemu wanted to quit, but I meant there
could be a case where SIGKILL was sent to the vhost thread but not to
Qemu.

Thanks

>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-12  3:28     ` Jason Wang
@ 2024-04-12 16:52       ` michael.christie
  2024-04-15  8:52         ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: michael.christie @ 2024-04-12 16:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner

On 4/11/24 10:28 PM, Jason Wang wrote:
> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> On 4/11/24 3:39 AM, Jason Wang wrote:
>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
>>> <michael.christie@oracle.com> wrote:
>>>>
>>>> The following patches were made over Linus's tree and also apply over
>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
>>>> handle SIGKILL by flushing queued works, stop new works from being
>>>> queued, and prepare the task for an early exit.
>>>>
>>>> This removes the need for the signal/coredump hacks added in:
>>>>
>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
>>>>
>>>> when the vhost_task patches were initially merged and fix the issue
>>>> in this thread:
>>>>
>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>>>>
>>>> Long Background:
>>>>
>>>> The original vhost worker code didn't support any signals. If the
>>>> userspace application that owned the worker got a SIGKILL, the app/
>>>> process would exit dropping all references to the device and then the
>>>> file operation's release function would be called. From there we would
>>>> wait on running IO then cleanup the device's memory.
>>>
>>> A dumb question.
>>>
>>> Is this a user space noticeable change? For example, with this series
>>> a SIGKILL may shutdown the datapath ...
>>
>> It already changed in 6.4. We basically added a new interface to shutdown
>> everything (userspace and vhost kernel parts). So we won't just shutdown
>> the data path while userspace is still running. We will shutdown everything
>> now if you send a SIGKILL to a vhost worker's thread.
> 
> If I understand correctly, for example Qemu can still live is SIGKILL
> is just send to vhost thread.

Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
We used kthreads which are special and can ignore it like how userspace
can ignore SIGHUP.

6.4 and newer kernels cannot survive. Even if the vhost thread sort of
ignores it like I described below, the signal is still delivered
to the other qemu threads due to the shared signal handler. Userspace
can't ignore SIGKILL. It doesn't have any say in the matter, and the
kernel forces them to exit.

> 
> If this is correct, guests may detect this (for example virtio-net has
> a watchdog).
> 

What did you mean by that part? Do you mean if the vhost thread were to
exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
the guest (virtio-net driver in the guest kernel) would detect that? Or
are you saying the watchdog in the guest can detect signals that the
host gets?


>>
>> Here are a lots of details:
>>
>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
>> can ignore all signals.
>>
>> You could think of it as the worker is a completely different process than
>> qemu/userspace so they have completely different signal handlers. The
>> vhost worker signal handler ignores all signals even SIGKILL.
> 
> Yes.
> 
>>
>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
>> don't get to do an explicit close() on the vhost device and we don't get
>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
>> code runs and releases refcounts on the device/file, then the vhost device's
>> file_operations->release function is called. vhost_dev_cleanup then stops
>> the vhost worker.
> 
> Right.
> 
>>
>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
>> can be thought of as a thread within the userspace process. With that
>> change we have the same signal handler as the userspace process.
>>
>> If you send a SIGKILL to a qemu thread then it works like above.
>>
>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
>> ignores it (that is the hack that I mentioned at the beginning of this
>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
>> then just continue to process works until file_operations->release
>> calls
> 
> Yes, so this sticks to the behaviour before vhost_tasks.

Not exactly. The vhost_task stays alive temporarily.

The signal is still delivered to the userspace threads, and they will
also exit from the SIGKILL. SIGKILL goes to all of the threads in the
process, and all of the userspace threads exit as normal because the
vhost task and the regular userspace threads share a signal handler.
When userspace exits, the kernel force-drops the refcounts on the vhost
devices, which runs the release function, so the vhost_task will then exit.

So what I'm trying to say is that in 6.4 we already changed the behavior.

> 
>>
>> However, the change in behavior is that because the worker is just a
>> thread within qemu, qemu is going to exit since they share a signal
>> handler and userspace can't ignore SIGKILL.
>> We then run perform the
>> steps above like in the pre-6.4 kernel description as if you sent a
>> SIGKILL directly to a userspace thread.
>>
>> - With the patches in this thread there is no major difference in behavior
>> with 6.4 and newer kernels. We might exit a little faster. Instead of the
>> vhost thread trying to do it's hacked up version of ignoring the signal
>> and waiting for userspace to exit and call file_operations->release, the
>> vhost worker thread will exit after it flushes works and stops new ones.
>> So now, you can have the vhost thread exiting in parallel with the
>> userspace thread.
> 
> Everything would be fine if Qemu wanted to quit but I meant there
> could be a case where SIGKILL was sent to the vhost thread but not
> Qemu.


Yeah, in the last mail, I was trying to say that behavior already
changed in 6.4. In 6.4 we basically added a new interface to the vhost
layer that uses signals/SIGKILL to tell the vhost layer and userspace to
shut down completely.

This patchset is just removing the hacks Eric and Oleg allowed for us in
6.4. That code, where we do our hacked-up version of ignoring signals
(clear the signal and just go on), combined with the signals/coredump
related hacks done in this patch:

commit f9010dbdce911ee1f1af1398a24b1f9f992e0080
Author: Mike Christie <michael.christie@oracle.com>
Date:   Thu Jun 1 13:32:32 2023 -0500

    fork, vhost: Use CLONE_THREAD to fix freezer/ps regression


is causing this bug:

https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/

Basically, as Eric mentions in that thread, the signal/kernel developers
considered the signal/coredump hacks to be temporary, and we were
supposed to convert to something that works more like io_uring's
io_worker_exit. That's
what I'm doing in this patchset. Patch 9/9 in this set then removes the
hacks from the core kernel/signal code.
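
For reference, the hack this series removes is roughly the pattern below
in the worker loop (paraphrased, not an exact quote of the 6.4 code): we
consume the fatal signal and clear TIF_SIGPENDING so the worker can keep
running works until userspace has exited and release is called.

	if (!dead && signal_pending(current)) {
		struct ksignal ksig;

		/*
		 * Consume the signal so it doesn't keep interrupting us, but
		 * keep processing works; the task only goes away once all the
		 * fds are closed and file_operations->release runs.
		 */
		dead = get_signal(&ksig);
		if (dead)
			clear_thread_flag(TIF_SIGPENDING);
	}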

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-12 16:52       ` michael.christie
@ 2024-04-15  8:52         ` Jason Wang
  2024-04-17  3:50           ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-15  8:52 UTC (permalink / raw)
  To: michael.christie
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner

On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
>
> On 4/11/24 10:28 PM, Jason Wang wrote:
> > On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > <michael.christie@oracle.com> wrote:
> >>
> >> On 4/11/24 3:39 AM, Jason Wang wrote:
> >>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> >>> <michael.christie@oracle.com> wrote:
> >>>>
> >>>> The following patches were made over Linus's tree and also apply over
> >>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> >>>> handle SIGKILL by flushing queued works, stop new works from being
> >>>> queued, and prepare the task for an early exit.
> >>>>
> >>>> This removes the need for the signal/coredump hacks added in:
> >>>>
> >>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >>>>
> >>>> when the vhost_task patches were initially merged and fix the issue
> >>>> in this thread:
> >>>>
> >>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >>>>
> >>>> Long Background:
> >>>>
> >>>> The original vhost worker code didn't support any signals. If the
> >>>> userspace application that owned the worker got a SIGKILL, the app/
> >>>> process would exit dropping all references to the device and then the
> >>>> file operation's release function would be called. From there we would
> >>>> wait on running IO then cleanup the device's memory.
> >>>
> >>> A dumb question.
> >>>
> >>> Is this a user space noticeable change? For example, with this series
> >>> a SIGKILL may shutdown the datapath ...
> >>
> >> It already changed in 6.4. We basically added a new interface to shutdown
> >> everything (userspace and vhost kernel parts). So we won't just shutdown
> >> the data path while userspace is still running. We will shutdown everything
> >> now if you send a SIGKILL to a vhost worker's thread.
> >
> > If I understand correctly, for example Qemu can still live is SIGKILL
> > is just send to vhost thread.
>
> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> We used kthreads which are special and can ignore it like how userspace
> can ignore SIGHUP.
>
> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> ignores it like I described below where, the signal is still delivered
> to the other qemu threads due to the shared signal handler. Userspace
> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> kernel forces them to exit.

Ok, I see, so the reason is that vhost belongs to the same thread
group as the owner now.

>
> >
> > If this is correct, guests may detect this (for example virtio-net has
> > a watchdog).
> >
>
> What did you mean by that part? Do you mean if the vhost thread were to
> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> the guest (virtio-net driver in the guest kernel) would detect that?

I meant this one. But since we are using CLONE_THREAD, we won't see these.

> Or
> are you saying the watchdog in the guest can detect signals that the
> host gets?
>
>
> >>
> >> Here are a lots of details:
> >>
> >> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> >> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> >> can ignore all signals.
> >>
> >> You could think of it as the worker is a completely different process than
> >> qemu/userspace so they have completely different signal handlers. The
> >> vhost worker signal handler ignores all signals even SIGKILL.
> >
> > Yes.
> >
> >>
> >> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> >> don't get to do an explicit close() on the vhost device and we don't get
> >> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> >> code runs and releases refcounts on the device/file, then the vhost device's
> >> file_operations->release function is called. vhost_dev_cleanup then stops
> >> the vhost worker.
> >
> > Right.
> >
> >>
> >> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> >> can be thought of as a thread within the userspace process. With that
> >> change we have the same signal handler as the userspace process.
> >>
> >> If you send a SIGKILL to a qemu thread then it works like above.
> >>
> >> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> >> ignores it (that is the hack that I mentioned at the beginning of this
> >> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> >> then just continue to process works until file_operations->release
> >> calls
> >
> > Yes, so this sticks to the behaviour before vhost_tasks.
>
> Not exactly. The vhost_task stays alive temporarily.
>
> The signal is still delivered to the userspace threads and they will
> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> the process and all userspace threads exit like normal because the vhost
> task and normal old userspace threads share a signal handler. When
> userspace exits, the kernel force drops the refcounts on the vhost
> devices and that runs the release function so the vhost_task will then exit.
>
> So what I'm trying to say is that in 6.4 we already changed the behavior.

Yes. To tell the truth, it looks even worse, but it might be too late to fix.

>
> >
> >>
> >> However, the change in behavior is that because the worker is just a
> >> thread within qemu, qemu is going to exit since they share a signal
> >> handler and userspace can't ignore SIGKILL.
> >> We then run perform the
> >> steps above like in the pre-6.4 kernel description as if you sent a
> >> SIGKILL directly to a userspace thread.
> >>
> >> - With the patches in this thread there is no major difference in behavior
> >> with 6.4 and newer kernels. We might exit a little faster. Instead of the
> >> vhost thread trying to do it's hacked up version of ignoring the signal
> >> and waiting for userspace to exit and call file_operations->release, the
> >> vhost worker thread will exit after it flushes works and stops new ones.
> >> So now, you can have the vhost thread exiting in parallel with the
> >> userspace thread.
> >
> > Everything would be fine if Qemu wanted to quit but I meant there
> > could be a case where SIGKILL was sent to the vhost thread but not
> > Qemu.
>
>
> Yeah, in the last mail, I was trying to say that behavior already
> changed in 6.4. In 6.4 we basically added a new interface to the vhost
> layer that uses signals/SIGKILL to tell the vhost layer and userspace to
> completely shutdown.
>
> This patchset is just removing the hacks Eric and Oleg allowed for us in
> 6.4. That code where we do our hacked up version of ignoring signals
> (clear it and just go on) combined with the signals/coredump related
> hacks done in this patch:
>
> commit f9010dbdce911ee1f1af1398a24b1f9f992e0080
> Author: Mike Christie <michael.christie@oracle.com>
> Date:   Thu Jun 1 13:32:32 2023 -0500
>
>     fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
>
>
> is causing this bug:
>
> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>
> Basically, as Eric mentions in that thread the signal/kernel developers
> considered the signal/coredump hacks were meant to be temp, and we were
> supposed to convert to work more like io_uring's io_worker_exit. That's
> what I'm doing in this patchset. Patch 9/9 in this set then removes the
> hacks from the core kernel/signal code.
>

Ok.

Thanks


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-15  8:52         ` Jason Wang
@ 2024-04-17  3:50           ` Jason Wang
  2024-04-17 16:03             ` Mike Christie
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-17  3:50 UTC (permalink / raw)
  To: michael.christie
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner,
	Andreas Karis, Laurent Vivier

On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> >
> > On 4/11/24 10:28 PM, Jason Wang wrote:
> > > On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > <michael.christie@oracle.com> wrote:
> > >>
> > >> On 4/11/24 3:39 AM, Jason Wang wrote:
> > >>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > >>> <michael.christie@oracle.com> wrote:
> > >>>>
> > >>>> The following patches were made over Linus's tree and also apply over
> > >>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > >>>> handle SIGKILL by flushing queued works, stop new works from being
> > >>>> queued, and prepare the task for an early exit.
> > >>>>
> > >>>> This removes the need for the signal/coredump hacks added in:
> > >>>>
> > >>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > >>>>
> > >>>> when the vhost_task patches were initially merged and fix the issue
> > >>>> in this thread:
> > >>>>
> > >>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > >>>>
> > >>>> Long Background:
> > >>>>
> > >>>> The original vhost worker code didn't support any signals. If the
> > >>>> userspace application that owned the worker got a SIGKILL, the app/
> > >>>> process would exit dropping all references to the device and then the
> > >>>> file operation's release function would be called. From there we would
> > >>>> wait on running IO then cleanup the device's memory.
> > >>>
> > >>> A dumb question.
> > >>>
> > >>> Is this a user space noticeable change? For example, with this series
> > >>> a SIGKILL may shutdown the datapath ...
> > >>
> > >> It already changed in 6.4. We basically added a new interface to shutdown
> > >> everything (userspace and vhost kernel parts). So we won't just shutdown
> > >> the data path while userspace is still running. We will shutdown everything
> > >> now if you send a SIGKILL to a vhost worker's thread.
> > >
> > > If I understand correctly, for example Qemu can still live is SIGKILL
> > > is just send to vhost thread.
> >
> > Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > We used kthreads which are special and can ignore it like how userspace
> > can ignore SIGHUP.
> >
> > 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > ignores it like I described below where, the signal is still delivered
> > to the other qemu threads due to the shared signal handler. Userspace
> > can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > kernel forces them to exit.
>
> Ok, I see, so the reason is that vhost belongs to the same thread
> group as the owner now.
>
> >
> > >
> > > If this is correct, guests may detect this (for example virtio-net has
> > > a watchdog).
> > >
> >
> > What did you mean by that part? Do you mean if the vhost thread were to
> > exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > the guest (virtio-net driver in the guest kernel) would detect that?
>
> I meant this one. But since we are using CLONE_THREAD, we won't see these.
>
> > Or
> > are you saying the watchdog in the guest can detect signals that the
> > host gets?
> >
> >
> > >>
> > >> Here are a lots of details:
> > >>
> > >> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > >> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > >> can ignore all signals.
> > >>
> > >> You could think of it as the worker is a completely different process than
> > >> qemu/userspace so they have completely different signal handlers. The
> > >> vhost worker signal handler ignores all signals even SIGKILL.
> > >
> > > Yes.
> > >
> > >>
> > >> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > >> don't get to do an explicit close() on the vhost device and we don't get
> > >> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > >> code runs and releases refcounts on the device/file, then the vhost device's
> > >> file_operations->release function is called. vhost_dev_cleanup then stops
> > >> the vhost worker.
> > >
> > > Right.
> > >
> > >>
> > >> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > >> can be thought of as a thread within the userspace process. With that
> > >> change we have the same signal handler as the userspace process.
> > >>
> > >> If you send a SIGKILL to a qemu thread then it works like above.
> > >>
> > >> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > >> ignores it (that is the hack that I mentioned at the beginning of this
> > >> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > >> then just continue to process works until file_operations->release
> > >> calls
> > >
> > > Yes, so this sticks to the behaviour before vhost_tasks.
> >
> > Not exactly. The vhost_task stays alive temporarily.
> >
> > The signal is still delivered to the userspace threads and they will
> > exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > the process and all userspace threads exit like normal because the vhost
> > task and normal old userspace threads share a signal handler. When
> > userspace exits, the kernel force drops the refcounts on the vhost
> > devices and that runs the release function so the vhost_task will then exit.
> >
> > So what I'm trying to say is that in 6.4 we already changed the behavior.
>
> Yes. To say the truth, it looks even worse but it might be too late to fix.

Andreas (cced) has identified two other possible changes:

1) the worker doesn't run in the global PID namespace but in the namespace of the owner
2) the worker doesn't inherit kthreadd's scheduling attributes but the owner's

Though such changes make more sense for some use cases, they may break others.

I wonder if we need to introduce a new flag and bring back the old
kthread code when the flag is not set? Then we would not end up trying
to align the behaviour.

Thanks


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-17  3:50           ` Jason Wang
@ 2024-04-17 16:03             ` Mike Christie
  2024-04-18  4:08               ` Jason Wang
  2024-04-18  7:01               ` Michael S. Tsirkin
  0 siblings, 2 replies; 32+ messages in thread
From: Mike Christie @ 2024-04-17 16:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner,
	Andreas Karis, Laurent Vivier

On 4/16/24 10:50 PM, Jason Wang wrote:
> On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
>>>
>>> On 4/11/24 10:28 PM, Jason Wang wrote:
>>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
>>>> <michael.christie@oracle.com> wrote:
>>>>>
>>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
>>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
>>>>>> <michael.christie@oracle.com> wrote:
>>>>>>>
>>>>>>> The following patches were made over Linus's tree and also apply over
>>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
>>>>>>> handle SIGKILL by flushing queued works, stop new works from being
>>>>>>> queued, and prepare the task for an early exit.
>>>>>>>
>>>>>>> This removes the need for the signal/coredump hacks added in:
>>>>>>>
>>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
>>>>>>>
>>>>>>> when the vhost_task patches were initially merged and fix the issue
>>>>>>> in this thread:
>>>>>>>
>>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
>>>>>>>
>>>>>>> Long Background:
>>>>>>>
>>>>>>> The original vhost worker code didn't support any signals. If the
>>>>>>> userspace application that owned the worker got a SIGKILL, the app/
>>>>>>> process would exit dropping all references to the device and then the
>>>>>>> file operation's release function would be called. From there we would
>>>>>>> wait on running IO then cleanup the device's memory.
>>>>>>
>>>>>> A dumb question.
>>>>>>
>>>>>> Is this a user space noticeable change? For example, with this series
>>>>>> a SIGKILL may shutdown the datapath ...
>>>>>
>>>>> It already changed in 6.4. We basically added a new interface to shutdown
>>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
>>>>> the data path while userspace is still running. We will shutdown everything
>>>>> now if you send a SIGKILL to a vhost worker's thread.
>>>>
>>>> If I understand correctly, for example Qemu can still live is SIGKILL
>>>> is just send to vhost thread.
>>>
>>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
>>> We used kthreads which are special and can ignore it like how userspace
>>> can ignore SIGHUP.
>>>
>>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
>>> ignores it like I described below where, the signal is still delivered
>>> to the other qemu threads due to the shared signal handler. Userspace
>>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
>>> kernel forces them to exit.
>>
>> Ok, I see, so the reason is that vhost belongs to the same thread
>> group as the owner now.
>>
>>>
>>>>
>>>> If this is correct, guests may detect this (for example virtio-net has
>>>> a watchdog).
>>>>
>>>
>>> What did you mean by that part? Do you mean if the vhost thread were to
>>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
>>> the guest (virtio-net driver in the guest kernel) would detect that?
>>
>> I meant this one. But since we are using CLONE_THREAD, we won't see these.
>>
>>> Or
>>> are you saying the watchdog in the guest can detect signals that the
>>> host gets?
>>>
>>>
>>>>>
>>>>> Here are a lots of details:
>>>>>
>>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
>>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
>>>>> can ignore all signals.
>>>>>
>>>>> You could think of it as the worker is a completely different process than
>>>>> qemu/userspace so they have completely different signal handlers. The
>>>>> vhost worker signal handler ignores all signals even SIGKILL.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
>>>>> don't get to do an explicit close() on the vhost device and we don't get
>>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
>>>>> code runs and releases refcounts on the device/file, then the vhost device's
>>>>> file_operations->release function is called. vhost_dev_cleanup then stops
>>>>> the vhost worker.
>>>>
>>>> Right.
>>>>
>>>>>
>>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
>>>>> can be thought of as a thread within the userspace process. With that
>>>>> change we have the same signal handler as the userspace process.
>>>>>
>>>>> If you send a SIGKILL to a qemu thread then it works like above.
>>>>>
>>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
>>>>> ignores it (that is the hack that I mentioned at the beginning of this
>>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
>>>>> then just continue to process works until file_operations->release
>>>>> calls
>>>>
>>>> Yes, so this sticks to the behaviour before vhost_tasks.
>>>
>>> Not exactly. The vhost_task stays alive temporarily.
>>>
>>> The signal is still delivered to the userspace threads and they will
>>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
>>> the process and all userspace threads exit like normal because the vhost
>>> task and normal old userspace threads share a signal handler. When
>>> userspace exits, the kernel force drops the refcounts on the vhost
>>> devices and that runs the release function so the vhost_task will then exit.
>>>
>>> So what I'm trying to say is that in 6.4 we already changed the behavior.
>>
>> Yes. To say the truth, it looks even worse but it might be too late to fix.
> 
> Andres (cced) has identified two other possible changes:
> 
> 1) doesn't run in the global PID namespace but run in the namespace of owner

Yeah, I mentioned that one in vhost.h like it's a feature, and when posting
the patches I mentioned it as a possible fix. I mean, I thought we wanted it
to work like qemu and its iothreads, where the iothread would inherit all
those values automatically.

At the time, I thought we didn't inherit the namespace, like we did the cgroup,
because there was no kernel function for it (like how we didn't inherit v2
cgroups until recently when someone added some code for that).

I don't know if it's allowed to have something like qemu in namespace N but then
have its children (the vhost thread in this case) in the global namespace. I'll
look into it.

> 2) doesn't inherit kthreadd's scheduling attributes but the owner

Same as above for this one. I thought I was fixing a bug: before, we
had to manually tune the vhost thread's values, but for iothreads they
automatically got set up.

Just to clarify this one: when we used kthreads, kthread() reset the
scheduler priority for the kthread that was created, so we got the default
values instead of inheriting kthreadd's values. So we would want:

+	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);

in vhost_task_fn() instead of inheriting kthreadd's values.
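
i.e. something like this at the top of vhost_task_fn() (just a sketch of
the placement; param is the usual zeroed struct sched_param):

static int vhost_task_fn(void *data)
{
	struct sched_param param = { .sched_priority = 0 };
	struct vhost_task *vtsk = data;

	/* Match the old kthread() behavior: reset to the default policy. */
	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
	...
}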

> 
> Though such a change makes more sense for some use cases, it may break others.
> 
> I wonder if we need to introduce a new flag and bring the old kthread

Do you mean something like a module param?

> codes if the flag is not set? Then we would not end up trying to align
> the behaviour?
>

Let me know what you guys prefer. The sched part is easy. The namespace
part might be more difficult, but I will look into it if you want it.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-17 16:03             ` Mike Christie
@ 2024-04-18  4:08               ` Jason Wang
  2024-04-18  7:07                 ` Michael S. Tsirkin
  2024-04-18  7:01               ` Michael S. Tsirkin
  1 sibling, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-18  4:08 UTC (permalink / raw)
  To: Mike Christie
  Cc: oleg, ebiederm, virtualization, mst, sgarzare, stefanha, brauner,
	Andreas Karis, Laurent Vivier

On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
<michael.christie@oracle.com> wrote:
>
> On 4/16/24 10:50 PM, Jason Wang wrote:
> > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> >>>
> >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> >>>> <michael.christie@oracle.com> wrote:
> >>>>>
> >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> >>>>>> <michael.christie@oracle.com> wrote:
> >>>>>>>
> >>>>>>> The following patches were made over Linus's tree and also apply over
> >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> >>>>>>> queued, and prepare the task for an early exit.
> >>>>>>>
> >>>>>>> This removes the need for the signal/coredump hacks added in:
> >>>>>>>
> >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >>>>>>>
> >>>>>>> when the vhost_task patches were initially merged and fix the issue
> >>>>>>> in this thread:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >>>>>>>
> >>>>>>> Long Background:
> >>>>>>>
> >>>>>>> The original vhost worker code didn't support any signals. If the
> >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> >>>>>>> process would exit dropping all references to the device and then the
> >>>>>>> file operation's release function would be called. From there we would
> >>>>>>> wait on running IO then cleanup the device's memory.
> >>>>>>
> >>>>>> A dumb question.
> >>>>>>
> >>>>>> Is this a user space noticeable change? For example, with this series
> >>>>>> a SIGKILL may shutdown the datapath ...
> >>>>>
> >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> >>>>> the data path while userspace is still running. We will shutdown everything
> >>>>> now if you send a SIGKILL to a vhost worker's thread.
> >>>>
> >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> >>>> is just send to vhost thread.
> >>>
> >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> >>> We used kthreads which are special and can ignore it like how userspace
> >>> can ignore SIGHUP.
> >>>
> >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> >>> ignores it like I described below where, the signal is still delivered
> >>> to the other qemu threads due to the shared signal handler. Userspace
> >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> >>> kernel forces them to exit.
> >>
> >> Ok, I see, so the reason is that vhost belongs to the same thread
> >> group as the owner now.
> >>
> >>>
> >>>>
> >>>> If this is correct, guests may detect this (for example virtio-net has
> >>>> a watchdog).
> >>>>
> >>>
> >>> What did you mean by that part? Do you mean if the vhost thread were to
> >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> >>> the guest (virtio-net driver in the guest kernel) would detect that?
> >>
> >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> >>
> >>> Or
> >>> are you saying the watchdog in the guest can detect signals that the
> >>> host gets?
> >>>
> >>>
> >>>>>
> >>>>> Here are a lots of details:
> >>>>>
> >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> >>>>> can ignore all signals.
> >>>>>
> >>>>> You could think of it as the worker is a completely different process than
> >>>>> qemu/userspace so they have completely different signal handlers. The
> >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> >>>>
> >>>> Yes.
> >>>>
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> >>>>> don't get to do an explicit close() on the vhost device and we don't get
> >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> >>>>> the vhost worker.
> >>>>
> >>>> Right.
> >>>>
> >>>>>
> >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> >>>>> can be thought of as a thread within the userspace process. With that
> >>>>> change we have the same signal handler as the userspace process.
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> >>>>>
> >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> >>>>> then just continue to process works until file_operations->release
> >>>>> calls
> >>>>
> >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> >>>
> >>> Not exactly. The vhost_task stays alive temporarily.
> >>>
> >>> The signal is still delivered to the userspace threads and they will
> >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> >>> the process and all userspace threads exit like normal because the vhost
> >>> task and normal old userspace threads share a signal handler. When
> >>> userspace exits, the kernel force drops the refcounts on the vhost
> >>> devices and that runs the release function so the vhost_task will then exit.
> >>>
> >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> >>
> >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> >
> > Andres (cced) has identified two other possible changes:
> >
> > 1) doesn't run in the global PID namespace but run in the namespace of owner
>
> Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> the patches I mentioned it as a possible fix. I mean I thought we wanted it
> to work like qemu and iothreads where the iothread would inherit all those
> values automatically.

Right, but it could be noticed by userspace, especially by users that
try to tweak performance.

The root cause is that we now do copy_process() in the context of Qemu
instead of kthreadd, which results in differences in namespaces (I think
the PID namespace is not the only one where we see a difference) and
other attributes for the vhost task.

>
> At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> because there was no kernel function for it (like how we didn't inherit v2
> cgroups until recently when someone added some code for that).
>
> I don't know if it's allowed to have something like qemu in namespace N but then
> have it's children (vhost thread in this case) in the global namespace.
> I'll
> look into it.

Instead of moving the vhost thread between different namespaces, I wonder
if the following is simpler:

if (new_flag)
    vhost_task_create()
else
    kthread_create()

With the new flag, the worker inherits the attributes of Qemu (namespaces,
rlimit, cgroup, scheduling attributes ...), which is what we want. Without
the new flag, we stick exactly to the past behaviour to unbreak existing
userspace.
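
In vhost_worker_create() that could look something like the sketch below
('inherit_owner' and 'vhost_worker_kthread_fn' are just placeholder names
and the signatures are simplified):

	if (inherit_owner) {
		/*
		 * New behaviour: a thread of the owner, so it inherits the
		 * owner's namespaces, cgroups, rlimits and scheduler settings.
		 */
		vtsk = vhost_task_create(vhost_worker, worker, name);
	} else {
		/* Old behaviour: a kthread parented to kthreadd. */
		task = kthread_create(vhost_worker_kthread_fn, worker, "%s", name);
	}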

>
> > 2) doesn't inherit kthreadd's scheduling attributes but the owner
>
> Same as above for this one. I thought I was fixing a bug where before
> we had to manually tune the vhost thread's values but for iothreads they
> automatically got setup.
>
> Just to clarify this one. When we used kthreads, kthread() will reset the
> scheduler priority for the kthread that's created, so we got the default
> values instead of inheriting kthreadd's values.  So we would want:
>
> +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>
> in vhost_task_fn() instead of inheriting kthreadd's values.
>
> >
> > Though such a change makes more sense for some use cases, it may break others.
> >
> > I wonder if we need to introduce a new flag and bring the old kthread
>
> Do you mean something like a module param?

This requires the management layer to know whether it has a new userspace
or not, which is hard. A better place would be to introduce a backend feature.
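
For example (VHOST_BACKEND_F_OWNER_TASK is just a made-up bit for
illustration), userspace that understands the new worker model could ack
it through the existing backend-features ioctls, roughly:

	__u64 features;

	/* Only ack the made-up bit if the kernel advertises it. */
	if (ioctl(vhost_fd, VHOST_GET_BACKEND_FEATURES, &features) == 0 &&
	    (features & (1ULL << VHOST_BACKEND_F_OWNER_TASK))) {
		features = 1ULL << VHOST_BACKEND_F_OWNER_TASK;
		ioctl(vhost_fd, VHOST_SET_BACKEND_FEATURES, &features);
	}

and the kernel would only pick the new vhost_task path once the bit has
been acked.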

>
> > codes if the flag is not set? Then we would not end up trying to align
> > the behaviour?
> >
>
> Let me know what you guys prefer. The sched part is easy. The namespace
> part might be more difficult, but I will look into it if you want it.

Thanks a lot. I think it would be better to have the namespace part
(as well as the other namespaces); then we don't need to answer hard
questions like whether it can break userspace or not.

>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-17 16:03             ` Mike Christie
  2024-04-18  4:08               ` Jason Wang
@ 2024-04-18  7:01               ` Michael S. Tsirkin
  1 sibling, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2024-04-18  7:01 UTC (permalink / raw)
  To: Mike Christie
  Cc: Jason Wang, oleg, ebiederm, virtualization, sgarzare, stefanha,
	brauner, Andreas Karis, Laurent Vivier

On Wed, Apr 17, 2024 at 11:03:07AM -0500, Mike Christie wrote:
> On 4/16/24 10:50 PM, Jason Wang wrote:
> > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> >>>
> >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> >>>> <michael.christie@oracle.com> wrote:
> >>>>>
> >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> >>>>>> <michael.christie@oracle.com> wrote:
> >>>>>>>
> >>>>>>> The following patches were made over Linus's tree and also apply over
> >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> >>>>>>> queued, and prepare the task for an early exit.
> >>>>>>>
> >>>>>>> This removes the need for the signal/coredump hacks added in:
> >>>>>>>
> >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >>>>>>>
> >>>>>>> when the vhost_task patches were initially merged and fix the issue
> >>>>>>> in this thread:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >>>>>>>
> >>>>>>> Long Background:
> >>>>>>>
> >>>>>>> The original vhost worker code didn't support any signals. If the
> >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> >>>>>>> process would exit dropping all references to the device and then the
> >>>>>>> file operation's release function would be called. From there we would
> >>>>>>> wait on running IO then cleanup the device's memory.
> >>>>>>
> >>>>>> A dumb question.
> >>>>>>
> >>>>>> Is this a user space noticeable change? For example, with this series
> >>>>>> a SIGKILL may shutdown the datapath ...
> >>>>>
> >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> >>>>> the data path while userspace is still running. We will shutdown everything
> >>>>> now if you send a SIGKILL to a vhost worker's thread.
> >>>>
> >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> >>>> is just send to vhost thread.
> >>>
> >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> >>> We used kthreads which are special and can ignore it like how userspace
> >>> can ignore SIGHUP.
> >>>
> >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> >>> ignores it like I described below where, the signal is still delivered
> >>> to the other qemu threads due to the shared signal handler. Userspace
> >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> >>> kernel forces them to exit.
> >>
> >> Ok, I see, so the reason is that vhost belongs to the same thread
> >> group as the owner now.
> >>
> >>>
> >>>>
> >>>> If this is correct, guests may detect this (for example virtio-net has
> >>>> a watchdog).
> >>>>
> >>>
> >>> What did you mean by that part? Do you mean if the vhost thread were to
> >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> >>> the guest (virtio-net driver in the guest kernel) would detect that?
> >>
> >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> >>
> >>> Or
> >>> are you saying the watchdog in the guest can detect signals that the
> >>> host gets?
> >>>
> >>>
> >>>>>
> >>>>> Here are a lots of details:
> >>>>>
> >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> >>>>> can ignore all signals.
> >>>>>
> >>>>> You could think of it as the worker is a completely different process than
> >>>>> qemu/userspace so they have completely different signal handlers. The
> >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> >>>>
> >>>> Yes.
> >>>>
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> >>>>> don't get to do an explicit close() on the vhost device and we don't get
> >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> >>>>> the vhost worker.
> >>>>
> >>>> Right.
> >>>>
> >>>>>
> >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> >>>>> can be thought of as a thread within the userspace process. With that
> >>>>> change we have the same signal handler as the userspace process.
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> >>>>>
> >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> >>>>> then just continue to process works until file_operations->release
> >>>>> calls
> >>>>
> >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> >>>
> >>> Not exactly. The vhost_task stays alive temporarily.
> >>>
> >>> The signal is still delivered to the userspace threads and they will
> >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> >>> the process and all userspace threads exit like normal because the vhost
> >>> task and normal old userspace threads share a signal handler. When
> >>> userspace exits, the kernel force drops the refcounts on the vhost
> >>> devices and that runs the release function so the vhost_task will then exit.
> >>>
> >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> >>
> >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> > 
> > Andres (cced) has identified two other possible changes:
> > 
> > 1) doesn't run in the global PID namespace but run in the namespace of owner
> 
> Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> the patches I mentioned it as a possible fix. I mean I thought we wanted it
> to work like qemu and iothreads where the iothread would inherit all those
> values automatically.
> 
> At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> because there was no kernel function for it (like how we didn't inherit v2
> cgroups until recently when someone added some code for that).
> 
> I don't know if it's allowed to have something like qemu in namespace N but then
> have it's children (vhost thread in this case) in the global namespace. I'll
> look into it.

Yea a big if.

> > 2) doesn't inherit kthreadd's scheduling attributes but the owner
> 
> Same as above for this one. I thought I was fixing a bug where before
> we had to manually tune the vhost thread's values but for iothreads they
> automatically got setup.
> 
> Just to clarify this one. When we used kthreads, kthread() will reset the
> scheduler priority for the kthread that's created, so we got the default
> values instead of inheriting kthreadd's values.  So we would want:
> 
> +	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> 
> in vhost_task_fn() instead of inheriting kthreadd's values.
> 
> > 
> > Though such a change makes more sense for some use cases, it may break others.
> > 
> > I wonder if we need to introduce a new flag and bring the old kthread
> 
> Do you mean something like a module param?
> 
> > codes if the flag is not set? Then we would not end up trying to align
> > the behaviour?
> >
> 
> Let me know what you guys prefer. The sched part is easy. The namespace
> part might be more difficult, but I will look into it if you want it.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-18  4:08               ` Jason Wang
@ 2024-04-18  7:07                 ` Michael S. Tsirkin
  2024-04-18  9:25                   ` Andreas Karis
  2024-04-19  0:33                   ` Jason Wang
  0 siblings, 2 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2024-04-18  7:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Mike Christie, oleg, ebiederm, virtualization, sgarzare,
	stefanha, brauner, Andreas Karis, Laurent Vivier

On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> <michael.christie@oracle.com> wrote:
> >
> > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > >>>
> > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > >>>> <michael.christie@oracle.com> wrote:
> > >>>>>
> > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > >>>>>> <michael.christie@oracle.com> wrote:
> > >>>>>>>
> > >>>>>>> The following patches were made over Linus's tree and also apply over
> > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > >>>>>>> queued, and prepare the task for an early exit.
> > >>>>>>>
> > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > >>>>>>>
> > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > >>>>>>>
> > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > >>>>>>> in this thread:
> > >>>>>>>
> > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > >>>>>>>
> > >>>>>>> Long Background:
> > >>>>>>>
> > >>>>>>> The original vhost worker code didn't support any signals. If the
> > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > >>>>>>> process would exit dropping all references to the device and then the
> > >>>>>>> file operation's release function would be called. From there we would
> > >>>>>>> wait on running IO then cleanup the device's memory.
> > >>>>>>
> > >>>>>> A dumb question.
> > >>>>>>
> > >>>>>> Is this a user space noticeable change? For example, with this series
> > >>>>>> a SIGKILL may shutdown the datapath ...
> > >>>>>
> > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > >>>>> the data path while userspace is still running. We will shutdown everything
> > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > >>>>
> > >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> > >>>> is just send to vhost thread.
> > >>>
> > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > >>> We used kthreads which are special and can ignore it like how userspace
> > >>> can ignore SIGHUP.
> > >>>
> > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > >>> ignores it like I described below where, the signal is still delivered
> > >>> to the other qemu threads due to the shared signal handler. Userspace
> > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > >>> kernel forces them to exit.
> > >>
> > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > >> group as the owner now.
> > >>
> > >>>
> > >>>>
> > >>>> If this is correct, guests may detect this (for example virtio-net has
> > >>>> a watchdog).
> > >>>>
> > >>>
> > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > >>
> > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > >>
> > >>> Or
> > >>> are you saying the watchdog in the guest can detect signals that the
> > >>> host gets?
> > >>>
> > >>>
> > >>>>>
> > >>>>> Here are a lots of details:
> > >>>>>
> > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > >>>>> can ignore all signals.
> > >>>>>
> > >>>>> You could think of it as the worker is a completely different process than
> > >>>>> qemu/userspace so they have completely different signal handlers. The
> > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > >>>>
> > >>>> Yes.
> > >>>>
> > >>>>>
> > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > >>>>> the vhost worker.
> > >>>>
> > >>>> Right.
> > >>>>
> > >>>>>
> > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > >>>>> can be thought of as a thread within the userspace process. With that
> > >>>>> change we have the same signal handler as the userspace process.
> > >>>>>
> > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > >>>>>
> > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > >>>>> then just continue to process works until file_operations->release
> > >>>>> calls
> > >>>>
> > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > >>>
> > >>> Not exactly. The vhost_task stays alive temporarily.
> > >>>
> > >>> The signal is still delivered to the userspace threads and they will
> > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > >>> the process and all userspace threads exit like normal because the vhost
> > >>> task and normal old userspace threads share a signal handler. When
> > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > >>> devices and that runs the release function so the vhost_task will then exit.
> > >>>
> > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > >>
> > >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> > >
> > > Andres (cced) has identified two other possible changes:
> > >
> > > 1) doesn't run in the global PID namespace but run in the namespace of owner
> >
> > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > to work like qemu and iothreads where the iothread would inherit all those
> > values automatically.
> 
> Right, but it could be noticed by userspace, especially by a user
> that tries to tweak performance.
> 
> The root cause is that we now do copy_process() in the process of
> Qemu instead of in kthreadd, which results in differences in the
> namespaces (I think the PID namespace is not the only one where we
> see a difference) and other attributes for the vhost task.

Leaking things out of a namespace looks more like a bug.
If you really have to be pedantic, the thing to add would
be a namespace flag not a qemu flag. Userspace running inside
a namespace really must have no say about whether to leak
info out of it.

> >
> > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > because there was no kernel function for it (like how we didn't inherit v2
> > cgroups until recently when someone added some code for that).
> >
> > I don't know if it's allowed to have something like qemu in namespace N but then
> > have its children (the vhost thread in this case) in the global namespace. I'll
> > look into it.
> 
> Instead of moving the vhost thread between different namespaces, I wonder
> if the following is simpler:
> 
> if (new_flag)
>     vhost_task_create()
> else
>     kthread_create()
> 
> The new flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> scheduling attributes ...), which is what we want. Without the new
> flag, we stick exactly to the past behaviour to avoid breaking
> existing userspace.
> 
> >
> > > 2) doesn't inherit kthreadd's scheduling attributes but the owner
> >
> > Same as above for this one. I thought I was fixing a bug where before
> > we had to manually tune the vhost thread's values, but for iothreads they
> > automatically got set up.
> >
> > Just to clarify this one: when we used kthreads, kthread() would reset the
> > scheduler priority for the kthread that's created, so we got the default
> > values instead of inheriting kthreadd's values. So we would want:
> >
> > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> >
> > in vhost_task_fn() instead of inheriting kthreadd's values.
> >
> > >
> > > Though such a change makes more sense for some use cases, it may break others.
> > >
> > > I wonder if we need to introduce a new flag and bring the old kthread
> >
> > Do you mean something like a module param?
> 
> This requires the management layer to know whether it has a new user space
> or not, which is hard. A better place is to introduce backend features.
> 
> >
> > > codes if the flag is not set? Then we would not end up trying to align
> > > the behaviour?
> > >
> >
> > Let me know what you guys prefer. The sched part is easy. The namespace
> > part might be more difficult, but I will look into it if you want it.
> 
> Thanks a lot. I think it would be better to have the namespace part
> (as well as the other namespaces); then we don't need to answer hard
> questions like whether it can break user space or not.
> 
> >


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-09  4:16 ` [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Jason Wang
  2024-04-09 14:57   ` Mike Christie
@ 2024-04-18  7:10   ` Michael S. Tsirkin
  1 sibling, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2024-04-18  7:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Mike Christie, oleg, ebiederm, virtualization, sgarzare,
	stefanha, brauner, Laurent Vivier

On Tue, Apr 09, 2024 at 12:16:36PM +0800, Jason Wang wrote:
> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> <michael.christie@oracle.com> wrote:
> >
> > The following patches were made over Linus's tree and also apply over
> > mst's vhost branch. The patches add the ability for vhost_tasks to
> > handle SIGKILL by flushing queued works, stop new works from being
> > queued, and prepare the task for an early exit.
> >
> > This removes the need for the signal/coredump hacks added in:
> >
> > Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >
> > when the vhost_task patches were initially merged and fix the issue
> > in this thread:
> >
> > https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >
> > Long Background:
> >
> > The original vhost worker code didn't support any signals. If the
> > userspace application that owned the worker got a SIGKILL, the app/
> > process would exit dropping all references to the device and then the
> > file operation's release function would be called. From there we would
> > wait on running IO then cleanup the device's memory.
> >
> > When we switched to vhost_tasks being a thread in the owner's process we
> > added some hacks to the signal/coredump code so we could continue to
> > wait on running IO and process it from the vhost_task. The idea was that
> > we would eventually remove the hacks. We recently hit this bug:
> >
> > https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >
> > It turns out only vhost-scsi had an issue where it would send a command
> > to the block/LIO layer, wait for a response and then process in the vhost
> > task.
> 
> The vhost-net TX zerocopy code did the same:
> 
> It sends zerocopy packets to the lower layer and waits for it to
> complete them. When the DMA is completed, vhost_zerocopy_callback is
> called to schedule vq work to update the used ring.

Yeah. It's still experimental though, so I'm not sure how much
to stress about it. I guess we can ignore it for now -
but yes, it was one of the big issues with tx zerocopy,
and this patchset opens the path to productizing it.

> > So patches 1-5 prepares vhost-scsi to handle when the vhost_task
> > is killed while we still have commands outstanding. The next patches then
> > prepare and convert the vhost and vhost_task layers to handle SIGKILL
> > by flushing running works, marking the vhost_task as dead so there's
> > no future uses, then exiting.
> 
> Thanks
> 
> >
> >
> >


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-18  7:07                 ` Michael S. Tsirkin
@ 2024-04-18  9:25                   ` Andreas Karis
  2024-04-19  0:37                     ` Jason Wang
  2024-04-19  0:33                   ` Jason Wang
  1 sibling, 1 reply; 32+ messages in thread
From: Andreas Karis @ 2024-04-18  9:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Mike Christie, oleg, ebiederm, virtualization,
	sgarzare, stefanha, brauner, Laurent Vivier

Not a kernel developer, so sorry if this is pointless spam.

But qemu is not the only thing consuming /dev/vhost-net:
https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html

Before the series of patches around
https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
you would run a DPDK application inside namespaces of container
"blue", but then the vhost-... threads were spawned as children of
kthreadd and ran in the global namespaces. That seemed counterintuitive,
and what's worse, the DPDK application running inside the
"blue" container namespaces cannot see and thus cannot modify
scheduling attributes and affinity of its own vhost-... threads. In
scenarios with pretty strict isolation/pinning requirements, the
vhost-... threads could easily be scheduled on the same CPUs as the
DPDK poll mode drivers (because they inherit the calling process' CPU
affinity), and there's no way for the DPDK process itself to move the
vhost-... threads to other CPUs. Also, if the DPDK process ran as
SCHED_RR, the vhost-... thread would still be SCHED_NORMAL.

After the patch series, the fact that the vhost-... threads run as
tasks of the process that requests them seems more natural and gives
us as users the kind of control that we'd want from within the
container to modify the vhost-... threads' scheduling and affinity.
The vhost-... thread as a child of the DPDK application inherits the
same scheduling class, CPU set, etc., and the DPDK process can easily
change those attributes.

However, if another user was used to the old behavior, and their
entire tooling was built around it (imagine someone wrote a bunch of
scripts/services to force affinity from the global PID namespace for
those vhost-... threads), then this change would break their tooling.
So because _I_ might fancy the new behavior while user _B_ might be
all set up for the old kthreadd behavior, shouldn't there be a flag or
configuration for the user to:

if (new_flag)
    vhost_task_create()
else
    kthread_create()

I don't know what kind of flag or knob I'd expect here, but it could
be granular for the calling process (a new syscall?), or a kernel
flag, etc. But something that lets the admin choose how the kernel
spawns the vhost-... threads?
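
A minimal sketch of the kind of control described above (illustration
only, not code from this thread): it assumes the worker's tid has
already been found under /proc/self/task, and the CPU and priority
values are just placeholders.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin an in-process vhost worker to a housekeeping CPU and give it
     * an RT class, something that could not be done from inside the
     * container with the pre-6.4 kthread model. */
    static int retune_vhost_worker(pid_t vhost_tid, int housekeeping_cpu)
    {
            cpu_set_t set;
            struct sched_param sp = { .sched_priority = 1 };

            CPU_ZERO(&set);
            CPU_SET(housekeeping_cpu, &set);
            if (sched_setaffinity(vhost_tid, sizeof(set), &set))
                    return -1;
            return sched_setscheduler(vhost_tid, SCHED_RR, &sp);
    }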

On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > <michael.christie@oracle.com> wrote:
> > >
> > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >>
> > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > >>>
> > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > >>>> <michael.christie@oracle.com> wrote:
> > > >>>>>
> > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > >>>>>> <michael.christie@oracle.com> wrote:
> > > >>>>>>>
> > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > >>>>>>> queued, and prepare the task for an early exit.
> > > >>>>>>>
> > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > >>>>>>>
> > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > >>>>>>>
> > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > >>>>>>> in this thread:
> > > >>>>>>>
> > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > >>>>>>>
> > > >>>>>>> Long Background:
> > > >>>>>>>
> > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > >>>>>>> process would exit dropping all references to the device and then the
> > > >>>>>>> file operation's release function would be called. From there we would
> > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > >>>>>>
> > > >>>>>> A dumb question.
> > > >>>>>>
> > > >>>>>> Is this a user space noticeable change? For example, with this series
> > > >>>>>> a SIGKILL may shutdown the datapath ...
> > > >>>>>
> > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > >>>>
> > > >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> > > >>>> is just send to vhost thread.
> > > >>>
> > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > >>> We used kthreads which are special and can ignore it like how userspace
> > > >>> can ignore SIGHUP.
> > > >>>
> > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > >>> ignores it like I described below where, the signal is still delivered
> > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > >>> kernel forces them to exit.
> > > >>
> > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > >> group as the owner now.
> > > >>
> > > >>>
> > > >>>>
> > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > >>>> a watchdog).
> > > >>>>
> > > >>>
> > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > >>
> > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > >>
> > > >>> Or
> > > >>> are you saying the watchdog in the guest can detect signals that the
> > > >>> host gets?
> > > >>>
> > > >>>
> > > >>>>>
> > > >>>>> Here are a lots of details:
> > > >>>>>
> > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > >>>>> can ignore all signals.
> > > >>>>>
> > > >>>>> You could think of it as the worker is a completely different process than
> > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > >>>>
> > > >>>> Yes.
> > > >>>>
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > >>>>> the vhost worker.
> > > >>>>
> > > >>>> Right.
> > > >>>>
> > > >>>>>
> > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > >>>>> can be thought of as a thread within the userspace process. With that
> > > >>>>> change we have the same signal handler as the userspace process.
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > >>>>> then just continue to process works until file_operations->release
> > > >>>>> calls
> > > >>>>
> > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > >>>
> > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > >>>
> > > >>> The signal is still delivered to the userspace threads and they will
> > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > >>> the process and all userspace threads exit like normal because the vhost
> > > >>> task and normal old userspace threads share a signal handler. When
> > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > >>>
> > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > >>
> > > >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> > > >
> > > > Andres (cced) has identified two other possible changes:
> > > >
> > > > 1) doesn't run in the global PID namespace but run in the namespace of owner
> > >
> > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > to work like qemu and iothreads where the iothread would inherit all those
> > > values automatically.
> >
> > Right, but it could be noticed by the userspace, especially for the
> > one that tries to do tweak on the performance.
> >
> > The root cause is the that now we do copy_processs() in the process of
> > Qemu instead of the kthreadd. Which result of the the differences of
> > namespace (I think PID namespace is not the only one we see
> > difference) and others for the vhost task.
>
> Leaking things out of a namespace looks more like a bug.
> If you really have to be pedantic, the thing to add would
> be a namespace flag not a qemu flag. Userspace running inside
> a namespace really must have no say about whether to leak
> info out of it.
>
> > >
> > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > because there was no kernel function for it (like how we didn't inherit v2
> > > cgroups until recently when someone added some code for that).
> > >
> > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > have it's children (vhost thread in this case) in the global namespace.
> > > I'll
> > > look into it.
> >
> > Instead of moving vhost thread between difference namespaces, I wonder
> > if the following is simpler:
> >
> > if (new_flag)
> >     vhost_task_create()
> > else
> >     kthread_create()
> >
> > New flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> > scheduling attributes ...) which is what we want. Without the new
> > flag, we stick exactly to the behaviour as in the past to unbreak
> > existing userspace.
> >
> > >
> > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner
> > >
> > > Same as above for this one. I thought I was fixing a bug where before
> > > we had to manually tune the vhost thread's values but for iothreads they
> > > automatically got setup.
> > >
> > > Just to clarify this one. When we used kthreads, kthread() will reset the
> > > scheduler priority for the kthread that's created, so we got the default
> > > values instead of inheriting kthreadd's values.  So we would want:
> > >
> > > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > >
> > > in vhost_task_fn() instead of inheriting kthreadd's values.
> > >
> > > >
> > > > Though such a change makes more sense for some use cases, it may break others.
> > > >
> > > > I wonder if we need to introduce a new flag and bring the old kthread
> > >
> > > Do you mean something like a module param?
> >
> > This requires the management layer to know if it has a new user space
> > or not which is hard. A better place is to introduce backend features.
> >
> > >
> > > > codes if the flag is not set? Then we would not end up trying to align
> > > > the behaviour?
> > > >
> > >
> > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > part might be more difficult, but I will look into it if you want it.
> >
> > Thanks a lot. I think it would be better to have the namespace part
> > (as well as other namespaces) then we don't need to answer hard
> > questions like if it can break user space or not.
> >
> > >
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-18  7:07                 ` Michael S. Tsirkin
  2024-04-18  9:25                   ` Andreas Karis
@ 2024-04-19  0:33                   ` Jason Wang
  1 sibling, 0 replies; 32+ messages in thread
From: Jason Wang @ 2024-04-19  0:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Mike Christie, oleg, ebiederm, virtualization, sgarzare,
	stefanha, brauner, Andreas Karis, Laurent Vivier

On Thu, Apr 18, 2024 at 3:07 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > <michael.christie@oracle.com> wrote:
> > >
> > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >>
> > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > >>>
> > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > >>>> <michael.christie@oracle.com> wrote:
> > > >>>>>
> > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > >>>>>> <michael.christie@oracle.com> wrote:
> > > >>>>>>>
> > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > >>>>>>> queued, and prepare the task for an early exit.
> > > >>>>>>>
> > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > >>>>>>>
> > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > >>>>>>>
> > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > >>>>>>> in this thread:
> > > >>>>>>>
> > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > >>>>>>>
> > > >>>>>>> Long Background:
> > > >>>>>>>
> > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > >>>>>>> process would exit dropping all references to the device and then the
> > > >>>>>>> file operation's release function would be called. From there we would
> > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > >>>>>>
> > > >>>>>> A dumb question.
> > > >>>>>>
> > > >>>>>> Is this a user space noticeable change? For example, with this series
> > > >>>>>> a SIGKILL may shutdown the datapath ...
> > > >>>>>
> > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > >>>>
> > > >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> > > >>>> is just send to vhost thread.
> > > >>>
> > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > >>> We used kthreads which are special and can ignore it like how userspace
> > > >>> can ignore SIGHUP.
> > > >>>
> > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > >>> ignores it like I described below where, the signal is still delivered
> > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > >>> kernel forces them to exit.
> > > >>
> > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > >> group as the owner now.
> > > >>
> > > >>>
> > > >>>>
> > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > >>>> a watchdog).
> > > >>>>
> > > >>>
> > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > >>
> > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > >>
> > > >>> Or
> > > >>> are you saying the watchdog in the guest can detect signals that the
> > > >>> host gets?
> > > >>>
> > > >>>
> > > >>>>>
> > > >>>>> Here are a lots of details:
> > > >>>>>
> > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > >>>>> can ignore all signals.
> > > >>>>>
> > > >>>>> You could think of it as the worker is a completely different process than
> > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > >>>>
> > > >>>> Yes.
> > > >>>>
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > >>>>> the vhost worker.
> > > >>>>
> > > >>>> Right.
> > > >>>>
> > > >>>>>
> > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > >>>>> can be thought of as a thread within the userspace process. With that
> > > >>>>> change we have the same signal handler as the userspace process.
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > >>>>> then just continue to process works until file_operations->release
> > > >>>>> calls
> > > >>>>
> > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > >>>
> > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > >>>
> > > >>> The signal is still delivered to the userspace threads and they will
> > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > >>> the process and all userspace threads exit like normal because the vhost
> > > >>> task and normal old userspace threads share a signal handler. When
> > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > >>>
> > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > >>
> > > >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> > > >
> > > > Andres (cced) has identified two other possible changes:
> > > >
> > > > 1) doesn't run in the global PID namespace but run in the namespace of owner
> > >
> > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > to work like qemu and iothreads where the iothread would inherit all those
> > > values automatically.
> >
> > Right, but it could be noticed by the userspace, especially for the
> > one that tries to do tweak on the performance.
> >
> > The root cause is the that now we do copy_processs() in the process of
> > Qemu instead of the kthreadd. Which result of the the differences of
> > namespace (I think PID namespace is not the only one we see
> > difference) and others for the vhost task.
>
> Leaking things out of a namespace looks more like a bug.

Well, I'm not sure if it's a bug, as a kthread inherits the namespace of its
parent, which is kthreadd. It follows the semantics of copy_process and
there are plenty of users now. The problem is that it's too late to
judge whether it is a bug, as it has been noticed by userspace. And
the namespace is not the only thing that could be noticed by
userspace; any assumption about the kthreadd attributes will break.

Instead of waiting for complaints about the new behaviour, why not
simply stick to the old behaviour and use the flag for the new one?
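
To make the user-visible difference concrete, a small illustration
(only a sketch, not code from this thread): on a 6.4+ kernel the
worker created by VHOST_SET_OWNER appears as a "vhost-<pid>" thread of
the caller, inside its thread group and PID namespace, whereas before
it was a kthreadd child visible only from the global namespace.

    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/vhost.h>

    int main(void)
    {
            char path[64], comm[32];
            struct dirent *de;
            DIR *d;
            int fd = open("/dev/vhost-net", O_RDWR);

            if (fd < 0 || ioctl(fd, VHOST_SET_OWNER) < 0) {
                    perror("vhost-net");
                    return 1;
            }
            /* On 6.4+ one of these tids is the vhost worker. */
            d = opendir("/proc/self/task");
            while (d && (de = readdir(d))) {
                    FILE *f;

                    if (de->d_name[0] == '.')
                            continue;
                    snprintf(path, sizeof(path),
                             "/proc/self/task/%s/comm", de->d_name);
                    f = fopen(path, "r");
                    if (f && fgets(comm, sizeof(comm), f))
                            printf("tid %s comm %s", de->d_name, comm);
                    if (f)
                            fclose(f);
            }
            if (d)
                    closedir(d);
            close(fd);
            return 0;
    }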

Thanks


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-18  9:25                   ` Andreas Karis
@ 2024-04-19  0:37                     ` Jason Wang
  2024-04-19  0:40                       ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-19  0:37 UTC (permalink / raw)
  To: Andreas Karis
  Cc: Michael S. Tsirkin, Mike Christie, oleg, ebiederm,
	virtualization, sgarzare, stefanha, brauner, Laurent Vivier

On Thu, Apr 18, 2024 at 5:26 PM Andreas Karis <akaris@redhat.com> wrote:
>
> Not a kernel developer, so sorry if this here is pointless spam.
>
> But qemu is not the only thing consuming /dev/vhost-net:
> https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html
>
> Before the series of patches around
> https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
> you would run a DPDK application inside namespaces of container
> "blue", but then the vhost-... threads were spawned as children of
> kthreadd and run in the global namespaces. That seemed counter
> intuitive, and what's worse, the DPDK application running inside the
> "blue" container namespaces cannot see and thus cannot modify
> scheduling attributes and affinity of its own vhost-... threads. In
> scenarios with pretty strict isolation/pinning requirements, the
> vhost-... threads could easily be scheduled on the same CPUs as the
> DPDK poll mode drivers (because they inherit the calling process' CPU
> affinity), and there's no way for the DPDK process itself to move the
> vhost-... threads to other CPUs. Also, if the DPDK process ran as
> SCHED_RR, the vhost-.... thread would still be SCHED_NORMAL.
>
> After the patch series, the fact that the vhost-... threads run as
> tasks of the process that requests them seems more natural and gives
> us as users the kind of control that we'd want from within the
> container to modify the vhost-... threads' scheduling and affinity.
> The vhost-... thread as a child of the DPDK application inherits the
> same scheduling class, CPU set, etc and the DPDK process can easily
> change those attributes.
>
> However, if another user was used to the old behavior, and their
> entire tooling was created for the old behavior (imagine someone wrote
> a bunch of scripts/services to force affinity from the global PID
> namespace for those vhost-... threads), and now this change was
> introduced, it would break their tooling. So because _I_ might fancy
> the new behavior, but user _B_ might be all set up for the old
> kthreadd, shouldn't there be a flag or configuration for the user to:
>
> if (new_flag)
>     vhost_task_create()
> else
>     kthread_create()
>
> I don't know what kind of flag or knob I'd expect here, but it could
> be granular for the calling process (a new syscall?), or a kernel
> flag, etc. But something that let's the admin choose how the kernel
> spawns the vhost-... threads?

A flag via uAPI that could be controlled by the user of vhost-net.
For example, it could be done via the set/get_backend_features()
syscall.
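
As a rough sketch of what that negotiation could look like from
userspace (the two backend-feature ioctls below are existing uAPI; the
VHOST_BACKEND_F_KTHREAD_WORKER bit is made up purely for illustration):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* Hypothetical feature bit, named here only to show the idea. */
    #define VHOST_BACKEND_F_KTHREAD_WORKER  (1ULL << 63)

    static int request_kthread_worker(int vhost_fd)
    {
            uint64_t features;

            if (ioctl(vhost_fd, VHOST_GET_BACKEND_FEATURES, &features) < 0)
                    return -1;
            if (!(features & VHOST_BACKEND_F_KTHREAD_WORKER))
                    return -1;  /* kernel does not offer the old behaviour */

            features = VHOST_BACKEND_F_KTHREAD_WORKER;
            return ioctl(vhost_fd, VHOST_SET_BACKEND_FEATURES, &features);
    }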

Thanks

>
> On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > > <michael.christie@oracle.com> wrote:
> > > >
> > > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >>
> > > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > > >>>
> > > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > > >>>> <michael.christie@oracle.com> wrote:
> > > > >>>>>
> > > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > > >>>>>> <michael.christie@oracle.com> wrote:
> > > > >>>>>>>
> > > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > > >>>>>>> queued, and prepare the task for an early exit.
> > > > >>>>>>>
> > > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > > >>>>>>>
> > > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > > >>>>>>>
> > > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > > >>>>>>> in this thread:
> > > > >>>>>>>
> > > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > > >>>>>>>
> > > > >>>>>>> Long Background:
> > > > >>>>>>>
> > > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > > >>>>>>> process would exit dropping all references to the device and then the
> > > > >>>>>>> file operation's release function would be called. From there we would
> > > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > > >>>>>>
> > > > >>>>>> A dumb question.
> > > > >>>>>>
> > > > >>>>>> Is this a user space noticeable change? For example, with this series
> > > > >>>>>> a SIGKILL may shutdown the datapath ...
> > > > >>>>>
> > > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > > >>>>
> > > > >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> > > > >>>> is just send to vhost thread.
> > > > >>>
> > > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > > >>> We used kthreads which are special and can ignore it like how userspace
> > > > >>> can ignore SIGHUP.
> > > > >>>
> > > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > > >>> ignores it like I described below where, the signal is still delivered
> > > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > > >>> kernel forces them to exit.
> > > > >>
> > > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > > >> group as the owner now.
> > > > >>
> > > > >>>
> > > > >>>>
> > > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > > >>>> a watchdog).
> > > > >>>>
> > > > >>>
> > > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > > >>
> > > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > > >>
> > > > >>> Or
> > > > >>> are you saying the watchdog in the guest can detect signals that the
> > > > >>> host gets?
> > > > >>>
> > > > >>>
> > > > >>>>>
> > > > >>>>> Here are a lots of details:
> > > > >>>>>
> > > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > > >>>>> can ignore all signals.
> > > > >>>>>
> > > > >>>>> You could think of it as the worker is a completely different process than
> > > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > > >>>>
> > > > >>>> Yes.
> > > > >>>>
> > > > >>>>>
> > > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > > >>>>> the vhost worker.
> > > > >>>>
> > > > >>>> Right.
> > > > >>>>
> > > > >>>>>
> > > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > > >>>>> can be thought of as a thread within the userspace process. With that
> > > > >>>>> change we have the same signal handler as the userspace process.
> > > > >>>>>
> > > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > > >>>>>
> > > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > > >>>>> then just continue to process works until file_operations->release
> > > > >>>>> calls
> > > > >>>>
> > > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > > >>>
> > > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > > >>>
> > > > >>> The signal is still delivered to the userspace threads and they will
> > > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > > >>> the process and all userspace threads exit like normal because the vhost
> > > > >>> task and normal old userspace threads share a signal handler. When
> > > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > > >>>
> > > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > > >>
> > > > >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> > > > >
> > > > > Andres (cced) has identified two other possible changes:
> > > > >
> > > > > 1) doesn't run in the global PID namespace but run in the namespace of owner
> > > >
> > > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > > to work like qemu and iothreads where the iothread would inherit all those
> > > > values automatically.
> > >
> > > Right, but it could be noticed by the userspace, especially for the
> > > one that tries to do tweak on the performance.
> > >
> > > The root cause is the that now we do copy_processs() in the process of
> > > Qemu instead of the kthreadd. Which result of the the differences of
> > > namespace (I think PID namespace is not the only one we see
> > > difference) and others for the vhost task.
> >
> > Leaking things out of a namespace looks more like a bug.
> > If you really have to be pedantic, the thing to add would
> > be a namespace flag not a qemu flag. Userspace running inside
> > a namespace really must have no say about whether to leak
> > info out of it.
> >
> > > >
> > > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > > because there was no kernel function for it (like how we didn't inherit v2
> > > > cgroups until recently when someone added some code for that).
> > > >
> > > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > > have it's children (vhost thread in this case) in the global namespace.
> > > > I'll
> > > > look into it.
> > >
> > > Instead of moving vhost thread between difference namespaces, I wonder
> > > if the following is simpler:
> > >
> > > if (new_flag)
> > >     vhost_task_create()
> > > else
> > >     kthread_create()
> > >
> > > New flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> > > scheduling attributes ...) which is what we want. Without the new
> > > flag, we stick exactly to the behaviour as in the past to unbreak
> > > existing userspace.
> > >
> > > >
> > > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner
> > > >
> > > > Same as above for this one. I thought I was fixing a bug where before
> > > > we had to manually tune the vhost thread's values but for iothreads they
> > > > automatically got setup.
> > > >
> > > > Just to clarify this one. When we used kthreads, kthread() will reset the
> > > > scheduler priority for the kthread that's created, so we got the default
> > > > values instead of inheriting kthreadd's values.  So we would want:
> > > >
> > > > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > > >
> > > > in vhost_task_fn() instead of inheriting kthreadd's values.
> > > >
> > > > >
> > > > > Though such a change makes more sense for some use cases, it may break others.
> > > > >
> > > > > I wonder if we need to introduce a new flag and bring the old kthread
> > > >
> > > > Do you mean something like a module param?
> > >
> > > This requires the management layer to know if it has a new user space
> > > or not which is hard. A better place is to introduce backend features.
> > >
> > > >
> > > > > codes if the flag is not set? Then we would not end up trying to align
> > > > > the behaviour?
> > > > >
> > > >
> > > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > > part might be more difficult, but I will look into it if you want it.
> > >
> > > Thanks a lot. I think it would be better to have the namespace part
> > > (as well as other namespaces) then we don't need to answer hard
> > > questions like if it can break user space or not.
> > >
> > > >
> >
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-19  0:37                     ` Jason Wang
@ 2024-04-19  0:40                       ` Jason Wang
  2024-05-15  6:27                         ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-04-19  0:40 UTC (permalink / raw)
  To: Andreas Karis
  Cc: Michael S. Tsirkin, Mike Christie, oleg, ebiederm,
	virtualization, sgarzare, stefanha, brauner, Laurent Vivier

On Fri, Apr 19, 2024 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Apr 18, 2024 at 5:26 PM Andreas Karis <akaris@redhat.com> wrote:
> >
> > Not a kernel developer, so sorry if this here is pointless spam.
> >
> > But qemu is not the only thing consuming /dev/vhost-net:
> > https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html
> >
> > Before the series of patches around
> > https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
> > you would run a DPDK application inside namespaces of container
> > "blue", but then the vhost-... threads were spawned as children of
> > kthreadd and run in the global namespaces. That seemed counter
> > intuitive, and what's worse, the DPDK application running inside the
> > "blue" container namespaces cannot see and thus cannot modify
> > scheduling attributes and affinity of its own vhost-... threads. In
> > scenarios with pretty strict isolation/pinning requirements, the
> > vhost-... threads could easily be scheduled on the same CPUs as the
> > DPDK poll mode drivers (because they inherit the calling process' CPU
> > affinity), and there's no way for the DPDK process itself to move the
> > vhost-... threads to other CPUs. Also, if the DPDK process ran as
> > SCHED_RR, the vhost-.... thread would still be SCHED_NORMAL.
> >
> > After the patch series, the fact that the vhost-... threads run as
> > tasks of the process that requests them seems more natural and gives
> > us as users the kind of control that we'd want from within the
> > container to modify the vhost-... threads' scheduling and affinity.
> > The vhost-... thread as a child of the DPDK application inherits the
> > same scheduling class, CPU set, etc and the DPDK process can easily
> > change those attributes.
> >
> > However, if another user was used to the old behavior, and their
> > entire tooling was created for the old behavior (imagine someone wrote
> > a bunch of scripts/services to force affinity from the global PID
> > namespace for those vhost-... threads), and now this change was
> > introduced, it would break their tooling. So because _I_ might fancy
> > the new behavior, but user _B_ might be all set up for the old
> > kthreadd, shouldn't there be a flag or configuration for the user to:
> >
> > if (new_flag)
> >     vhost_task_create()
> > else
> >     kthread_create()
> >
> > I don't know what kind of flag or knob I'd expect here, but it could
> > be granular for the calling process (a new syscall?), or a kernel
> > flag, etc. But something that let's the admin choose how the kernel
> > spawns the vhost-... threads?
>
> A flag via uAPI that could be controlled by the user of vhost-net via
> syscall. For example, it could be done via set/get_backend_feautre()
> syscall.

So it can be controlled by the admin if we want.

Thanks

>
> Thanks
>
> >
> > On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > > > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > > > <michael.christie@oracle.com> wrote:
> > > > >
> > > > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >>
> > > > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > > > >>>
> > > > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > > > >>>> <michael.christie@oracle.com> wrote:
> > > > > >>>>>
> > > > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > > > >>>>>> <michael.christie@oracle.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > > > >>>>>>> queued, and prepare the task for an early exit.
> > > > > >>>>>>>
> > > > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > > > >>>>>>>
> > > > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > > > >>>>>>>
> > > > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > > > >>>>>>> in this thread:
> > > > > >>>>>>>
> > > > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > > > >>>>>>>
> > > > > >>>>>>> Long Background:
> > > > > >>>>>>>
> > > > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > > > >>>>>>> process would exit dropping all references to the device and then the
> > > > > >>>>>>> file operation's release function would be called. From there we would
> > > > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > > > >>>>>>
> > > > > >>>>>> A dumb question.
> > > > > >>>>>>
> > > > > >>>>>> Is this a user space noticeable change? For example, with this series
> > > > > >>>>>> a SIGKILL may shutdown the datapath ...
> > > > > >>>>>
> > > > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > > > >>>>
> > > > > >>>> If I understand correctly, for example Qemu can still live is SIGKILL
> > > > > >>>> is just send to vhost thread.
> > > > > >>>
> > > > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > > > >>> We used kthreads which are special and can ignore it like how userspace
> > > > > >>> can ignore SIGHUP.
> > > > > >>>
> > > > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > > > >>> ignores it like I described below where, the signal is still delivered
> > > > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > > > >>> kernel forces them to exit.
> > > > > >>
> > > > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > > > >> group as the owner now.
> > > > > >>
> > > > > >>>
> > > > > >>>>
> > > > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > > > >>>> a watchdog).
> > > > > >>>>
> > > > > >>>
> > > > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > > > >>
> > > > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > > > >>
> > > > > >>> Or
> > > > > >>> are you saying the watchdog in the guest can detect signals that the
> > > > > >>> host gets?
> > > > > >>>
> > > > > >>>
> > > > > >>>>>
> > > > > >>>>> Here are a lots of details:
> > > > > >>>>>
> > > > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > > > >>>>> can ignore all signals.
> > > > > >>>>>
> > > > > >>>>> You could think of it as the worker is a completely different process than
> > > > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > > > >>>>
> > > > > >>>> Yes.
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > > > >>>>> the vhost worker.
> > > > > >>>>
> > > > > >>>> Right.
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > > > >>>>> can be thought of as a thread within the userspace process. With that
> > > > > >>>>> change we have the same signal handler as the userspace process.
> > > > > >>>>>
> > > > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > > > >>>>>
> > > > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > > > >>>>> then just continue to process works until file_operations->release
> > > > > >>>>> calls
> > > > > >>>>
> > > > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > > > >>>
> > > > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > > > >>>
> > > > > >>> The signal is still delivered to the userspace threads and they will
> > > > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > > > >>> the process and all userspace threads exit like normal because the vhost
> > > > > >>> task and normal old userspace threads share a signal handler. When
> > > > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > > > >>>
> > > > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > > > >>
> > > > > >> Yes. To say the truth, it looks even worse but it might be too late to fix.
> > > > > >
> > > > > > Andres (cced) has identified two other possible changes:
> > > > > >
> > > > > > 1) doesn't run in the global PID namespace but run in the namespace of owner
> > > > >
> > > > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > > > to work like qemu and iothreads where the iothread would inherit all those
> > > > > values automatically.
> > > >
> > > > Right, but it could be noticed by the userspace, especially for the
> > > > one that tries to do tweak on the performance.
> > > >
> > > > The root cause is the that now we do copy_processs() in the process of
> > > > Qemu instead of the kthreadd. Which result of the the differences of
> > > > namespace (I think PID namespace is not the only one we see
> > > > difference) and others for the vhost task.
> > >
> > > Leaking things out of a namespace looks more like a bug.
> > > If you really have to be pedantic, the thing to add would
> > > be a namespace flag not a qemu flag. Userspace running inside
> > > a namespace really must have no say about whether to leak
> > > info out of it.
> > >
> > > > >
> > > > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > > > because there was no kernel function for it (like how we didn't inherit v2
> > > > > cgroups until recently when someone added some code for that).
> > > > >
> > > > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > > > have it's children (vhost thread in this case) in the global namespace.
> > > > > I'll
> > > > > look into it.
> > > >
> > > > Instead of moving vhost thread between difference namespaces, I wonder
> > > > if the following is simpler:
> > > >
> > > > if (new_flag)
> > > >     vhost_task_create()
> > > > else
> > > >     kthread_create()
> > > >
> > > > New flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> > > > scheduling attributes ...) which is what we want. Without the new
> > > > flag, we stick exactly to the behaviour as in the past to unbreak
> > > > existing userspace.
> > > >
> > > > >
> > > > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner
> > > > >
> > > > > Same as above for this one. I thought I was fixing a bug where before
> > > > > we had to manually tune the vhost thread's values but for iothreads they
> > > > > automatically got setup.
> > > > >
> > > > > Just to clarify this one. When we used kthreads, kthread() will reset the
> > > > > scheduler priority for the kthread that's created, so we got the default
> > > > > values instead of inheriting kthreadd's values.  So we would want:
> > > > >
> > > > > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > > > >
> > > > > in vhost_task_fn() instead of inheriting kthreadd's values.
> > > > >
> > > > > >
> > > > > > Though such a change makes more sense for some use cases, it may break others.
> > > > > >
> > > > > > I wonder if we need to introduce a new flag and bring the old kthread
> > > > >
> > > > > Do you mean something like a module param?
> > > >
> > > > This requires the management layer to know if it has a new user space
> > > > or not which is hard. A better place is to introduce backend features.
> > > >
> > > > >
> > > > > > codes if the flag is not set? Then we would not end up trying to align
> > > > > > the behaviour?
> > > > > >
> > > > >
> > > > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > > > part might be more difficult, but I will look into it if you want it.
> > > >
> > > > Thanks a lot. I think it would be better to have the namespace part
> > > > (as well as other namespaces) then we don't need to answer hard
> > > > questions like if it can break user space or not.
> > > >
> > > > >
> > >
> >


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-04-19  0:40                       ` Jason Wang
@ 2024-05-15  6:27                         ` Jason Wang
  2024-05-15  7:24                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2024-05-15  6:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andreas Karis, Mike Christie, oleg, ebiederm, virtualization,
	sgarzare, stefanha, brauner, Laurent Vivier

On Fri, Apr 19, 2024 at 8:40 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Apr 19, 2024 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 5:26 PM Andreas Karis <akaris@redhat.com> wrote:
> > >
> > > Not a kernel developer, so sorry if this here is pointless spam.
> > >
> > > But qemu is not the only thing consuming /dev/vhost-net:
> > > https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html
> > >
> > > Before the series of patches around
> > > https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
> > > you would run a DPDK application inside namespaces of container
> > > "blue", but then the vhost-... threads were spawned as children of
> > > kthreadd and run in the global namespaces. That seemed counter
> > > intuitive, and what's worse, the DPDK application running inside the
> > > "blue" container namespaces cannot see and thus cannot modify
> > > scheduling attributes and affinity of its own vhost-... threads. In
> > > scenarios with pretty strict isolation/pinning requirements, the
> > > vhost-... threads could easily be scheduled on the same CPUs as the
> > > DPDK poll mode drivers (because they inherit the calling process' CPU
> > > affinity), and there's no way for the DPDK process itself to move the
> > > vhost-... threads to other CPUs. Also, if the DPDK process ran as
> > > SCHED_RR, the vhost-.... thread would still be SCHED_NORMAL.
> > >
> > > After the patch series, the fact that the vhost-... threads run as
> > > tasks of the process that requests them seems more natural and gives
> > > us as users the kind of control that we'd want from within the
> > > container to modify the vhost-... threads' scheduling and affinity.
> > > The vhost-... thread as a child of the DPDK application inherits the
> > > same scheduling class, CPU set, etc and the DPDK process can easily
> > > change those attributes.
> > >
> > > However, if another user was used to the old behavior, and their
> > > entire tooling was created for the old behavior (imagine someone wrote
> > > a bunch of scripts/services to force affinity from the global PID
> > > namespace for those vhost-... threads), and now this change was
> > > introduced, it would break their tooling. So because _I_ might fancy
> > > the new behavior, but user _B_ might be all set up for the old
> > > kthreadd, shouldn't there be a flag or configuration for the user to:
> > >
> > > if (new_flag)
> > >     vhost_task_create()
> > > else
> > >     kthread_create()
> > >
> > > I don't know what kind of flag or knob I'd expect here, but it could
> > > be granular for the calling process (a new syscall?), or a kernel
> > > flag, etc. But something that lets the admin choose how the kernel
> > > spawns the vhost-... threads?
> >
> > A flag via uAPI that could be controlled by the user of vhost-net via
> > syscall. For example, it could be done via set/get_backend_feature()
> > syscall.
>
> So it can be controlled by the admin if we want.
>
> Thanks

Michael, does the flag make sense or not?

Thanks

>
> >
> > Thanks
> >
> > >
> > > On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > > > > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > > > > <michael.christie@oracle.com> wrote:
> > > > > >
> > > > > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >>
> > > > > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > > > > >>>
> > > > > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > > > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > > > > >>>> <michael.christie@oracle.com> wrote:
> > > > > > >>>>>
> > > > > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > > > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > > > > >>>>>> <michael.christie@oracle.com> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > > > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > > > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > > > > >>>>>>> queued, and prepare the task for an early exit.
> > > > > > >>>>>>>
> > > > > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > > > > >>>>>>>
> > > > > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > > > > >>>>>>>
> > > > > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > > > > >>>>>>> in this thread:
> > > > > > >>>>>>>
> > > > > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > > > > >>>>>>>
> > > > > > >>>>>>> Long Background:
> > > > > > >>>>>>>
> > > > > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > > > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > > > > >>>>>>> process would exit dropping all references to the device and then the
> > > > > > >>>>>>> file operation's release function would be called. From there we would
> > > > > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > > > > >>>>>>
> > > > > > >>>>>> A dumb question.
> > > > > > >>>>>>
> > > > > > >>>>>> Is this a user space noticeable change? For example, with this series
> > > > > > >>>>>> a SIGKILL may shutdown the datapath ...
> > > > > > >>>>>
> > > > > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > > > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > > > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > > > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > > > > >>>>
> > > > > > >>>> If I understand correctly, for example Qemu can still live if SIGKILL
> > > > > > >>>> is just sent to the vhost thread.
> > > > > > >>>
> > > > > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > > > > >>> We used kthreads which are special and can ignore it like how userspace
> > > > > > >>> can ignore SIGHUP.
> > > > > > >>>
> > > > > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > > > > >>> ignores it like I described below, where the signal is still delivered
> > > > > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > > > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > > > > >>> kernel forces them to exit.
> > > > > > >>
> > > > > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > > > > >> group as the owner now.
> > > > > > >>
> > > > > > >>>
> > > > > > >>>>
> > > > > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > > > > >>>> a watchdog).
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > > > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > > > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > > > > >>
> > > > > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > > > > >>
> > > > > > >>> Or
> > > > > > >>> are you saying the watchdog in the guest can detect signals that the
> > > > > > >>> host gets?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>>>
> > > > > > >>>>> Here are a lot of details:
> > > > > > >>>>>
> > > > > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > > > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > > > > >>>>> can ignore all signals.
> > > > > > >>>>>
> > > > > > >>>>> You could think of it as the worker is a completely different process than
> > > > > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > > > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > > > > >>>>
> > > > > > >>>> Yes.
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > > > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > > > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > > > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > > > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > > > > >>>>> the vhost worker.
> > > > > > >>>>
> > > > > > >>>> Right.
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > > > > >>>>> can be thought of as a thread within the userspace process. With that
> > > > > > >>>>> change we have the same signal handler as the userspace process.
> > > > > > >>>>>
> > > > > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > > > > >>>>>
> > > > > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > > > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > > > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > > > > >>>>> then just continue to process works until file_operations->release
> > > > > > >>>>> calls
> > > > > > >>>>
> > > > > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > > > > >>>
> > > > > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > > > > >>>
> > > > > > >>> The signal is still delivered to the userspace threads and they will
> > > > > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > > > > >>> the process and all userspace threads exit like normal because the vhost
> > > > > > >>> task and normal old userspace threads share a signal handler. When
> > > > > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > > > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > > > > >>>
> > > > > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > > > > >>
> > > > > > >> Yes. To tell the truth, it looks even worse but it might be too late to fix.
> > > > > > >
> > > > > > > Andres (cced) has identified two other possible changes:
> > > > > > >
> > > > > > > 1) doesn't run in the global PID namespace but runs in the namespace of the owner
> > > > > >
> > > > > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > > > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > > > > to work like qemu and iothreads where the iothread would inherit all those
> > > > > > values automatically.
> > > > >
> > > > > Right, but it could be noticed by the userspace, especially for the
> > > > > one that tries to tweak the performance.
> > > > >
> > > > > The root cause is that now we do copy_process() in the process of
> > > > > Qemu instead of kthreadd, which results in differences of
> > > > > namespace (I think the PID namespace is not the only one where we see
> > > > > a difference) and other attributes for the vhost task.
> > > >
> > > > Leaking things out of a namespace looks more like a bug.
> > > > If you really have to be pedantic, the thing to add would
> > > > be a namespace flag not a qemu flag. Userspace running inside
> > > > a namespace really must have no say about whether to leak
> > > > info out of it.
> > > >
> > > > > >
> > > > > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > > > > because there was no kernel function for it (like how we didn't inherit v2
> > > > > > cgroups until recently when someone added some code for that).
> > > > > >
> > > > > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > > > > have its children (vhost thread in this case) in the global namespace.
> > > > > > I'll
> > > > > > look into it.
> > > > >
> > > > > Instead of moving the vhost thread between different namespaces, I wonder
> > > > > if the following is simpler:
> > > > >
> > > > > if (new_flag)
> > > > >     vhost_task_create()
> > > > > else
> > > > >     kthread_create()
> > > > >
> > > > > New flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> > > > > scheduling attributes ...) which is what we want. Without the new
> > > > > flag, we stick exactly to the behaviour as in the past to unbreak
> > > > > existing userspace.
> > > > >
> > > > > >
> > > > > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner's
> > > > > >
> > > > > > Same as above for this one. I thought I was fixing a bug where before
> > > > > > we had to manually tune the vhost thread's values but for iothreads they
> > > > > > automatically got setup.
> > > > > >
> > > > > > Just to clarify this one. When we used kthreads, kthread() would reset the
> > > > > > scheduler priority for the kthread that's created, so we got the default
> > > > > > values instead of inheriting kthreadd's values.  So we would want:
> > > > > >
> > > > > > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > > > > >
> > > > > > in vhost_task_fn() instead of inheriting kthreadd's values.
> > > > > >
> > > > > > >
> > > > > > > Though such a change makes more sense for some use cases, it may break others.
> > > > > > >
> > > > > > > I wonder if we need to introduce a new flag and bring the old kthread
> > > > > >
> > > > > > Do you mean something like a module param?
> > > > >
> > > > > This requires the management layer to know if it has a new user space
> > > > > or not which is hard. A better place is to introduce backend features.
> > > > >
> > > > > >
> > > > > > > codes if the flag is not set? Then we would not end up trying to align
> > > > > > > the behaviour?
> > > > > > >
> > > > > >
> > > > > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > > > > part might be more difficult, but I will look into it if you want it.
> > > > >
> > > > > Thanks a lot. I think it would be better to have the namespace part
> > > > > (as well as other namespaces) so we don't need to answer hard
> > > > > questions like whether it can break user space or not.
> > > > >
> > > > > >
> > > >
> > >


^ permalink raw reply	[flat|nested] 32+ messages in thread
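
To make the point about in-container control concrete: because the worker created by vhost_task_create() is now a thread of the owner, it shows up under the owner's /proc/self/task, so the owning process (a DPDK application, for example) can retune it directly, roughly as sketched below. The "vhost-" comm prefix match, the target CPU and the SCHED_OTHER choice are assumptions made only for the example.

/*
 * Example only: re-tune the vhost worker threads of the calling process.
 * This is meant to be called from inside the process that owns the vhost
 * device; the "vhost-" comm prefix, CPU 2 and SCHED_OTHER below are
 * arbitrary choices for illustration.
 */
#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

static void retune_tid(pid_t tid)
{
    const struct sched_param sp = { .sched_priority = 0 };
    cpu_set_t set;

    /* Keep the worker off the CPUs used by the polling threads. */
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    sched_setaffinity(tid, sizeof(set), &set);

    /* And keep it SCHED_OTHER so it cannot preempt RT PMD threads. */
    sched_setscheduler(tid, SCHED_OTHER, &sp);
}

static void retune_own_vhost_workers(void)
{
    DIR *d = opendir("/proc/self/task");
    struct dirent *de;

    if (!d)
        return;

    while ((de = readdir(d)) != NULL) {
        char path[64], comm[32] = "";
        FILE *f;

        if (de->d_name[0] == '.')
            continue;

        snprintf(path, sizeof(path), "/proc/self/task/%s/comm", de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(comm, sizeof(comm), f) && !strncmp(comm, "vhost-", 6))
            retune_tid((pid_t)atoi(de->d_name));
        fclose(f);
    }
    closedir(d);
}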

* Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
  2024-05-15  6:27                         ` Jason Wang
@ 2024-05-15  7:24                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2024-05-15  7:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Andreas Karis, Mike Christie, oleg, ebiederm, virtualization,
	sgarzare, stefanha, brauner, Laurent Vivier

On Wed, May 15, 2024 at 02:27:32PM +0800, Jason Wang wrote:
> On Fri, Apr 19, 2024 at 8:40 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Apr 19, 2024 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 5:26 PM Andreas Karis <akaris@redhat.com> wrote:
> > > >
> > > > Not a kernel developer, so sorry if this here is pointless spam.
> > > >
> > > > But qemu is not the only thing consuming /dev/vhost-net:
> > > > https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html
> > > >
> > > > Before the series of patches around
> > > > https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
> > > > you would run a DPDK application inside namespaces of container
> > > > "blue", but then the vhost-... threads were spawned as children of
> > > > kthreadd and run in the global namespaces. That seemed counter
> > > > intuitive, and what's worse, the DPDK application running inside the
> > > > "blue" container namespaces cannot see and thus cannot modify
> > > > scheduling attributes and affinity of its own vhost-... threads. In
> > > > scenarios with pretty strict isolation/pinning requirements, the
> > > > vhost-... threads could easily be scheduled on the same CPUs as the
> > > > DPDK poll mode drivers (because they inherit the calling process' CPU
> > > > affinity), and there's no way for the DPDK process itself to move the
> > > > vhost-... threads to other CPUs. Also, if the DPDK process ran as
> > > > SCHED_RR, the vhost-.... thread would still be SCHED_NORMAL.
> > > >
> > > > After the patch series, the fact that the vhost-... threads run as
> > > > tasks of the process that requests them seems more natural and gives
> > > > us as users the kind of control that we'd want from within the
> > > > container to modify the vhost-... threads' scheduling and affinity.
> > > > The vhost-... thread as a child of the DPDK application inherits the
> > > > same scheduling class, CPU set, etc and the DPDK process can easily
> > > > change those attributes.
> > > >
> > > > However, if another user was used to the old behavior, and their
> > > > entire tooling was created for the old behavior (imagine someone wrote
> > > > a bunch of scripts/services to force affinity from the global PID
> > > > namespace for those vhost-... threads), and now this change was
> > > > introduced, it would break their tooling. So because _I_ might fancy
> > > > the new behavior, but user _B_ might be all set up for the old
> > > > kthreadd, shouldn't there be a flag or configuration for the user to:
> > > >
> > > > if (new_flag)
> > > >     vhost_task_create()
> > > > else
> > > >     kthread_create()
> > > >
> > > > I don't know what kind of flag or knob I'd expect here, but it could
> > > > be granular for the calling process (a new syscall?), or a kernel
> > > > flag, etc. But something that lets the admin choose how the kernel
> > > > spawns the vhost-... threads?
> > >
> > > A flag via uAPI that could be controlled by the user of vhost-net via
> > > syscall. For example, it could be done via set/get_backend_feature()
> > > syscall.
> >
> > So it can be controlled by the admin if we want.
> >
> > Thanks
> 
> Michael, does the flag make sense or not?
> 
> Thanks

So you want a flag to bring back the pre-6.4 behaviour?
Given the behaviour has been in an upstream kernel already,
I'd be inclined to say let's wait until there's an actual
report of fallout, and then look for work-arounds.
Otherwise we don't really know if we are even fixing anything.

> >
> > >
> > > Thanks
> > >
> > > >
> > > > On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > > > > > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > > > > > <michael.christie@oracle.com> wrote:
> > > > > > >
> > > > > > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > > > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >>
> > > > > > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > > > > > >>>
> > > > > > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > > > > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > > > > > >>>> <michael.christie@oracle.com> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > > > > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > > > > > >>>>>> <michael.christie@oracle.com> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > > > > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > > > > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > > > > > >>>>>>> queued, and prepare the task for an early exit.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > > > > > >>>>>>> in this thread:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Long Background:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > > > > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > > > > > >>>>>>> process would exit dropping all references to the device and then the
> > > > > > > >>>>>>> file operation's release function would be called. From there we would
> > > > > > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > > > > > >>>>>>
> > > > > > > >>>>>> A dumb question.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Is this a user space noticeable change? For example, with this series
> > > > > > > >>>>>> a SIGKILL may shutdown the datapath ...
> > > > > > > >>>>>
> > > > > > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > > > > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > > > > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > > > > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > > > > > >>>>
> > > > > > > >>>> If I understand correctly, for example Qemu can still live if SIGKILL
> > > > > > > >>>> is just sent to the vhost thread.
> > > > > > > >>>
> > > > > > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > > > > > >>> We used kthreads which are special and can ignore it like how userspace
> > > > > > > >>> can ignore SIGHUP.
> > > > > > > >>>
> > > > > > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > > > > > >>> ignores it like I described below, where the signal is still delivered
> > > > > > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > > > > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > > > > > >>> kernel forces them to exit.
> > > > > > > >>
> > > > > > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > > > > > >> group as the owner now.
> > > > > > > >>
> > > > > > > >>>
> > > > > > > >>>>
> > > > > > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > > > > > >>>> a watchdog).
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > > > > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > > > > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > > > > > >>
> > > > > > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > > > > > >>
> > > > > > > >>> Or
> > > > > > > >>> are you saying the watchdog in the guest can detect signals that the
> > > > > > > >>> host gets?
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>>>
> > > > > > > >>>>> Here are a lot of details:
> > > > > > > >>>>>
> > > > > > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > > > > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > > > > > >>>>> can ignore all signals.
> > > > > > > >>>>>
> > > > > > > >>>>> You could think of it as the worker is a completely different process than
> > > > > > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > > > > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > > > > > >>>>
> > > > > > > >>>> Yes.
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > > > > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > > > > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > > > > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > > > > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > > > > > >>>>> the vhost worker.
> > > > > > > >>>>
> > > > > > > >>>> Right.
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > > > > > >>>>> can be thought of as a thread within the userspace process. With that
> > > > > > > >>>>> change we have the same signal handler as the userspace process.
> > > > > > > >>>>>
> > > > > > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > > > > > >>>>>
> > > > > > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > > > > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > > > > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > > > > > >>>>> then just continue to process works until file_operations->release
> > > > > > > >>>>> calls
> > > > > > > >>>>
> > > > > > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > > > > > >>>
> > > > > > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > > > > > >>>
> > > > > > > >>> The signal is still delivered to the userspace threads and they will
> > > > > > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > > > > > >>> the process and all userspace threads exit like normal because the vhost
> > > > > > > >>> task and normal old userspace threads share a signal handler. When
> > > > > > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > > > > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > > > > > >>>
> > > > > > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > > > > > >>
> > > > > > > >> Yes. To tell the truth, it looks even worse but it might be too late to fix.
> > > > > > > >
> > > > > > > > Andres (cced) has identified two other possible changes:
> > > > > > > >
> > > > > > > > 1) doesn't run in the global PID namespace but runs in the namespace of the owner
> > > > > > >
> > > > > > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > > > > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > > > > > to work like qemu and iothreads where the iothread would inherit all those
> > > > > > > values automatically.
> > > > > >
> > > > > > Right, but it could be noticed by the userspace, especially for the
> > > > > > one that tries to tweak the performance.
> > > > > >
> > > > > > The root cause is that now we do copy_process() in the process of
> > > > > > Qemu instead of kthreadd, which results in differences of
> > > > > > namespace (I think the PID namespace is not the only one where we see
> > > > > > a difference) and other attributes for the vhost task.
> > > > >
> > > > > Leaking things out of a namespace looks more like a bug.
> > > > > If you really have to be pedantic, the thing to add would
> > > > > be a namespace flag not a qemu flag. Userspace running inside
> > > > > a namespace really must have no say about whether to leak
> > > > > info out of it.
> > > > >
> > > > > > >
> > > > > > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > > > > > because there was no kernel function for it (like how we didn't inherit v2
> > > > > > > cgroups until recently when someone added some code for that).
> > > > > > >
> > > > > > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > > > > > have its children (vhost thread in this case) in the global namespace.
> > > > > > > I'll
> > > > > > > look into it.
> > > > > >
> > > > > > Instead of moving the vhost thread between different namespaces, I wonder
> > > > > > if the following is simpler:
> > > > > >
> > > > > > if (new_flag)
> > > > > >     vhost_task_create()
> > > > > > else
> > > > > >     kthread_create()
> > > > > >
> > > > > > New flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> > > > > > scheduling attributes ...) which is what we want. Without the new
> > > > > > flag, we stick exactly to the behaviour as in the past to unbreak
> > > > > > existing userspace.
> > > > > >
> > > > > > >
> > > > > > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner's
> > > > > > >
> > > > > > > Same as above for this one. I thought I was fixing a bug where before
> > > > > > > we had to manually tune the vhost thread's values but for iothreads they
> > > > > > > automatically got setup.
> > > > > > >
> > > > > > > Just to clarify this one. When we used kthreads, kthread() would reset the
> > > > > > > scheduler priority for the kthread that's created, so we got the default
> > > > > > > values instead of inheriting kthreadd's values.  So we would want:
> > > > > > >
> > > > > > > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > > > > > >
> > > > > > > in vhost_task_fn() instead of inheriting kthreadd's values.
> > > > > > >
> > > > > > > >
> > > > > > > > Though such a change makes more sense for some use cases, it may break others.
> > > > > > > >
> > > > > > > > I wonder if we need to introduce a new flag and bring the old kthread
> > > > > > >
> > > > > > > Do you mean something like a module param?
> > > > > >
> > > > > > This requires the management layer to know if it has a new user space
> > > > > > or not which is hard. A better place is to introduce backend features.
> > > > > >
> > > > > > >
> > > > > > > > codes if the flag is not set? Then we would not end up trying to align
> > > > > > > > the behaviour?
> > > > > > > >
> > > > > > >
> > > > > > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > > > > > part might be more difficult, but I will look into it if you want it.
> > > > > >
> > > > > > Thanks a lot. I think it would be better to have the namespace part
> > > > > > (as well as other namespaces) so we don't need to answer hard
> > > > > > questions like whether it can break user space or not.
> > > > > >
> > > > > > >
> > > > >
> > > >


^ permalink raw reply	[flat|nested] 32+ messages in thread
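
For reference, if the knob being discussed were ever added as a backend feature, the userspace side could look roughly like the sketch below, reusing the existing VHOST_GET_BACKEND_FEATURES/VHOST_SET_BACKEND_FEATURES ioctls. VHOST_BACKEND_F_WORKER_KTHREAD is a made-up bit for illustration; no such flag exists in any kernel, and whether it would have to be acked before VHOST_SET_OWNER (which creates the worker) would be part of the actual uAPI design.

/*
 * Hypothetical flow: ask the kernel to keep the old kthread-style worker
 * via a backend feature bit. VHOST_BACKEND_F_WORKER_KTHREAD does not
 * exist; it stands in for whatever bit such a uAPI would define. Current
 * kernels never offer the bit, so the ack below is simply skipped.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

#define VHOST_BACKEND_F_WORKER_KTHREAD (1ULL << 63) /* made up for the example */

int main(void)
{
    uint64_t features = 0;
    int fd = open("/dev/vhost-net", O_RDWR);

    if (fd < 0)
        return 1;

    if (ioctl(fd, VHOST_GET_BACKEND_FEATURES, &features) == 0 &&
        (features & VHOST_BACKEND_F_WORKER_KTHREAD)) {
        /* Ack the bit before the worker is created. */
        features = VHOST_BACKEND_F_WORKER_KTHREAD;
        ioctl(fd, VHOST_SET_BACKEND_FEATURES, &features);
    }

    /* VHOST_SET_OWNER is what actually creates the worker task. */
    ioctl(fd, VHOST_SET_OWNER);

    close(fd);
    return 0;
}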

end of thread, other threads:[~2024-05-15  7:24 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-16  0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
2024-03-16  0:46 ` [PATCH 1/9] vhost-scsi: Handle vhost_vq_work_queue failures for events Mike Christie
2024-03-16  0:47 ` [PATCH 2/9] vhost-scsi: Handle vhost_vq_work_queue failures for cmds Mike Christie
2024-03-16  0:47 ` [PATCH 3/9] vhost-scsi: Use system wq to flush dev for TMFs Mike Christie
2024-03-16  0:47 ` [PATCH 4/9] vhost: Remove vhost_vq_flush Mike Christie
2024-03-16  0:47 ` [PATCH 5/9] vhost_scsi: Handle vhost_vq_work_queue failures for TMFs Mike Christie
2024-03-16  0:47 ` [PATCH 6/9] vhost: Use virtqueue mutex for swapping worker Mike Christie
2024-03-16  0:47 ` [PATCH 7/9] vhost: Release worker mutex during flushes Mike Christie
2024-03-16  0:47 ` [PATCH 8/9] vhost_task: Handle SIGKILL by flushing work and exiting Mike Christie
2024-03-16  0:47 ` [PATCH 9/9] kernel: Remove signal hacks for vhost_tasks Mike Christie
2024-04-09  4:16 ` [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Jason Wang
2024-04-09 14:57   ` Mike Christie
2024-04-09 16:40     ` Michael S. Tsirkin
2024-04-09 21:55       ` michael.christie
2024-04-10  4:21         ` Michael S. Tsirkin
2024-04-18  7:10   ` Michael S. Tsirkin
2024-04-11  8:39 ` Jason Wang
2024-04-11 16:19   ` Mike Christie
2024-04-12  3:28     ` Jason Wang
2024-04-12 16:52       ` michael.christie
2024-04-15  8:52         ` Jason Wang
2024-04-17  3:50           ` Jason Wang
2024-04-17 16:03             ` Mike Christie
2024-04-18  4:08               ` Jason Wang
2024-04-18  7:07                 ` Michael S. Tsirkin
2024-04-18  9:25                   ` Andreas Karis
2024-04-19  0:37                     ` Jason Wang
2024-04-19  0:40                       ` Jason Wang
2024-05-15  6:27                         ` Jason Wang
2024-05-15  7:24                           ` Michael S. Tsirkin
2024-04-19  0:33                   ` Jason Wang
2024-04-18  7:01               ` Michael S. Tsirkin
