* [PATCH v2 0/3] virtio-scsi: fix SCSIDevice hot unplug with IOThread
@ 2023-02-10 14:32 Stefan Hajnoczi
  2023-02-10 14:32 ` [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock Stefan Hajnoczi
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2023-02-10 14:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Paolo Bonzini, David Hildenbrand,
	Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng, Stefan Hajnoczi

Unplugging SCSIDevices when virtio-scsi is using an IOThread suffers from race
conditions:
- scsi_device_purge_requests() is called from the IOThread in TMF emulation.
  This is unsafe; it should only be called from the main loop thread under the BQL.
- SCSIRequest->aiocb is not protected by a lock, so there are races between the
  main loop thread and the IOThread when scsi_device_purge_requests() runs in
  the main loop thread.
- DMAAIOCB->acb is not protected by a lock, so there are races in the DMA
  helpers code when cancelling a request from the main loop thread.
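
The races above all reduce to two threads doing unsynchronized loads and stores on the same in-flight AIO pointer. A minimal pthread sketch of that class of race and its fix (not QEMU code: the plain mutex stands in for the AioContext lock, and `struct req`, `complete_request()`, and `cancel_request()` are hypothetical names):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a SCSIRequest with its in-flight aiocb pointer. */
struct req {
    pthread_mutex_t lock;   /* plays the role of the AioContext lock */
    void *aiocb;            /* non-NULL while an AIO operation is in flight */
};

/* Completion path (IOThread): clears req->aiocb under the lock. */
static void complete_request(struct req *r)
{
    pthread_mutex_lock(&r->lock);
    assert(r->aiocb != NULL);
    r->aiocb = NULL;
    pthread_mutex_unlock(&r->lock);
}

/* Cancellation path (main loop): reads req->aiocb only under the same lock,
 * so it sees either "still in flight" or "already completed", never a
 * half-updated state. */
static int cancel_request(struct req *r)
{
    int cancelled = 0;

    pthread_mutex_lock(&r->lock);
    if (r->aiocb) {
        /* real code would call something like bdrv_aio_cancel_async() here */
        cancelled = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return cancelled;
}
```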

These fixes solve assertion failures during SCSIDevice hot unplug in
virtio-scsi with IOThread. Expanding the use of the AioContext lock isn't great
since we're in the midst of trying to remove it. However, I think this solution
is appropriate so that stable trees or distros can backport the fix without
depending on QEMU multi-queue block layer refactoring.

Special thanks to Qing Wang, who helped me iterate these patches because I
couldn't reproduce the assertion failures myself.

Stefan Hajnoczi (3):
  scsi: protect req->aiocb with AioContext lock
  dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race
  virtio-scsi: reset SCSI devices from main loop thread

 include/hw/virtio/virtio-scsi.h |  11 ++-
 hw/scsi/scsi-disk.c             |  23 +++--
 hw/scsi/scsi-generic.c          |  11 ++-
 hw/scsi/virtio-scsi.c           | 169 +++++++++++++++++++++++++-------
 softmmu/dma-helpers.c           |  12 ++-
 5 files changed, 171 insertions(+), 55 deletions(-)

-- 
2.39.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock
  2023-02-10 14:32 [PATCH v2 0/3] virtio-scsi: fix SCSIDevice hot unplug with IOThread Stefan Hajnoczi
@ 2023-02-10 14:32 ` Stefan Hajnoczi
  2023-02-15 18:17   ` Eric Blake
  2023-02-16 12:34   ` Kevin Wolf
  2023-02-10 14:32 ` [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race Stefan Hajnoczi
  2023-02-10 14:32 ` [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread Stefan Hajnoczi
  2 siblings, 2 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2023-02-10 14:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Paolo Bonzini, David Hildenbrand,
	Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng, Stefan Hajnoczi

If requests are being processed in the IOThread when a SCSIDevice is
unplugged, scsi_device_purge_requests() -> scsi_req_cancel_async() races
with I/O completion callbacks. Both threads load and store req->aiocb.
This can lead to assert(r->req.aiocb == NULL) failures and undefined
behavior.

Protect r->req.aiocb with the AioContext lock to prevent the race.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/scsi/scsi-disk.c    | 23 ++++++++++++++++-------
 hw/scsi/scsi-generic.c | 11 ++++++-----
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index d4e360850f..115584f8b9 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -273,9 +273,11 @@ static void scsi_aio_complete(void *opaque, int ret)
     SCSIDiskReq *r = (SCSIDiskReq *)opaque;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     if (scsi_disk_req_check_error(r, ret, true)) {
         goto done;
     }
@@ -357,10 +359,11 @@ static void scsi_dma_complete(void *opaque, int ret)
     SCSIDiskReq *r = (SCSIDiskReq *)opaque;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
@@ -393,10 +396,11 @@ static void scsi_read_complete(void *opaque, int ret)
     SCSIDiskReq *r = (SCSIDiskReq *)opaque;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
@@ -446,10 +450,11 @@ static void scsi_do_read_cb(void *opaque, int ret)
     SCSIDiskReq *r = (SCSIDiskReq *)opaque;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
@@ -530,10 +535,11 @@ static void scsi_write_complete(void * opaque, int ret)
     SCSIDiskReq *r = (SCSIDiskReq *)opaque;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
@@ -1737,10 +1743,11 @@ static void scsi_unmap_complete(void *opaque, int ret)
     SCSIDiskReq *r = data->r;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (scsi_disk_req_check_error(r, ret, true)) {
         scsi_req_unref(&r->req);
         g_free(data);
@@ -1816,9 +1823,11 @@ static void scsi_write_same_complete(void *opaque, int ret)
     SCSIDiskReq *r = data->r;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
+
     if (scsi_disk_req_check_error(r, ret, true)) {
         goto done;
     }
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 92cce20a4d..ac9fa662b4 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -111,10 +111,11 @@ static void scsi_command_complete(void *opaque, int ret)
     SCSIGenericReq *r = (SCSIGenericReq *)opaque;
     SCSIDevice *s = r->req.dev;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     scsi_command_complete_noio(r, ret);
     aio_context_release(blk_get_aio_context(s->conf.blk));
 }
@@ -269,11 +270,11 @@ static void scsi_read_complete(void * opaque, int ret)
     SCSIDevice *s = r->req.dev;
     int len;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->conf.blk));
-
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
         goto done;
@@ -386,11 +387,11 @@ static void scsi_write_complete(void * opaque, int ret)
 
     trace_scsi_generic_write_complete(ret);
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->conf.blk));
-
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
         goto done;
-- 
2.39.1




* [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race
  2023-02-10 14:32 [PATCH v2 0/3] virtio-scsi: fix SCSIDevice hot unplug with IOThread Stefan Hajnoczi
  2023-02-10 14:32 ` [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock Stefan Hajnoczi
@ 2023-02-10 14:32 ` Stefan Hajnoczi
  2023-02-15 18:19   ` Eric Blake
  2023-02-16 15:27   ` Kevin Wolf
  2023-02-10 14:32 ` [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread Stefan Hajnoczi
  2 siblings, 2 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2023-02-10 14:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Paolo Bonzini, David Hildenbrand,
	Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng, Stefan Hajnoczi

dma_blk_cb() only takes the AioContext lock around ->io_func(). That
means the rest of dma_blk_cb() is not protected. In particular, the
DMAAIOCB field accesses happen outside the lock.

There is a race when the main loop thread holds the AioContext lock and
invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
field determines how cancellation proceeds. If dma_aio_cancel() sees
dbs->acb == NULL while dma_blk_cb() is still running, the request can be
completed twice (-ECANCELED and the actual return value).

The following assertion failure can occur with virtio-scsi when an
IOThread is used:
used:

  ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.

Fix the race by holding the AioContext lock across dma_blk_cb(). Now
dma_aio_cancel() under the AioContext lock will not see
inconsistent/intermediate states.
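
The fixed control flow can be compressed into a sketch (illustrative only; the mutex stands in for the AioContext lock and `struct aiocb_state`/`dma_cb_sketch()` are hypothetical names, but the shape matches the patch below: clear acb, then either complete or resubmit, all inside one critical section ended by a single unlock at `out`):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER; /* AioContext stand-in */

struct aiocb_state {
    void *acb;       /* observed by the cancel path */
    int remaining;   /* segments left to submit */
    int completed;
};

/*
 * Shape of the fixed dma_blk_cb(): the lock is taken first, so clearing acb
 * and the complete-or-resubmit decision are one atomic step as seen by a
 * canceller that holds the same lock.
 */
static void dma_cb_sketch(struct aiocb_state *s, int ret)
{
    pthread_mutex_lock(&ctx_lock);
    s->acb = NULL;
    if (s->remaining == 0 || ret < 0) {
        s->completed = 1;            /* dma_complete() */
        goto out;
    }
    s->remaining--;
    s->acb = s;                      /* io_func() returned a new acb */
out:
    pthread_mutex_unlock(&ctx_lock);
}
```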

Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/scsi/scsi-disk.c   |  4 +---
 softmmu/dma-helpers.c | 12 +++++++-----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 115584f8b9..97c9b1c8cd 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -354,13 +354,12 @@ done:
     scsi_req_unref(&r->req);
 }
 
+/* Called with AioContext lock held */
 static void scsi_dma_complete(void *opaque, int ret)
 {
     SCSIDiskReq *r = (SCSIDiskReq *)opaque;
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
-    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
@@ -370,7 +369,6 @@ static void scsi_dma_complete(void *opaque, int ret)
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_dma_complete_noio(r, ret);
-    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_read_complete_noio(SCSIDiskReq *r, int ret)
diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 7820fec54c..2463964805 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -113,17 +113,19 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
 static void dma_blk_cb(void *opaque, int ret)
 {
     DMAAIOCB *dbs = (DMAAIOCB *)opaque;
+    AioContext *ctx = dbs->ctx;
     dma_addr_t cur_addr, cur_len;
     void *mem;
 
     trace_dma_blk_cb(dbs, ret);
 
+    aio_context_acquire(ctx);
     dbs->acb = NULL;
     dbs->offset += dbs->iov.size;
 
     if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
         dma_complete(dbs, ret);
-        return;
+        goto out;
     }
     dma_blk_unmap(dbs);
 
@@ -164,9 +166,9 @@ static void dma_blk_cb(void *opaque, int ret)
 
     if (dbs->iov.size == 0) {
         trace_dma_map_wait(dbs);
-        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
+        dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
         cpu_register_map_client(dbs->bh);
-        return;
+        goto out;
     }
 
     if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
@@ -174,11 +176,11 @@ static void dma_blk_cb(void *opaque, int ret)
                                 QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
     }
 
-    aio_context_acquire(dbs->ctx);
     dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
                             dma_blk_cb, dbs, dbs->io_func_opaque);
-    aio_context_release(dbs->ctx);
     assert(dbs->acb);
+out:
+    aio_context_release(ctx);
 }
 
 static void dma_aio_cancel(BlockAIOCB *acb)
-- 
2.39.1





* [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread
  2023-02-10 14:32 [PATCH v2 0/3] virtio-scsi: fix SCSIDevice hot unplug with IOThread Stefan Hajnoczi
  2023-02-10 14:32 ` [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock Stefan Hajnoczi
  2023-02-10 14:32 ` [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race Stefan Hajnoczi
@ 2023-02-10 14:32 ` Stefan Hajnoczi
  2023-02-15 18:27   ` Eric Blake
  2023-02-17 10:22   ` Kevin Wolf
  2 siblings, 2 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2023-02-10 14:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Paolo Bonzini, David Hildenbrand,
	Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng, Stefan Hajnoczi, Qing Wang

When an IOThread is configured, the ctrl virtqueue is processed in the
IOThread. TMFs that reset SCSI devices are currently called directly
from the IOThread and trigger an assertion failure in blk_drain():

  ../block/block-backend.c:1780: void blk_drain(BlockBackend *): Assertion `qemu_in_main_thread()' failed.

The blk_drain() function is not designed to be called from an IOThread
because it needs the Big QEMU Lock (BQL).

This patch defers TMFs that reset SCSI devices to a Bottom Half (BH)
that runs in the main loop thread under the BQL. This way it's safe to
call blk_drain() and the assertion failure is avoided.

Introduce s->tmf_bh_list for tracking TMF requests that have been
deferred to the BH. When the BH runs it will grab the entire list and
process all requests. Care must be taken to clear the list when the
virtio-scsi device is reset or unrealized. Otherwise deferred TMF
requests could execute later and lead to use-after-free or other
undefined behavior.

The s->resetting counter that's used by TMFs that reset SCSI devices is
accessed from multiple threads. This patch makes that explicit by using
atomic accessor functions. With this patch applied the counter is only
modified by the main loop thread under the BQL but can be read by any
thread.
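
The defer-to-BH pattern can be sketched outside QEMU as follows. This is an illustrative pthread version: the mutex stands in for virtio_scsi_acquire(), a singly linked list stands in for the QTAILQ, and `defer_tmf()`/`run_tmf_bh()` are hypothetical names. The key point matches the patch: the BH grabs the entire list in one short critical section and processes the requests outside it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a deferred TMF request. */
struct tmf_req { struct tmf_req *next; int done; };

static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER; /* virtio_scsi_acquire() */
static struct tmf_req *tmf_list;
static int bh_scheduled;

/* IOThread side: queue the request and schedule the BH only once. */
static void defer_tmf(struct tmf_req *req)
{
    pthread_mutex_lock(&s_lock);
    req->next = tmf_list;            /* LIFO here; QEMU's QTAILQ keeps FIFO order */
    tmf_list = req;
    if (!bh_scheduled) {
        bh_scheduled = 1;            /* qemu_bh_new() + qemu_bh_schedule() */
    }
    pthread_mutex_unlock(&s_lock);
}

/* Main loop BH: detach the whole list atomically, then process it unlocked. */
static int run_tmf_bh(void)
{
    struct tmf_req *reqs, *r;
    int n = 0;

    pthread_mutex_lock(&s_lock);
    reqs = tmf_list;
    tmf_list = NULL;
    bh_scheduled = 0;
    pthread_mutex_unlock(&s_lock);

    for (r = reqs; r; r = r->next) {
        r->done = 1;                 /* virtio_scsi_do_one_tmf_bh() */
        n++;
    }
    return n;
}
```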

Reported-by: Qing Wang <qinwang@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/hw/virtio/virtio-scsi.h |  11 ++-
 hw/scsi/virtio-scsi.c           | 169 +++++++++++++++++++++++++-------
 2 files changed, 143 insertions(+), 37 deletions(-)

diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 37b75e15e3..779568ab5d 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -74,13 +74,22 @@ struct VirtIOSCSICommon {
     VirtQueue **cmd_vqs;
 };
 
+struct VirtIOSCSIReq;
+
 struct VirtIOSCSI {
     VirtIOSCSICommon parent_obj;
 
     SCSIBus bus;
-    int resetting;
+    int resetting; /* written from main loop thread, read from any thread */
     bool events_dropped;
 
+    /*
+     * TMFs deferred to main loop BH. These fields are protected by
+     * virtio_scsi_acquire().
+     */
+    QEMUBH *tmf_bh;
+    QTAILQ_HEAD(, VirtIOSCSIReq) tmf_bh_list;
+
     /* Fields for dataplane below */
     AioContext *ctx; /* one iothread per virtio-scsi-pci for now */
 
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 2b649ca976..612c525d9d 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -43,13 +43,11 @@ typedef struct VirtIOSCSIReq {
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    union {
-        /* Used for two-stage request submission */
-        QTAILQ_ENTRY(VirtIOSCSIReq) next;
+    /* Used for two-stage request submission and TMFs deferred to BH */
+    QTAILQ_ENTRY(VirtIOSCSIReq) next;
 
-        /* Used for cancellation of request during TMFs */
-        int remaining;
-    };
+    /* Used for cancellation of request during TMFs */
+    int remaining;
 
     SCSIRequest *sreq;
     size_t resp_size;
@@ -294,6 +292,122 @@ static inline void virtio_scsi_ctx_check(VirtIOSCSI *s, SCSIDevice *d)
     }
 }
 
+static void virtio_scsi_do_one_tmf_bh(VirtIOSCSIReq *req)
+{
+    VirtIOSCSI *s = req->dev;
+    SCSIDevice *d = virtio_scsi_device_get(s, req->req.tmf.lun);
+    BusChild *kid;
+    int target;
+
+    switch (req->req.tmf.subtype) {
+    case VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET:
+        if (!d) {
+            req->resp.tmf.response = VIRTIO_SCSI_S_BAD_TARGET;
+            goto out;
+        }
+        if (d->lun != virtio_scsi_get_lun(req->req.tmf.lun)) {
+            req->resp.tmf.response = VIRTIO_SCSI_S_INCORRECT_LUN;
+            goto out;
+        }
+        qatomic_inc(&s->resetting);
+        device_cold_reset(&d->qdev);
+        qatomic_dec(&s->resetting);
+        break;
+
+    case VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET:
+        target = req->req.tmf.lun[1];
+        qatomic_inc(&s->resetting);
+
+        rcu_read_lock();
+        QTAILQ_FOREACH_RCU(kid, &s->bus.qbus.children, sibling) {
+            SCSIDevice *d1 = SCSI_DEVICE(kid->child);
+            if (d1->channel == 0 && d1->id == target) {
+                device_cold_reset(&d1->qdev);
+            }
+        }
+        rcu_read_unlock();
+
+        qatomic_dec(&s->resetting);
+        break;
+
+    default:
+        g_assert_not_reached();
+        break;
+    }
+
+out:
+    object_unref(OBJECT(d));
+
+    virtio_scsi_acquire(s);
+    virtio_scsi_complete_req(req);
+    virtio_scsi_release(s);
+}
+
+/* Some TMFs must be processed from the main loop thread */
+static void virtio_scsi_do_tmf_bh(void *opaque)
+{
+    VirtIOSCSI *s = opaque;
+    QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
+    VirtIOSCSIReq *req;
+    VirtIOSCSIReq *tmp;
+
+    GLOBAL_STATE_CODE();
+
+    virtio_scsi_acquire(s);
+
+    QTAILQ_FOREACH_SAFE(req, &s->tmf_bh_list, next, tmp) {
+        QTAILQ_REMOVE(&s->tmf_bh_list, req, next);
+        QTAILQ_INSERT_TAIL(&reqs, req, next);
+    }
+
+    qemu_bh_delete(s->tmf_bh);
+    s->tmf_bh = NULL;
+
+    virtio_scsi_release(s);
+
+    QTAILQ_FOREACH_SAFE(req, &reqs, next, tmp) {
+        QTAILQ_REMOVE(&reqs, req, next);
+        virtio_scsi_do_one_tmf_bh(req);
+    }
+}
+
+static void virtio_scsi_reset_tmf_bh(VirtIOSCSI *s)
+{
+    VirtIOSCSIReq *req;
+    VirtIOSCSIReq *tmp;
+
+    GLOBAL_STATE_CODE();
+
+    virtio_scsi_acquire(s);
+
+    if (s->tmf_bh) {
+        qemu_bh_delete(s->tmf_bh);
+        s->tmf_bh = NULL;
+    }
+
+    QTAILQ_FOREACH_SAFE(req, &s->tmf_bh_list, next, tmp) {
+        QTAILQ_REMOVE(&s->tmf_bh_list, req, next);
+
+        /* SAM-6 6.3.2 Hard reset */
+        req->resp.tmf.response = VIRTIO_SCSI_S_TARGET_FAILURE;
+        virtio_scsi_complete_req(req);
+    }
+
+    virtio_scsi_release(s);
+}
+
+static void virtio_scsi_defer_tmf_to_bh(VirtIOSCSIReq *req)
+{
+    VirtIOSCSI *s = req->dev;
+
+    QTAILQ_INSERT_TAIL(&s->tmf_bh_list, req, next);
+
+    if (!s->tmf_bh) {
+        s->tmf_bh = qemu_bh_new(virtio_scsi_do_tmf_bh, s);
+        qemu_bh_schedule(s->tmf_bh);
+    }
+}
+
 /* Return 0 if the request is ready to be completed and return to guest;
  * -EINPROGRESS if the request is submitted and will be completed later, in the
  *  case of async cancellation. */
@@ -301,8 +415,6 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq *req)
 {
     SCSIDevice *d = virtio_scsi_device_get(s, req->req.tmf.lun);
     SCSIRequest *r, *next;
-    BusChild *kid;
-    int target;
     int ret = 0;
 
     virtio_scsi_ctx_check(s, d);
@@ -359,15 +471,9 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq *req)
         break;
 
     case VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET:
-        if (!d) {
-            goto fail;
-        }
-        if (d->lun != virtio_scsi_get_lun(req->req.tmf.lun)) {
-            goto incorrect_lun;
-        }
-        s->resetting++;
-        device_cold_reset(&d->qdev);
-        s->resetting--;
+    case VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET:
+        virtio_scsi_defer_tmf_to_bh(req);
+        ret = -EINPROGRESS;
         break;
 
     case VIRTIO_SCSI_T_TMF_ABORT_TASK_SET:
@@ -410,22 +516,6 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq *req)
         }
         break;
 
-    case VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET:
-        target = req->req.tmf.lun[1];
-        s->resetting++;
-
-        rcu_read_lock();
-        QTAILQ_FOREACH_RCU(kid, &s->bus.qbus.children, sibling) {
-            SCSIDevice *d1 = SCSI_DEVICE(kid->child);
-            if (d1->channel == 0 && d1->id == target) {
-                device_cold_reset(&d1->qdev);
-            }
-        }
-        rcu_read_unlock();
-
-        s->resetting--;
-        break;
-
     case VIRTIO_SCSI_T_TMF_CLEAR_ACA:
     default:
         req->resp.tmf.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
@@ -655,7 +745,7 @@ static void virtio_scsi_request_cancelled(SCSIRequest *r)
     if (!req) {
         return;
     }
-    if (req->dev->resetting) {
+    if (qatomic_read(&req->dev->resetting)) {
         req->resp.cmd.response = VIRTIO_SCSI_S_RESET;
     } else {
         req->resp.cmd.response = VIRTIO_SCSI_S_ABORTED;
@@ -831,9 +921,12 @@ static void virtio_scsi_reset(VirtIODevice *vdev)
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
 
     assert(!s->dataplane_started);
-    s->resetting++;
+
+    virtio_scsi_reset_tmf_bh(s);
+
+    qatomic_inc(&s->resetting);
     bus_cold_reset(BUS(&s->bus));
-    s->resetting--;
+    qatomic_dec(&s->resetting);
 
     vs->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
     vs->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
@@ -1053,6 +1146,8 @@ static void virtio_scsi_device_realize(DeviceState *dev, Error **errp)
     VirtIOSCSI *s = VIRTIO_SCSI(dev);
     Error *err = NULL;
 
+    QTAILQ_INIT(&s->tmf_bh_list);
+
     virtio_scsi_common_realize(dev,
                                virtio_scsi_handle_ctrl,
                                virtio_scsi_handle_event,
@@ -1090,6 +1185,8 @@ static void virtio_scsi_device_unrealize(DeviceState *dev)
 {
     VirtIOSCSI *s = VIRTIO_SCSI(dev);
 
+    virtio_scsi_reset_tmf_bh(s);
+
     qbus_set_hotplug_handler(BUS(&s->bus), NULL);
     virtio_scsi_common_unrealize(dev);
 }
-- 
2.39.1




* Re: [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock
  2023-02-10 14:32 ` [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock Stefan Hajnoczi
@ 2023-02-15 18:17   ` Eric Blake
  2023-02-16 12:34   ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Blake @ 2023-02-15 18:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng

On Fri, Feb 10, 2023 at 09:32:36AM -0500, Stefan Hajnoczi wrote:
> If requests are being processed in the IOThread when a SCSIDevice is
> unplugged, scsi_device_purge_requests() -> scsi_req_cancel_async() races
> with I/O completion callbacks. Both threads load and store req->aiocb.
> This can lead to assert(r->req.aiocb == NULL) failures and undefined
> behavior.
> 
> Protect r->req.aiocb with the AioContext lock to prevent the race.

I understand that we're trying to get rid of this lock down the road,
but until then, properly using it to guard things is appropriate.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race
  2023-02-10 14:32 ` [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race Stefan Hajnoczi
@ 2023-02-15 18:19   ` Eric Blake
  2023-02-16 15:27   ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Blake @ 2023-02-15 18:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng

On Fri, Feb 10, 2023 at 09:32:37AM -0500, Stefan Hajnoczi wrote:
> dma_blk_cb() only takes the AioContext lock around ->io_func(). That
> means the rest of dma_blk_cb() is not protected. In particular, the
> DMAAIOCB field accesses happen outside the lock.
> 
> There is a race when the main loop thread holds the AioContext lock and
> invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
> dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
> field determines how cancellation proceeds. If dma_aio_cancel() see

"sees" or "can see"

> dbs->acb == NULL while dma_blk_cb() is still running, the request can be
> completed twice (-ECANCELED and the actual return value).
> 
> The following assertion can occur with virtio-scsi when an IOThread is
> used:
> 
>   ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
> 
> Fix the race by holding the AioContext across dma_blk_cb(). Now
> dma_aio_cancel() under the AioContext lock will not see
> inconsistent/intermediate states.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/scsi/scsi-disk.c   |  4 +---
>  softmmu/dma-helpers.c | 12 +++++++-----
>  2 files changed, 8 insertions(+), 8 deletions(-)

Widening the scope protected by the lock makes sense, insofar as we
still need the lock.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread
  2023-02-10 14:32 ` [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread Stefan Hajnoczi
@ 2023-02-15 18:27   ` Eric Blake
  2023-02-17 10:22   ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Blake @ 2023-02-15 18:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng, Qing Wang

On Fri, Feb 10, 2023 at 09:32:38AM -0500, Stefan Hajnoczi wrote:
> When an IOThread is configured, the ctrl virtqueue is processed in the
> IOThread. TMFs that reset SCSI devices are currently called directly
> from the IOThread and trigger an assertion failure in blk_drain():
> 
>   ../block/block-backend.c:1780: void blk_drain(BlockBackend *): Assertion `qemu_in_main_thread()' failed.
> 
> The blk_drain() function is not designed to be called from an IOThread
> because it needs the Big QEMU Lock (BQL).
> 
> This patch defers TMFs that reset SCSI devices to a Bottom Half (BH)
> that runs in the main loop thread under the BQL. This way it's safe to
> call blk_drain() and the assertion failure is avoided.
> 
> Introduce s->tmf_bh_list for tracking TMF requests that have been
> deferred to the BH. When the BH runs it will grab the entire list and
> process all requests. Care must be taken to clear the list when the
> virtio-scsi device is reset or unrealized. Otherwise deferred TMF
> requests could execute later and lead to use-after-free or other
> undefined behavior.
> 
> The s->resetting counter that's used by TMFs that reset SCSI devices is
> accessed from multiple threads. This patch makes that explicit by using
> atomic accessor functions. With this patch applied the counter is only
> modified by the main loop thread under the BQL but can be read by any
> thread.
> 
> Reported-by: Qing Wang <qinwang@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/hw/virtio/virtio-scsi.h |  11 ++-
>  hw/scsi/virtio-scsi.c           | 169 +++++++++++++++++++++++++-------
>  2 files changed, 143 insertions(+), 37 deletions(-)

This one is trickier than the first two, so if you want a second
reviewer, I understand.  That said, although the number of lines of
code grows, the code motion into separate handlers was fairly easy to
follow, and the logic for why they must be in a separate BH context is
correct.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock
  2023-02-10 14:32 ` [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock Stefan Hajnoczi
  2023-02-15 18:17   ` Eric Blake
@ 2023-02-16 12:34   ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Kevin Wolf @ 2023-02-16 12:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng

Am 10.02.2023 um 15:32 hat Stefan Hajnoczi geschrieben:
> If requests are being processed in the IOThread when a SCSIDevice is
> unplugged, scsi_device_purge_requests() -> scsi_req_cancel_async() races
> with I/O completion callbacks. Both threads load and store req->aiocb.
> This can lead to assert(r->req.aiocb == NULL) failures and undefined
> behavior.
> 
> Protect r->req.aiocb with the AioContext lock to prevent the race.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

I tried to check that all accesses of .aiocb are actually protected by
the AioContext lock. I stopped at scsi_read_data(), which asserts that
it is non-NULL, but it is only called as a SCSIReqOps function. I
couldn't find any information on what the locking rules are with
SCSIReqOps functions and didn't feel like reverse engineering scsi-bus
etc. without just asking first.

The same question applies to:
- scsi_read_data
- scsi_write_data
- scsi_disk_emulate_write_data
- scsi_disk_emulate_command

Since these are not callbacks scheduled in the AioContext by scsi-disk
itself, I expect that they are indeed covered. The acquire/release pair
in virtio_scsi_handle_cmd() might actually indirectly cover all of them,
but I haven't checked that.

Either way, the changes that you're making look good:

Reviewed-by: Kevin Wolf <kwolf@redhat.com>




* Re: [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race
  2023-02-10 14:32 ` [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race Stefan Hajnoczi
  2023-02-15 18:19   ` Eric Blake
@ 2023-02-16 15:27   ` Kevin Wolf
  2023-02-16 21:27     ` Stefan Hajnoczi
  1 sibling, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2023-02-16 15:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng

Am 10.02.2023 um 15:32 hat Stefan Hajnoczi geschrieben:
> dma_blk_cb() only takes the AioContext lock around ->io_func(). That
> means the rest of dma_blk_cb() is not protected. In particular, the
> DMAAIOCB field accesses happen outside the lock.
> 
> There is a race when the main loop thread holds the AioContext lock and
> invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
> dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
> field determines how cancellation proceeds. If dma_aio_cancel() sees
> dbs->acb == NULL while dma_blk_cb() is still running, the request can be
> completed twice (-ECANCELED and the actual return value).
> 
> The following assertion can occur with virtio-scsi when an IOThread is
> used:
> 
>   ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
> 
> Fix the race by holding the AioContext lock across dma_blk_cb(). Now
> dma_aio_cancel() under the AioContext lock will not see
> inconsistent/intermediate states.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Two things that seem to need attention in the review:

> diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> index 7820fec54c..2463964805 100644
> --- a/softmmu/dma-helpers.c
> +++ b/softmmu/dma-helpers.c
> @@ -113,17 +113,19 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
>  static void dma_blk_cb(void *opaque, int ret)
>  {
>      DMAAIOCB *dbs = (DMAAIOCB *)opaque;
> +    AioContext *ctx = dbs->ctx;
>      dma_addr_t cur_addr, cur_len;
>      void *mem;
>  
>      trace_dma_blk_cb(dbs, ret);
>  
> +    aio_context_acquire(ctx);

During the first iteration, the caller may already hold the AioContext
lock. In the case of scsi-disk, it does. Locking a second time is fine
in principle because it's a recursive lock, but we have to be careful
not to call AIO_WAIT_WHILE() indirectly then because it could deadlock.

Except for the dbs->common.cb (see below) I don't see any functions that
would be problematic in this respect. In fact, the one I would be most
worried about is dbs->io_func, but it was already locked before.

>      dbs->acb = NULL;
>      dbs->offset += dbs->iov.size;
>  
>      if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
>          dma_complete(dbs, ret);

All request callbacks hold the AioContext lock now when they didn't
before. I wonder if it's worth documenting the locking rules for
dma_blk_io() in a comment. Could be a separate patch, though.

You remove the locking in scsi_dma_complete() to compensate. All the
other callers come from IDE and nvme, which don't take the lock
internally. Taking the main AioContext lock once is fine, so this looks
good.

If it is possible that we already complete on the first iteration, this
path could also be affected by the case above, so that the AioContext
is locked twice. In that case, the invoked callback must not call
AIO_WAIT_WHILE() and we would need to audit IDE and nvme.

Is it possible? In other words, can dbs->sg->nsg be 0? If not, can we
assert and document it?

> -        return;
> +        goto out;
>      }
>      dma_blk_unmap(dbs);
>  
> @@ -164,9 +166,9 @@ static void dma_blk_cb(void *opaque, int ret)
>  
>      if (dbs->iov.size == 0) {
>          trace_dma_map_wait(dbs);
> -        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
> +        dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);

Does copying dbs->ctx to a local variable imply that it may change
during the function? I didn't think so, but if it may, then why is
calling aio_bh_new() with the old value right?

>          cpu_register_map_client(dbs->bh);
> -        return;
> +        goto out;
>      }
>  
>      if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
> @@ -174,11 +176,11 @@ static void dma_blk_cb(void *opaque, int ret)
>                                  QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
>      }
>  
> -    aio_context_acquire(dbs->ctx);
>      dbs->acb = dbs->io_func(dbs->offset, &dbs->iov,
>                              dma_blk_cb, dbs, dbs->io_func_opaque);
> -    aio_context_release(dbs->ctx);
>      assert(dbs->acb);
> +out:
> +    aio_context_release(ctx);
>  }

Kevin




* Re: [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race
  2023-02-16 15:27   ` Kevin Wolf
@ 2023-02-16 21:27     ` Stefan Hajnoczi
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2023-02-16 21:27 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng


On Thu, Feb 16, 2023 at 04:27:42PM +0100, Kevin Wolf wrote:
> On 10.02.2023 at 15:32, Stefan Hajnoczi wrote:
> > dma_blk_cb() only takes the AioContext lock around ->io_func(). That
> > means the rest of dma_blk_cb() is not protected. In particular, the
> > DMAAIOCB field accesses happen outside the lock.
> > 
> > There is a race when the main loop thread holds the AioContext lock and
> > invokes scsi_device_purge_requests() -> bdrv_aio_cancel() ->
> > dma_aio_cancel() while an IOThread executes dma_blk_cb(). The dbs->acb
> > field determines how cancellation proceeds. If dma_aio_cancel() sees
> > dbs->acb == NULL while dma_blk_cb() is still running, the request can be
> > completed twice (-ECANCELED and the actual return value).
> > 
> > The following assertion can occur with virtio-scsi when an IOThread is
> > used:
> > 
> >   ../hw/scsi/scsi-disk.c:368: scsi_dma_complete: Assertion `r->req.aiocb != NULL' failed.
> > 
> > Fix the race by holding the AioContext lock across dma_blk_cb(). Now
> > dma_aio_cancel() under the AioContext lock will not see
> > inconsistent/intermediate states.
> > 
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Two things that seem to need attention in the review:
> 
> > diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> > index 7820fec54c..2463964805 100644
> > --- a/softmmu/dma-helpers.c
> > +++ b/softmmu/dma-helpers.c
> > @@ -113,17 +113,19 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
> >  static void dma_blk_cb(void *opaque, int ret)
> >  {
> >      DMAAIOCB *dbs = (DMAAIOCB *)opaque;
> > +    AioContext *ctx = dbs->ctx;
> >      dma_addr_t cur_addr, cur_len;
> >      void *mem;
> >  
> >      trace_dma_blk_cb(dbs, ret);
> >  
> > +    aio_context_acquire(ctx);
> 
> During the first iteration, the caller may already hold the AioContext
> lock. In the case of scsi-disk, it does. Locking a second time is fine
> in principle because it's a recursive lock, but we have to be careful
> not to call AIO_WAIT_WHILE() indirectly then because it could deadlock.
> 
> Except for the dbs->common.cb (see below) I don't see any functions that
> would be problematic in this respect. In fact, the one I would be most
> worried about is dbs->io_func, but it was already locked before.
> 
> >      dbs->acb = NULL;
> >      dbs->offset += dbs->iov.size;
> >  
> >      if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
> >          dma_complete(dbs, ret);
> 
> All request callbacks hold the AioContext lock now when they didn't
> before. I wonder if it's worth documenting the locking rules for
> dma_blk_io() in a comment. Could be a separate patch, though.
> 
> You remove the locking in scsi_dma_complete() to compensate. All the
> other callers come from IDE and nvme, which don't take the lock
> internally. Taking the main AioContext lock once is fine, so this looks
> good.
> 
> If it is possible that we already complete on the first iteration, this
> path could also be affected by the case above, so that the AioContext
> is locked twice. In that case, the invoked callback must not call
> AIO_WAIT_WHILE() and we would need to audit IDE and nvme.
> 
> Is it possible? In other words, can dbs->sg->nsg be 0? If not, can we
> assert and document it?

In the nsg == 0 case there's another problem: the completion callback
function is invoked and the AIOCB is unref'd before dma_blk_io() returns,
so the returned AIOCB pointer is already stale.

That would lead to problems later because the pattern is typically:

  r->aiocb = dma_blk_io(...);
  ...
  use r and r->aiocb later

So I don't think nsg == 0 makes sense.

Functions I looked at invoke dma_blk_io() only when there are still
bytes to transfer. I think we're safe but I'll admit I'm not 100% sure.

> > -        return;
> > +        goto out;
> >      }
> >      dma_blk_unmap(dbs);
> >  
> > @@ -164,9 +166,9 @@ static void dma_blk_cb(void *opaque, int ret)
> >  
> >      if (dbs->iov.size == 0) {
> >          trace_dma_map_wait(dbs);
> > -        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
> > +        dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
> 
> Does copying dbs->ctx to a local variable imply that it may change
> during the function? I didn't think so, but if it may, then why is
> calling aio_bh_new() with the old value right?

I changed this line for consistency, not to change behavior or fix a bug.

Regarding AioContext changes, they can't happen because no function that
changes the AioContext is called between this line and the earlier
aio_context_acquire().

(Having to worry about AioContext changes is a pain though and I look
forward to when we can get rid of this lock.)

Stefan



* Re: [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread
  2023-02-10 14:32 ` [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread Stefan Hajnoczi
  2023-02-15 18:27   ` Eric Blake
@ 2023-02-17 10:22   ` Kevin Wolf
       [not found]     ` <Y/Tz+qw7thcwO+G3@fedora>
  1 sibling, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2023-02-17 10:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin, Paolo Bonzini,
	David Hildenbrand, Peter Xu, Philippe Mathieu-Daudé,
	Fam Zheng, Qing Wang

On 10.02.2023 at 15:32, Stefan Hajnoczi wrote:
> When an IOThread is configured, the ctrl virtqueue is processed in the
> IOThread. TMFs that reset SCSI devices are currently called directly
> from the IOThread and trigger an assertion failure in blk_drain():
> 
>   ../block/block-backend.c:1780: void blk_drain(BlockBackend *): Assertion `qemu_in_main_thread()' failed.
> 
> The blk_drain() function is not designed to be called from an IOThread
> because it needs the Big QEMU Lock (BQL).
> 
> This patch defers TMFs that reset SCSI devices to a Bottom Half (BH)
> that runs in the main loop thread under the BQL. This way it's safe to
> call blk_drain() and the assertion failure is avoided.

It's not entirely obvious what the call path is that leads to
blk_drain(). Do we somehow call into virtio_scsi_dataplane_stop()?

> Introduce s->tmf_bh_list for tracking TMF requests that have been
> deferred to the BH. When the BH runs it will grab the entire list and
> process all requests. Care must be taken to clear the list when the
> virtio-scsi device is reset or unrealized. Otherwise deferred TMF
> requests could execute later and lead to use-after-free or other
> undefined behavior.

Why don't we already need the same for other asynchronously processed
requests? Certainly having a read request write to guest memory after
the device has been reset or unplugged isn't what we want either.

I see that we assert(!s->dataplane_started) in virtio_scsi_reset(),
which may be part of the reason. If we're not processing any requests,
then we're safe. virtio_scsi_dataplane_stop() calls blk_drain_all()
(which is really a too big hammer) in order to make sure that in-flight
requests are completed before dataplane_started becomes false.

I was wondering if we couldn't just blk_inc_in_flight() while a TMF
request is in flight and then use the same draining logic to be covered.
You could use oneshot BHs then and do away with the list because you
don't need to cancel anything any more, you just wait until the BHs have
completed.

The practical problem may be that we don't have a blk here (which is
probably also why blk_drain_all() is used). We could have our own
AIO_WAIT_WHILE() instead. I feel waiting instead of cancelling BHs would
simplify the code.

In fact, I think technically, you may not need any of that because
blk_drain_all() also executes all BHs in the main loop before it
returns, but that might be a bit subtle...

> The s->resetting counter that's used by TMFs that reset SCSI devices is
> accessed from multiple threads. This patch makes that explicit by using
> atomic accessor functions. With this patch applied the counter is only
> modified by the main loop thread under the BQL but can be read by any
> thread.
> 
> Reported-by: Qing Wang <qinwang@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Kevin




* Re: [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread
       [not found]     ` <Y/Tz+qw7thcwO+G3@fedora>
@ 2023-02-23 15:03       ` Stefan Hajnoczi
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2023-02-23 15:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, qemu-devel, qemu-block, Michael S. Tsirkin,
	Paolo Bonzini, David Hildenbrand, Peter Xu,
	Philippe Mathieu-Daudé,
	Fam Zheng, Qing Wang


On Tue, Feb 21, 2023 at 11:40:26AM -0500, Stefan Hajnoczi wrote:
> On Fri, Feb 17, 2023 at 11:22:02AM +0100, Kevin Wolf wrote:
> > On 10.02.2023 at 15:32, Stefan Hajnoczi wrote:
> > > When an IOThread is configured, the ctrl virtqueue is processed in the
> > > IOThread. TMFs that reset SCSI devices are currently called directly
> > > from the IOThread and trigger an assertion failure in blk_drain():
> > > 
> > >   ../block/block-backend.c:1780: void blk_drain(BlockBackend *): Assertion `qemu_in_main_thread()' failed.
> > > 
> > > The blk_drain() function is not designed to be called from an IOThread
> > > because it needs the Big QEMU Lock (BQL).
> > > 
> > > This patch defers TMFs that reset SCSI devices to a Bottom Half (BH)
> > > that runs in the main loop thread under the BQL. This way it's safe to
> > > call blk_drain() and the assertion failure is avoided.
> > 
> > It's not entirely obvious what the call path is that leads to
> > blk_drain(). Do we somehow call into virtio_scsi_dataplane_stop()?
> 
> I'll make it clearer that resetting the SCSIDevice calls blk_drain():
> 
>   void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
>   {
>       SCSIRequest *req;
> 
>       aio_context_acquire(blk_get_aio_context(sdev->conf.blk));
>       while (!QTAILQ_EMPTY(&sdev->requests)) {
>           req = QTAILQ_FIRST(&sdev->requests);
>           scsi_req_cancel_async(req, NULL);
>       }
>       blk_drain(sdev->conf.blk);
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> The IOThread code path is virtio_scsi_handle_ctrl_req ->
> virtio_scsi_do_tmf -> device_code_reset -> scsi_disk_reset ->
> scsi_device_purge_requests.
> 
> > > Introduce s->tmf_bh_list for tracking TMF requests that have been
> > > deferred to the BH. When the BH runs it will grab the entire list and
> > > process all requests. Care must be taken to clear the list when the
> > > virtio-scsi device is reset or unrealized. Otherwise deferred TMF
> > > requests could execute later and lead to use-after-free or other
> > > undefined behavior.
> > 
> > Why don't we already need the same for other asynchronously processed
> > requests? Certainly having a read request write to guest memory after
> > the device has been reset or unplugged isn't what we want either.
> 
> I/O is already taken care of because bdrv_drain_all() is called in
> do_vm_stop(). TMFs are not purely a device emulation concept and not a
> block layer concept covered by bdrv_drain_all(), so they need to be
> handled explicitly.
> 
> > 
> > I see that we assert(!s->dataplane_started) in virtio_scsi_reset(),
> > which may be part of the reason. If we're not processing any requests,
> 
> Yes, dataplane is stopped when the virtio-scsi device is reset.
> 
> > then we're safe. virtio_scsi_dataplane_stop() calls blk_drain_all()
> > (which is really a too big hammer) in order to make sure that in-flight
> > requests are completed before dataplane_started becomes false.
> > 
> > I was wondering if we couldn't just blk_inc_in_flight() while a TMF
> > request is in flight and then use the same draining logic to be covered.
> > You could use oneshot BHs then and do away with the list because you
> > don't need to cancel anything any more, you just wait until the BHs have
> > completed.
> > 
> > The practical problem may be that we don't have a blk here (which is
> > probably also why blk_drain_all() is used). We could have our own
> > AIO_WAIT_WHILE() instead. I feel waiting instead of cancelling BHs would
> > simplify the code.
> > 
> > In fact, I think technically, you may not need any of that because
> > blk_drain_all() also executes all BHs in the main loop before it
> > returns, but that might be a bit subtle...
> 
> There is a bit of a layering violation if we use the fake block layer
> requests to track virtio-scsi HBA TMF emulation.
> 
> The starting point for doing this would probably be a
> SCSIDeviceClass->schedule_reset() function that allows scsi-disk.c to
> implement the one-shot BH + blk_inc_in_flight() approach. scsi-disk.c
> has the BlockBackend.
> 
> The fact that we're having to think hard about how to make this work
> makes me wonder whether it's a good idea?
> 
> > > The s->resetting counter that's used by TMFs that reset SCSI devices is
> > > accessed from multiple threads. This patch makes that explicit by using
> > > atomic accessor functions. With this patch applied the counter is only
> > > modified by the main loop thread under the BQL but can be read by any
> > > thread.
> > > 
> > > Reported-by: Qing Wang <qinwang@redhat.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Kevin
> > 





end of thread, other threads:[~2023-02-23 15:34 UTC | newest]

Thread overview: 12+ messages
2023-02-10 14:32 [PATCH v2 0/3] virtio-scsi: fix SCSIDevice hot unplug with IOThread Stefan Hajnoczi
2023-02-10 14:32 ` [PATCH v2 1/3] scsi: protect req->aiocb with AioContext lock Stefan Hajnoczi
2023-02-15 18:17   ` Eric Blake
2023-02-16 12:34   ` Kevin Wolf
2023-02-10 14:32 ` [PATCH v2 2/3] dma-helpers: prevent dma_blk_cb() vs dma_aio_cancel() race Stefan Hajnoczi
2023-02-15 18:19   ` Eric Blake
2023-02-16 15:27   ` Kevin Wolf
2023-02-16 21:27     ` Stefan Hajnoczi
2023-02-10 14:32 ` [PATCH v2 3/3] virtio-scsi: reset SCSI devices from main loop thread Stefan Hajnoczi
2023-02-15 18:27   ` Eric Blake
2023-02-17 10:22   ` Kevin Wolf
     [not found]     ` <Y/Tz+qw7thcwO+G3@fedora>
2023-02-23 15:03       ` Stefan Hajnoczi
