* [PATCH v2 0/7] block/nvme: support nested aio_poll()
@ 2020-06-17 13:21 Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 1/7] block/nvme: poll queues without q->lock Stefan Hajnoczi
` (7 more replies)
0 siblings, 8 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:21 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, qemu-block, Max Reitz, Stefan Hajnoczi,
Philippe Mathieu-Daudé
v2:
* Reword comment in Patch 1 explaining why it's safe not to hold q->lock [Sergio]
* Fix s/unwiedly/unwieldy/ typo in the Patch 6 commit description [Philippe]
This series allows aio_poll() to work from I/O request completion callbacks.
QEMU block drivers are supposed to support this because some code paths rely on
this behavior.
There was no measurable performance difference with nested aio_poll() support.
This patch series also contains cleanups that I made while examining and
benchmarking the code.
Stefan Hajnoczi (7):
block/nvme: poll queues without q->lock
block/nvme: drop tautologous assertion
block/nvme: don't access CQE after moving cq.head
block/nvme: switch to a NVMeRequest freelist
block/nvme: clarify that free_req_queue is protected by q->lock
block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair
block/nvme: support nested aio_poll()
block/nvme.c | 220 ++++++++++++++++++++++++++++++++-------------
block/trace-events | 2 +-
2 files changed, 160 insertions(+), 62 deletions(-)
--
2.26.2
* [PATCH v2 1/7] block/nvme: poll queues without q->lock
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
@ 2020-06-17 13:21 ` Stefan Hajnoczi
2020-06-22 13:54 ` Sergio Lopez
2020-06-17 13:21 ` [PATCH v2 2/7] block/nvme: drop tautologous assertion Stefan Hajnoczi
` (6 subsequent siblings)
7 siblings, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:21 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, qemu-block, Max Reitz, Stefan Hajnoczi,
Philippe Mathieu-Daudé
A lot of CPU time is spent simply locking/unlocking q->lock during
polling. Check for completion outside the lock to make q->lock disappear
from the profile.
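As an illustrative sketch of the phase-bit test used for the lock-free early check: bit 0 of the CQE status field flips each time the controller wraps the completion queue, so an entry whose phase bit still equals the host's expected phase has not been posted yet and the queue can be skipped without taking q->lock. All names below are hypothetical simplifications, not QEMU's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the early completion check. status_le is taken in
 * host byte order here for brevity (the real code uses le16_to_cpu()).
 * A CQE is pending only when its phase bit differs from the expected
 * cq_phase; when they match, the entry is stale and polling can skip
 * this queue without locking.
 */
static int cqe_is_pending(uint16_t status, int cq_phase)
{
    return (status & 0x1) != cq_phase;
}
```

A cheap read like this is safe precisely because a stale result only delays processing until the next poll; it never consumes an entry.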
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
block/nvme.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/block/nvme.c b/block/nvme.c
index eb2f54dd9d..e4375ec245 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
for (i = 0; i < s->nr_queues; i++) {
NVMeQueuePair *q = s->queues[i];
+ const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
+ NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
+
+ /*
+ * Do an early check for completions. q->lock isn't needed because
+ * nvme_process_completion() only runs in the event loop thread and
+ * cannot race with itself.
+ */
+ if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
+ continue;
+ }
+
qemu_mutex_lock(&q->lock);
while (nvme_process_completion(s, q)) {
/* Keep polling */
--
2.26.2
* [PATCH v2 2/7] block/nvme: drop tautologous assertion
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 1/7] block/nvme: poll queues without q->lock Stefan Hajnoczi
@ 2020-06-17 13:21 ` Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 3/7] block/nvme: don't access CQE after moving cq.head Stefan Hajnoczi
` (5 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:21 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, Sergio Lopez, qemu-block, Max Reitz,
Stefan Hajnoczi, Philippe Mathieu-Daudé
nvme_process_completion() explicitly checks cid so the assertion that
follows is always true:
if (cid == 0 || cid > NVME_QUEUE_SIZE) {
...
continue;
}
assert(cid <= NVME_QUEUE_SIZE);
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
block/nvme.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/block/nvme.c b/block/nvme.c
index e4375ec245..d567ece3f4 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -336,7 +336,6 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
cid);
continue;
}
- assert(cid <= NVME_QUEUE_SIZE);
trace_nvme_complete_command(s, q->index, cid);
preq = &q->reqs[cid - 1];
req = *preq;
--
2.26.2
* [PATCH v2 3/7] block/nvme: don't access CQE after moving cq.head
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 1/7] block/nvme: poll queues without q->lock Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 2/7] block/nvme: drop tautologous assertion Stefan Hajnoczi
@ 2020-06-17 13:21 ` Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 4/7] block/nvme: switch to a NVMeRequest freelist Stefan Hajnoczi
` (4 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:21 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, Sergio Lopez, qemu-block, Max Reitz,
Stefan Hajnoczi, Philippe Mathieu-Daudé
Do not access a CQE after incrementing q->cq.head and releasing q->lock.
It is unlikely that this causes problems in practice, but it is a latent
bug.
The reason why it should be safe at the moment is that completion
processing is not re-entrant and the CQ doorbell isn't written until the
end of nvme_process_completion().
Make this change now because QEMU expects completion processing to be
re-entrant and later patches will do that.
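A minimal sketch of the ordering rule this patch enforces (simplified types, hypothetical names, not QEMU's actual code): copy everything needed out of the CQE before advancing cq.head, because once the head moves and the doorbell is eventually written, the device may overwrite that slot.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 4

struct cqe { uint16_t status; };
struct cq  { struct cqe queue[QUEUE_SIZE]; int head; };

/* Toy error mapping: any nonzero status code becomes -1. */
static int translate_error(const struct cqe *c)
{
    return (c->status >> 1) ? -1 : 0;
}

static int consume_cqe(struct cq *cq)
{
    const struct cqe *c = &cq->queue[cq->head];
    int ret = translate_error(c);           /* read CQE fields first */
    cq->head = (cq->head + 1) % QUEUE_SIZE; /* only then move head */
    return ret;
}
```

The same discipline generalizes: any field of a shared ring entry must be captured before the consumer publishes progress past it.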
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
block/nvme.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/nvme.c b/block/nvme.c
index d567ece3f4..344893811a 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -321,11 +321,14 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
q->busy = true;
assert(q->inflight >= 0);
while (q->inflight) {
+ int ret;
int16_t cid;
+
c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
break;
}
+ ret = nvme_translate_error(c);
q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
if (!q->cq.head) {
q->cq_phase = !q->cq_phase;
@@ -344,7 +347,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
preq->busy = false;
preq->cb = preq->opaque = NULL;
qemu_mutex_unlock(&q->lock);
- req.cb(req.opaque, nvme_translate_error(c));
+ req.cb(req.opaque, ret);
qemu_mutex_lock(&q->lock);
q->inflight--;
progress = true;
--
2.26.2
* [PATCH v2 4/7] block/nvme: switch to a NVMeRequest freelist
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
` (2 preceding siblings ...)
2020-06-17 13:21 ` [PATCH v2 3/7] block/nvme: don't access CQE after moving cq.head Stefan Hajnoczi
@ 2020-06-17 13:21 ` Stefan Hajnoczi
2020-06-17 13:21 ` [PATCH v2 5/7] block/nvme: clarify that free_req_queue is protected by q->lock Stefan Hajnoczi
` (3 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:21 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, Sergio Lopez, qemu-block, Max Reitz,
Stefan Hajnoczi, Philippe Mathieu-Daudé
There are three issues with the current NVMeRequest->busy field:
1. The busy field is accidentally accessed outside q->lock when request
submission fails.
2. Waiters on free_req_queue are not woken when a request is returned
early due to submission failure.
3. Finding a free request involves scanning all requests. This makes
request submission O(n^2).
Switch to an O(1) freelist that is always accessed under the lock.
Also differentiate between NVME_QUEUE_SIZE, the actual SQ/CQ size, and
NVME_NUM_REQS, the number of usable requests. This makes the code
simpler than using NVME_QUEUE_SIZE everywhere and having to keep in mind
that one slot is reserved.
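The index-based freelist described above can be sketched as follows. free_req_next chains free slots by array index and -1 terminates the list, so both allocation and release are O(1). The structures and names are illustrative stand-ins, not QEMU's actual ones.

```c
#include <assert.h>

#define NUM_REQS 4

struct req   { int free_req_next; }; /* index of next free req, or -1 */
struct queue { int free_req_head; struct req reqs[NUM_REQS]; };

static void queue_init(struct queue *q)
{
    q->free_req_head = -1;
    for (int i = 0; i < NUM_REQS; i++) {
        q->reqs[i].free_req_next = q->free_req_head;
        q->free_req_head = i;
    }
}

/* Pop a free request index, or -1 if none are available. */
static int get_free_req(struct queue *q)
{
    int i = q->free_req_head;
    if (i != -1) {
        q->free_req_head = q->reqs[i].free_req_next;
        q->reqs[i].free_req_next = -1;
    }
    return i;
}

/* Push a request index back onto the freelist. */
static void put_free_req(struct queue *q, int i)
{
    q->reqs[i].free_req_next = q->free_req_head;
    q->free_req_head = i;
}
```

In the real driver these operations run under q->lock, which is what closes the unlocked-access and missed-wakeup issues listed above.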
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
---
block/nvme.c | 81 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 54 insertions(+), 27 deletions(-)
diff --git a/block/nvme.c b/block/nvme.c
index 344893811a..8e60882af3 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -33,6 +33,12 @@
#define NVME_QUEUE_SIZE 128
#define NVME_BAR_SIZE 8192
+/*
+ * We have to leave one slot empty as that is the full queue case where
+ * head == tail + 1.
+ */
+#define NVME_NUM_REQS (NVME_QUEUE_SIZE - 1)
+
typedef struct {
int32_t head, tail;
uint8_t *queue;
@@ -47,7 +53,7 @@ typedef struct {
int cid;
void *prp_list_page;
uint64_t prp_list_iova;
- bool busy;
+ int free_req_next; /* q->reqs[] index of next free req */
} NVMeRequest;
typedef struct {
@@ -61,7 +67,8 @@ typedef struct {
/* Fields protected by @lock */
NVMeQueue sq, cq;
int cq_phase;
- NVMeRequest reqs[NVME_QUEUE_SIZE];
+ int free_req_head;
+ NVMeRequest reqs[NVME_NUM_REQS];
bool busy;
int need_kick;
int inflight;
@@ -200,19 +207,23 @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
qemu_mutex_init(&q->lock);
q->index = idx;
qemu_co_queue_init(&q->free_req_queue);
- q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
+ q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
- s->page_size * NVME_QUEUE_SIZE,
+ s->page_size * NVME_NUM_REQS,
false, &prp_list_iova);
if (r) {
goto fail;
}
- for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+ q->free_req_head = -1;
+ for (i = 0; i < NVME_NUM_REQS; i++) {
NVMeRequest *req = &q->reqs[i];
req->cid = i + 1;
+ req->free_req_next = q->free_req_head;
+ q->free_req_head = i;
req->prp_list_page = q->prp_list_pages + i * s->page_size;
req->prp_list_iova = prp_list_iova + i * s->page_size;
}
+
nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
if (local_err) {
error_propagate(errp, local_err);
@@ -254,13 +265,11 @@ static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
*/
static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
{
- int i;
- NVMeRequest *req = NULL;
+ NVMeRequest *req;
qemu_mutex_lock(&q->lock);
- while (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
- /* We have to leave one slot empty as that is the full queue case (head
- * == tail + 1). */
+
+ while (q->free_req_head == -1) {
if (qemu_in_coroutine()) {
trace_nvme_free_req_queue_wait(q);
qemu_co_queue_wait(&q->free_req_queue, &q->lock);
@@ -269,20 +278,41 @@ static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
return NULL;
}
}
- for (i = 0; i < NVME_QUEUE_SIZE; i++) {
- if (!q->reqs[i].busy) {
- q->reqs[i].busy = true;
- req = &q->reqs[i];
- break;
- }
- }
- /* We have checked inflight and need_kick while holding q->lock, so one
- * free req must be available. */
- assert(req);
+
+ req = &q->reqs[q->free_req_head];
+ q->free_req_head = req->free_req_next;
+ req->free_req_next = -1;
+
qemu_mutex_unlock(&q->lock);
return req;
}
+/* With q->lock */
+static void nvme_put_free_req_locked(NVMeQueuePair *q, NVMeRequest *req)
+{
+ req->free_req_next = q->free_req_head;
+ q->free_req_head = req - q->reqs;
+}
+
+/* With q->lock */
+static void nvme_wake_free_req_locked(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+ if (!qemu_co_queue_empty(&q->free_req_queue)) {
+ replay_bh_schedule_oneshot_event(s->aio_context,
+ nvme_free_req_queue_cb, q);
+ }
+}
+
+/* Insert a request in the freelist and wake waiters */
+static void nvme_put_free_req_and_wake(BDRVNVMeState *s, NVMeQueuePair *q,
+ NVMeRequest *req)
+{
+ qemu_mutex_lock(&q->lock);
+ nvme_put_free_req_locked(q, req);
+ nvme_wake_free_req_locked(s, q);
+ qemu_mutex_unlock(&q->lock);
+}
+
static inline int nvme_translate_error(const NvmeCqe *c)
{
uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
@@ -344,7 +374,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
req = *preq;
assert(req.cid == cid);
assert(req.cb);
- preq->busy = false;
+ nvme_put_free_req_locked(q, preq);
preq->cb = preq->opaque = NULL;
qemu_mutex_unlock(&q->lock);
req.cb(req.opaque, ret);
@@ -356,10 +386,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
/* Notify the device so it can post more completions. */
smp_mb_release();
*q->cq.doorbell = cpu_to_le32(q->cq.head);
- if (!qemu_co_queue_empty(&q->free_req_queue)) {
- replay_bh_schedule_oneshot_event(s->aio_context,
- nvme_free_req_queue_cb, q);
- }
+ nvme_wake_free_req_locked(s, q);
}
q->busy = false;
return progress;
@@ -1001,7 +1028,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
qemu_co_mutex_unlock(&s->dma_map_lock);
if (r) {
- req->busy = false;
+ nvme_put_free_req_and_wake(s, ioq, req);
return r;
}
nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
@@ -1218,7 +1245,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
qemu_co_mutex_unlock(&s->dma_map_lock);
if (ret) {
- req->busy = false;
+ nvme_put_free_req_and_wake(s, ioq, req);
goto out;
}
--
2.26.2
* [PATCH v2 5/7] block/nvme: clarify that free_req_queue is protected by q->lock
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
` (3 preceding siblings ...)
2020-06-17 13:21 ` [PATCH v2 4/7] block/nvme: switch to a NVMeRequest freelist Stefan Hajnoczi
@ 2020-06-17 13:21 ` Stefan Hajnoczi
2020-06-17 13:22 ` [PATCH v2 6/7] block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair Stefan Hajnoczi
` (2 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:21 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, Sergio Lopez, qemu-block, Max Reitz,
Stefan Hajnoczi, Philippe Mathieu-Daudé
Existing users access free_req_queue under q->lock. Document this.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
block/nvme.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/nvme.c b/block/nvme.c
index 8e60882af3..426c77e5ab 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -57,7 +57,6 @@ typedef struct {
} NVMeRequest;
typedef struct {
- CoQueue free_req_queue;
QemuMutex lock;
/* Fields protected by BQL */
@@ -65,6 +64,7 @@ typedef struct {
uint8_t *prp_list_pages;
/* Fields protected by @lock */
+ CoQueue free_req_queue;
NVMeQueue sq, cq;
int cq_phase;
int free_req_head;
--
2.26.2
* [PATCH v2 6/7] block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
` (4 preceding siblings ...)
2020-06-17 13:21 ` [PATCH v2 5/7] block/nvme: clarify that free_req_queue is protected by q->lock Stefan Hajnoczi
@ 2020-06-17 13:22 ` Stefan Hajnoczi
2020-06-17 13:22 ` [PATCH v2 7/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
2020-06-23 14:46 ` [PATCH v2 0/7] " Stefan Hajnoczi
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:22 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, Sergio Lopez, qemu-block, Max Reitz,
Stefan Hajnoczi, Philippe Mathieu-Daudé
Passing around both BDRVNVMeState and NVMeQueuePair is unwieldy. Reduce
the number of function arguments by keeping the BDRVNVMeState pointer in
NVMeQueuePair. This will come in handy when a BH is introduced in a
later patch and only one argument can be passed to it.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
block/nvme.c | 70 ++++++++++++++++++++++++++++------------------------
1 file changed, 38 insertions(+), 32 deletions(-)
diff --git a/block/nvme.c b/block/nvme.c
index 426c77e5ab..8dc68d3daa 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -39,6 +39,8 @@
*/
#define NVME_NUM_REQS (NVME_QUEUE_SIZE - 1)
+typedef struct BDRVNVMeState BDRVNVMeState;
+
typedef struct {
int32_t head, tail;
uint8_t *queue;
@@ -59,8 +61,11 @@ typedef struct {
typedef struct {
QemuMutex lock;
+ /* Read from I/O code path, initialized under BQL */
+ BDRVNVMeState *s;
+ int index;
+
/* Fields protected by BQL */
- int index;
uint8_t *prp_list_pages;
/* Fields protected by @lock */
@@ -96,7 +101,7 @@ typedef volatile struct {
QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
-typedef struct {
+struct BDRVNVMeState {
AioContext *aio_context;
QEMUVFIOState *vfio;
NVMeRegs *regs;
@@ -130,7 +135,7 @@ typedef struct {
/* PCI address (required for nvme_refresh_filename()) */
char *device;
-} BDRVNVMeState;
+};
#define NVME_BLOCK_OPT_DEVICE "device"
#define NVME_BLOCK_OPT_NAMESPACE "namespace"
@@ -174,7 +179,7 @@ static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
}
}
-static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
+static void nvme_free_queue_pair(NVMeQueuePair *q)
{
qemu_vfree(q->prp_list_pages);
qemu_vfree(q->sq.queue);
@@ -205,6 +210,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
uint64_t prp_list_iova;
qemu_mutex_init(&q->lock);
+ q->s = s;
q->index = idx;
qemu_co_queue_init(&q->free_req_queue);
q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
@@ -240,13 +246,15 @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
return q;
fail:
- nvme_free_queue_pair(bs, q);
+ nvme_free_queue_pair(q);
return NULL;
}
/* With q->lock */
-static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
+static void nvme_kick(NVMeQueuePair *q)
{
+ BDRVNVMeState *s = q->s;
+
if (s->plugged || !q->need_kick) {
return;
}
@@ -295,21 +303,20 @@ static void nvme_put_free_req_locked(NVMeQueuePair *q, NVMeRequest *req)
}
/* With q->lock */
-static void nvme_wake_free_req_locked(BDRVNVMeState *s, NVMeQueuePair *q)
+static void nvme_wake_free_req_locked(NVMeQueuePair *q)
{
if (!qemu_co_queue_empty(&q->free_req_queue)) {
- replay_bh_schedule_oneshot_event(s->aio_context,
+ replay_bh_schedule_oneshot_event(q->s->aio_context,
nvme_free_req_queue_cb, q);
}
}
/* Insert a request in the freelist and wake waiters */
-static void nvme_put_free_req_and_wake(BDRVNVMeState *s, NVMeQueuePair *q,
- NVMeRequest *req)
+static void nvme_put_free_req_and_wake(NVMeQueuePair *q, NVMeRequest *req)
{
qemu_mutex_lock(&q->lock);
nvme_put_free_req_locked(q, req);
- nvme_wake_free_req_locked(s, q);
+ nvme_wake_free_req_locked(q);
qemu_mutex_unlock(&q->lock);
}
@@ -336,8 +343,9 @@ static inline int nvme_translate_error(const NvmeCqe *c)
}
/* With q->lock */
-static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
+static bool nvme_process_completion(NVMeQueuePair *q)
{
+ BDRVNVMeState *s = q->s;
bool progress = false;
NVMeRequest *preq;
NVMeRequest req;
@@ -386,7 +394,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
/* Notify the device so it can post more completions. */
smp_mb_release();
*q->cq.doorbell = cpu_to_le32(q->cq.head);
- nvme_wake_free_req_locked(s, q);
+ nvme_wake_free_req_locked(q);
}
q->busy = false;
return progress;
@@ -403,8 +411,7 @@ static void nvme_trace_command(const NvmeCmd *cmd)
}
}
-static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
- NVMeRequest *req,
+static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req,
NvmeCmd *cmd, BlockCompletionFunc cb,
void *opaque)
{
@@ -413,15 +420,15 @@ static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
req->opaque = opaque;
cmd->cid = cpu_to_le32(req->cid);
- trace_nvme_submit_command(s, q->index, req->cid);
+ trace_nvme_submit_command(q->s, q->index, req->cid);
nvme_trace_command(cmd);
qemu_mutex_lock(&q->lock);
memcpy((uint8_t *)q->sq.queue +
q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
q->need_kick++;
- nvme_kick(s, q);
- nvme_process_completion(s, q);
+ nvme_kick(q);
+ nvme_process_completion(q);
qemu_mutex_unlock(&q->lock);
}
@@ -436,13 +443,12 @@ static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
NvmeCmd *cmd)
{
NVMeRequest *req;
- BDRVNVMeState *s = bs->opaque;
int ret = -EINPROGRESS;
req = nvme_get_free_req(q);
if (!req) {
return -EBUSY;
}
- nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
+ nvme_submit_command(q, req, cmd, nvme_cmd_sync_cb, &ret);
BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
return ret;
@@ -554,7 +560,7 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
}
qemu_mutex_lock(&q->lock);
- while (nvme_process_completion(s, q)) {
+ while (nvme_process_completion(q)) {
/* Keep polling */
progress = true;
}
@@ -592,7 +598,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
};
if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
error_setg(errp, "Failed to create io queue [%d]", n);
- nvme_free_queue_pair(bs, q);
+ nvme_free_queue_pair(q);
return false;
}
cmd = (NvmeCmd) {
@@ -603,7 +609,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
};
if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
error_setg(errp, "Failed to create io queue [%d]", n);
- nvme_free_queue_pair(bs, q);
+ nvme_free_queue_pair(q);
return false;
}
s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
@@ -798,7 +804,7 @@ static void nvme_close(BlockDriverState *bs)
BDRVNVMeState *s = bs->opaque;
for (i = 0; i < s->nr_queues; ++i) {
- nvme_free_queue_pair(bs, s->queues[i]);
+ nvme_free_queue_pair(s->queues[i]);
}
g_free(s->queues);
aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
@@ -1028,10 +1034,10 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
qemu_co_mutex_unlock(&s->dma_map_lock);
if (r) {
- nvme_put_free_req_and_wake(s, ioq, req);
+ nvme_put_free_req_and_wake(ioq, req);
return r;
}
- nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+ nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
data.co = qemu_coroutine_self();
while (data.ret == -EINPROGRESS) {
@@ -1131,7 +1137,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
assert(s->nr_queues > 1);
req = nvme_get_free_req(ioq);
assert(req);
- nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+ nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
data.co = qemu_coroutine_self();
if (data.ret == -EINPROGRESS) {
@@ -1184,7 +1190,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
req = nvme_get_free_req(ioq);
assert(req);
- nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+ nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
data.co = qemu_coroutine_self();
while (data.ret == -EINPROGRESS) {
@@ -1245,13 +1251,13 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
qemu_co_mutex_unlock(&s->dma_map_lock);
if (ret) {
- nvme_put_free_req_and_wake(s, ioq, req);
+ nvme_put_free_req_and_wake(ioq, req);
goto out;
}
trace_nvme_dsm(s, offset, bytes);
- nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+ nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
data.co = qemu_coroutine_self();
while (data.ret == -EINPROGRESS) {
@@ -1333,8 +1339,8 @@ static void nvme_aio_unplug(BlockDriverState *bs)
for (i = 1; i < s->nr_queues; i++) {
NVMeQueuePair *q = s->queues[i];
qemu_mutex_lock(&q->lock);
- nvme_kick(s, q);
- nvme_process_completion(s, q);
+ nvme_kick(q);
+ nvme_process_completion(q);
qemu_mutex_unlock(&q->lock);
}
}
--
2.26.2
* [PATCH v2 7/7] block/nvme: support nested aio_poll()
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
` (5 preceding siblings ...)
2020-06-17 13:22 ` [PATCH v2 6/7] block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair Stefan Hajnoczi
@ 2020-06-17 13:22 ` Stefan Hajnoczi
2020-06-23 14:46 ` [PATCH v2 0/7] " Stefan Hajnoczi
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17 13:22 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Fam Zheng, Sergio Lopez, qemu-block, Max Reitz,
Stefan Hajnoczi, Philippe Mathieu-Daudé
QEMU block drivers are supposed to support aio_poll() from I/O
completion callback functions. This means completion processing must be
re-entrant.
The standard approach is to schedule a BH during completion processing
and cancel it at the end of processing. If aio_poll() is invoked by a
callback function then the BH will run. The BH continues the suspended
completion processing.
All of this means that request A's cb() can synchronously wait for
request B to complete. Previously the nvme block driver would hang
because it didn't process completions from nested aio_poll().
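The schedule-at-entry, cancel-at-exit BH pattern can be modeled with a toy event loop (all names hypothetical; a boolean flag stands in for a real QEMU bottom half, and fake_aio_poll() stands in for aio_poll()):

```c
#include <assert.h>
#include <stdbool.h>

static bool bh_scheduled;
static int completions_processed;
static int pending = 2;     /* two outstanding completions */
static bool reentered;

static void process_completions(void);

/* Stand-in for aio_poll(): runs the pending BH, if any. */
static void fake_aio_poll(void)
{
    if (bh_scheduled) {
        bh_scheduled = false;
        process_completions(); /* the BH resumes completion processing */
    }
}

static void completion_cb(void)
{
    completions_processed++;
    if (!reentered) {
        reentered = true;
        fake_aio_poll(); /* a cb() that waits for another request */
    }
}

static void process_completions(void)
{
    bh_scheduled = true;  /* qemu_bh_schedule(): visible to nested polls */
    while (pending > 0) {
        pending--;
        completion_cb();
    }
    bh_scheduled = false; /* qemu_bh_cancel() on the way out */
}
```

The nested fake_aio_poll() call drains the remaining completion instead of hanging, which is exactly the property the real patch adds.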
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Sergio Lopez <slp@redhat.com>
---
block/nvme.c | 67 ++++++++++++++++++++++++++++++++++++++++------
block/trace-events | 2 +-
2 files changed, 60 insertions(+), 9 deletions(-)
diff --git a/block/nvme.c b/block/nvme.c
index 8dc68d3daa..374e268915 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -74,9 +74,11 @@ typedef struct {
int cq_phase;
int free_req_head;
NVMeRequest reqs[NVME_NUM_REQS];
- bool busy;
int need_kick;
int inflight;
+
+ /* Thread-safe, no lock necessary */
+ QEMUBH *completion_bh;
} NVMeQueuePair;
/* Memory mapped registers */
@@ -140,6 +142,8 @@ struct BDRVNVMeState {
#define NVME_BLOCK_OPT_DEVICE "device"
#define NVME_BLOCK_OPT_NAMESPACE "namespace"
+static void nvme_process_completion_bh(void *opaque);
+
static QemuOptsList runtime_opts = {
.name = "nvme",
.head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
@@ -181,6 +185,9 @@ static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
static void nvme_free_queue_pair(NVMeQueuePair *q)
{
+ if (q->completion_bh) {
+ qemu_bh_delete(q->completion_bh);
+ }
qemu_vfree(q->prp_list_pages);
qemu_vfree(q->sq.queue);
qemu_vfree(q->cq.queue);
@@ -214,6 +221,8 @@ static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
q->index = idx;
qemu_co_queue_init(&q->free_req_queue);
q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_NUM_REQS);
+ q->completion_bh = aio_bh_new(bdrv_get_aio_context(bs),
+ nvme_process_completion_bh, q);
r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
s->page_size * NVME_NUM_REQS,
false, &prp_list_iova);
@@ -352,11 +361,21 @@ static bool nvme_process_completion(NVMeQueuePair *q)
NvmeCqe *c;
trace_nvme_process_completion(s, q->index, q->inflight);
- if (q->busy || s->plugged) {
- trace_nvme_process_completion_queue_busy(s, q->index);
+ if (s->plugged) {
+ trace_nvme_process_completion_queue_plugged(s, q->index);
return false;
}
- q->busy = true;
+
+ /*
+ * Support re-entrancy when a request cb() function invokes aio_poll().
+ * Pending completions must be visible to aio_poll() so that a cb()
+ * function can wait for the completion of another request.
+ *
+ * The aio_poll() loop will execute our BH and we'll resume completion
+ * processing there.
+ */
+ qemu_bh_schedule(q->completion_bh);
+
assert(q->inflight >= 0);
while (q->inflight) {
int ret;
@@ -384,10 +403,10 @@ static bool nvme_process_completion(NVMeQueuePair *q)
assert(req.cb);
nvme_put_free_req_locked(q, preq);
preq->cb = preq->opaque = NULL;
- qemu_mutex_unlock(&q->lock);
- req.cb(req.opaque, ret);
- qemu_mutex_lock(&q->lock);
q->inflight--;
+ qemu_mutex_unlock(&q->lock);
+ req.cb(req.opaque, ret);
+ qemu_mutex_lock(&q->lock);
progress = true;
}
if (progress) {
@@ -396,10 +415,28 @@ static bool nvme_process_completion(NVMeQueuePair *q)
*q->cq.doorbell = cpu_to_le32(q->cq.head);
nvme_wake_free_req_locked(q);
}
- q->busy = false;
+
+ qemu_bh_cancel(q->completion_bh);
+
return progress;
}
+static void nvme_process_completion_bh(void *opaque)
+{
+ NVMeQueuePair *q = opaque;
+
+ /*
+ * We're being invoked because a nvme_process_completion() cb() function
+ * called aio_poll(). The callback may be waiting for further completions
+ * so notify the device that it has space to fill in more completions now.
+ */
+ smp_mb_release();
+ *q->cq.doorbell = cpu_to_le32(q->cq.head);
+ nvme_wake_free_req_locked(q);
+
+ nvme_process_completion(q);
+}
+
static void nvme_trace_command(const NvmeCmd *cmd)
{
int i;
@@ -1309,6 +1346,13 @@ static void nvme_detach_aio_context(BlockDriverState *bs)
{
BDRVNVMeState *s = bs->opaque;
+ for (int i = 0; i < s->nr_queues; i++) {
+ NVMeQueuePair *q = s->queues[i];
+
+ qemu_bh_delete(q->completion_bh);
+ q->completion_bh = NULL;
+ }
+
aio_set_event_notifier(bdrv_get_aio_context(bs), &s->irq_notifier,
false, NULL, NULL);
}
@@ -1321,6 +1365,13 @@ static void nvme_attach_aio_context(BlockDriverState *bs,
s->aio_context = new_context;
aio_set_event_notifier(new_context, &s->irq_notifier,
false, nvme_handle_event, nvme_poll_cb);
+
+ for (int i = 0; i < s->nr_queues; i++) {
+ NVMeQueuePair *q = s->queues[i];
+
+ q->completion_bh =
+ aio_bh_new(new_context, nvme_process_completion_bh, q);
+ }
}
static void nvme_aio_plug(BlockDriverState *bs)
diff --git a/block/trace-events b/block/trace-events
index 29dff8881c..dbe76a7613 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -158,7 +158,7 @@ nvme_kick(void *s, int queue) "s %p queue %d"
nvme_dma_flush_queue_wait(void *s) "s %p"
nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
nvme_process_completion(void *s, int index, int inflight) "s %p queue %d inflight %d"
-nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d"
+nvme_process_completion_queue_plugged(void *s, int index) "s %p queue %d"
nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d"
nvme_submit_command(void *s, int index, int cid) "s %p queue %d cid %d"
nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x"
--
2.26.2
* Re: [PATCH v2 1/7] block/nvme: poll queues without q->lock
2020-06-17 13:21 ` [PATCH v2 1/7] block/nvme: poll queues without q->lock Stefan Hajnoczi
@ 2020-06-22 13:54 ` Sergio Lopez
0 siblings, 0 replies; 10+ messages in thread
From: Sergio Lopez @ 2020-06-22 13:54 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Fam Zheng, qemu-block, qemu-devel, Max Reitz,
Philippe Mathieu-Daudé
On Wed, Jun 17, 2020 at 02:21:55PM +0100, Stefan Hajnoczi wrote:
> A lot of CPU time is spent simply locking/unlocking q->lock during
> polling. Check for completion outside the lock to make q->lock disappear
> from the profile.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> block/nvme.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/block/nvme.c b/block/nvme.c
> index eb2f54dd9d..e4375ec245 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -512,6 +512,18 @@ static bool nvme_poll_queues(BDRVNVMeState *s)
>
> for (i = 0; i < s->nr_queues; i++) {
> NVMeQueuePair *q = s->queues[i];
> + const size_t cqe_offset = q->cq.head * NVME_CQ_ENTRY_BYTES;
> + NvmeCqe *cqe = (NvmeCqe *)&q->cq.queue[cqe_offset];
> +
> + /*
> + * Do an early check for completions. q->lock isn't needed because
> + * nvme_process_completion() only runs in the event loop thread and
> + * cannot race with itself.
> + */
> + if ((le16_to_cpu(cqe->status) & 0x1) == q->cq_phase) {
> + continue;
> + }
> +
> qemu_mutex_lock(&q->lock);
> while (nvme_process_completion(s, q)) {
> /* Keep polling */
> --
> 2.26.2
>
Thanks for extending the comment.
Reviewed-by: Sergio Lopez <slp@redhat.com>
* Re: [PATCH v2 0/7] block/nvme: support nested aio_poll()
2020-06-17 13:21 [PATCH v2 0/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
` (6 preceding siblings ...)
2020-06-17 13:22 ` [PATCH v2 7/7] block/nvme: support nested aio_poll() Stefan Hajnoczi
@ 2020-06-23 14:46 ` Stefan Hajnoczi
7 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2020-06-23 14:46 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Fam Zheng, qemu-block, qemu-devel, Max Reitz,
Philippe Mathieu-Daudé
On Wed, Jun 17, 2020 at 02:21:54PM +0100, Stefan Hajnoczi wrote:
> v2:
> * Reword comment in Patch 1 explaining why it's safe not to hold q->lock [Sergio]
> * Fix s/unwiedly/unwieldy/ typo in the Patch 6 commit description [Philippe]
>
> This series allows aio_poll() to work from I/O request completion callbacks.
> QEMU block drivers are supposed to support this because some code paths rely on
> this behavior.
>
> There was no measurable performance difference with nested aio_poll() support.
>
> This patch series also contains cleanups that I made while examining and
> benchmarking the code.
>
> Stefan Hajnoczi (7):
> block/nvme: poll queues without q->lock
> block/nvme: drop tautologous assertion
> block/nvme: don't access CQE after moving cq.head
> block/nvme: switch to a NVMeRequest freelist
> block/nvme: clarify that free_req_queue is protected by q->lock
> block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair
> block/nvme: support nested aio_poll()
>
> block/nvme.c | 220 ++++++++++++++++++++++++++++++++-------------
> block/trace-events | 2 +-
> 2 files changed, 160 insertions(+), 62 deletions(-)
>
> --
> 2.26.2
>
Thanks, applied to my HEAD tree:
https://github.com/stefanha/qemu/commits/HEAD
Stefan