* [PATCH 0/3] virtio_blk: blk-mq io_poll support
@ 2021-05-20 14:13 Stefan Hajnoczi
  2021-05-20 14:13 ` [PATCH 1/3] virtio: add virtioqueue_more_used() Stefan Hajnoczi
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-20 14:13 UTC (permalink / raw)
  To: virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Jason Wang,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin,
	Stefan Hajnoczi

This patch series implements blk_mq_ops->poll() so REQ_HIPRI requests can be
polled. IOPS for 4k and 16k block sizes increases by 5-18% on a virtio-blk
device with 4 virtqueues backed by an NVMe drive.

- Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
- Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
- Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
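
For reference, a full fio invocation for one row of the table below looks
something like this (only the options listed above are from the actual runs;
the job name, target device, and runtime are illustrative):

  fio --name=bench --ioengine=pvsync2 --hipri=1 --numjobs=4 --direct=1 \
      --rw=randread --bs=4k --filename=/dev/vda --runtime=60 --time_based \
      --group_reporting

hipri=1 makes fio issue preadv2()/pwritev2() with RWF_HIPRI; hipri=0 (the
default) uses regular irq-driven completion.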

rw          bs hipri=0 hipri=1
------------------------------
randread    4k 149,426 170,763 +14%
randread   16k 118,939 134,269 +12%
randread   64k  34,886  34,906   0%
randread  128k  17,655  17,667   0%
randwrite   4k 138,578 163,600 +18%
randwrite  16k 102,089 120,950 +18%
randwrite  64k  32,364  32,561   0%
randwrite 128k  16,154  16,237   0%
read        4k 146,032 170,620 +16%
read       16k 117,097 130,437 +11%
read       64k  34,834  35,037   0%
read      128k  17,680  17,658   0%
write       4k 134,562 151,422 +12%
write      16k 101,796 107,606  +5%
write      64k  32,364  32,594   0%
write     128k  16,259  16,265   0%

Larger block sizes do not benefit from polling as much but the
improvement is worthwhile for smaller block sizes.

Stefan Hajnoczi (3):
  virtio: add virtioqueue_more_used()
  virtio_blk: avoid repeating vblk->vqs[qid]
  virtio_blk: implement blk_mq_ops->poll()

 include/linux/virtio.h       |   2 +
 drivers/block/virtio_blk.c   | 126 +++++++++++++++++++++++++++++------
 drivers/virtio/virtio_ring.c |  17 +++++
 3 files changed, 123 insertions(+), 22 deletions(-)

-- 
2.31.1



* [PATCH 1/3] virtio: add virtioqueue_more_used()
  2021-05-20 14:13 [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
@ 2021-05-20 14:13 ` Stefan Hajnoczi
  2021-05-25  2:23   ` Jason Wang
  2021-05-20 14:13 ` [PATCH 2/3] virtio_blk: avoid repeating vblk->vqs[qid] Stefan Hajnoczi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-20 14:13 UTC (permalink / raw)
  To: virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Jason Wang,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin,
	Stefan Hajnoczi

Add an API to check whether there are pending used buffers. There is
already a similar API called virtqueue_poll() but it only works together
with virtqueue_enable_cb_prepare(). The patches that follow add blk-mq
->poll() support to virtio_blk and they need to check for used buffers
without re-enabling virtqueue callbacks, so introduce an API for it.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/linux/virtio.h       |  2 ++
 drivers/virtio/virtio_ring.c | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b1894e0323fa..c6ad0f25f412 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -63,6 +63,8 @@ bool virtqueue_kick_prepare(struct virtqueue *vq);
 
 bool virtqueue_notify(struct virtqueue *vq);
 
+bool virtqueue_more_used(const struct virtqueue *vq);
+
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
 void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71e16b53e9c1..7c3da75da462 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -2032,6 +2032,23 @@ static inline bool more_used(const struct vring_virtqueue *vq)
 	return vq->packed_ring ? more_used_packed(vq) : more_used_split(vq);
 }
 
+/**
+ * virtqueue_more_used - check if there are used buffers pending
+ * @_vq: the struct virtqueue we're talking about.
+ *
+ * Returns true if there are used buffers, false otherwise. May be called at
+ * the same time as other virtqueue operations, but actually calling
+ * virtqueue_get_buf() requires serialization so be mindful of the race between
+ * calling virtqueue_more_used() and virtqueue_get_buf().
+ */
+bool virtqueue_more_used(const struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	return more_used(vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_more_used);
+
 irqreturn_t vring_interrupt(int irq, void *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-- 
2.31.1



* [PATCH 2/3] virtio_blk: avoid repeating vblk->vqs[qid]
  2021-05-20 14:13 [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
  2021-05-20 14:13 ` [PATCH 1/3] virtio: add virtioqueue_more_used() Stefan Hajnoczi
@ 2021-05-20 14:13 ` Stefan Hajnoczi
  2021-05-25  2:25   ` Jason Wang
  2021-05-20 14:13 ` [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll() Stefan Hajnoczi
  2021-06-03 15:30 ` [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
  3 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-20 14:13 UTC (permalink / raw)
  To: virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Jason Wang,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin,
	Stefan Hajnoczi

struct virtio_blk_vq is accessed in many places. Introduce "vbq" local
variables to avoid repeating vblk->vqs[qid] throughout the code. The
patches that follow will add more accesses, making the payoff even
greater.

virtio_commit_rqs() calls its local variable "vq", which is easily
confused with struct virtqueue. Rename it to "vbq" for clarity.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/block/virtio_blk.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b9fa3ef5b57c..fc0fb1dcd399 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -174,16 +174,16 @@ static inline void virtblk_request_done(struct request *req)
 static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
+	struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
 	bool req_done = false;
-	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
-	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
+	spin_lock_irqsave(&vbq->lock, flags);
 	do {
 		virtqueue_disable_cb(vq);
-		while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != NULL) {
+		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
 			struct request *req = blk_mq_rq_from_pdu(vbr);
 
 			if (likely(!blk_should_fake_timeout(req->q)))
@@ -197,32 +197,32 @@ static void virtblk_done(struct virtqueue *vq)
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
-	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
+	spin_unlock_irqrestore(&vbq->lock, flags);
 }
 
 static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
 {
 	struct virtio_blk *vblk = hctx->queue->queuedata;
-	struct virtio_blk_vq *vq = &vblk->vqs[hctx->queue_num];
+	struct virtio_blk_vq *vbq = &vblk->vqs[hctx->queue_num];
 	bool kick;
 
-	spin_lock_irq(&vq->lock);
-	kick = virtqueue_kick_prepare(vq->vq);
-	spin_unlock_irq(&vq->lock);
+	spin_lock_irq(&vbq->lock);
+	kick = virtqueue_kick_prepare(vbq->vq);
+	spin_unlock_irq(&vbq->lock);
 
 	if (kick)
-		virtqueue_notify(vq->vq);
+		virtqueue_notify(vbq->vq);
 }
 
 static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 			   const struct blk_mq_queue_data *bd)
 {
 	struct virtio_blk *vblk = hctx->queue->queuedata;
+	struct virtio_blk_vq *vbq = &vblk->vqs[hctx->queue_num];
 	struct request *req = bd->rq;
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
-	int qid = hctx->queue_num;
 	int err;
 	bool notify = false;
 	bool unmap = false;
@@ -274,16 +274,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 			vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
 	}
 
-	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
-	err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
+	spin_lock_irqsave(&vbq->lock, flags);
+	err = virtblk_add_req(vbq->vq, vbr, vbr->sg, num);
 	if (err) {
-		virtqueue_kick(vblk->vqs[qid].vq);
+		virtqueue_kick(vbq->vq);
 		/* Don't stop the queue if -ENOMEM: we may have failed to
 		 * bounce the buffer due to global resource outage.
 		 */
 		if (err == -ENOSPC)
 			blk_mq_stop_hw_queue(hctx);
-		spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
+		spin_unlock_irqrestore(&vbq->lock, flags);
 		switch (err) {
 		case -ENOSPC:
 			return BLK_STS_DEV_RESOURCE;
@@ -294,12 +294,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		}
 	}
 
-	if (bd->last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
+	if (bd->last && virtqueue_kick_prepare(vbq->vq))
 		notify = true;
-	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
+	spin_unlock_irqrestore(&vbq->lock, flags);
 
 	if (notify)
-		virtqueue_notify(vblk->vqs[qid].vq);
+		virtqueue_notify(vbq->vq);
 	return BLK_STS_OK;
 }
 
-- 
2.31.1



* [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-20 14:13 [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
  2021-05-20 14:13 ` [PATCH 1/3] virtio: add virtioqueue_more_used() Stefan Hajnoczi
  2021-05-20 14:13 ` [PATCH 2/3] virtio_blk: avoid repeating vblk->vqs[qid] Stefan Hajnoczi
@ 2021-05-20 14:13 ` Stefan Hajnoczi
  2021-05-24 14:59   ` Christoph Hellwig
                     ` (2 more replies)
  2021-06-03 15:30 ` [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
  3 siblings, 3 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-20 14:13 UTC (permalink / raw)
  To: virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Jason Wang,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin,
	Stefan Hajnoczi

Request completion latency can be reduced by using polling instead of
irqs. Even Posted Interrupts or similar hardware support doesn't beat
polling. The reason is that disabling virtqueue notifications saves
critical-path CPU cycles on the host by skipping irq injection and in
the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
support to virtio_blk.

The approach taken by this patch differs from the NVMe driver's
approach. NVMe dedicates hardware queues to polling and submits
REQ_HIPRI requests only on those queues. This patch does not require
exclusive polling queues for virtio_blk. Instead, it switches between
irqs and polling when one or more REQ_HIPRI requests are in flight on a
virtqueue.

This is possible because toggling virtqueue notifications is cheap even
while the virtqueue is running. NVMe cqs can't do this because irqs are
only enabled/disabled at queue creation time.

This toggling approach requires no configuration. There is no need to
dedicate queues ahead of time or to teach users and orchestration tools
how to set up polling queues.

Possible drawbacks of this approach:

- Hardware virtio_blk implementations may find virtqueue_disable_cb()
  expensive since it requires DMA. If such devices become popular then
  the virtio_blk driver could use a similar approach to NVMe when
  VIRTIO_F_ACCESS_PLATFORM is detected in the future.

- If a blk_poll() thread is descheduled it not only hurts polling
  performance but also delays completion of non-REQ_HIPRI requests on
  that virtqueue since vq notifications are disabled.

Performance:

- Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
- Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
- Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

rw          bs hipri=0 hipri=1
------------------------------
randread    4k 149,426 170,763 +14%
randread   16k 118,939 134,269 +12%
randread   64k  34,886  34,906   0%
randread  128k  17,655  17,667   0%
randwrite   4k 138,578 163,600 +18%
randwrite  16k 102,089 120,950 +18%
randwrite  64k  32,364  32,561   0%
randwrite 128k  16,154  16,237   0%
read        4k 146,032 170,620 +16%
read       16k 117,097 130,437 +11%
read       64k  34,834  35,037   0%
read      128k  17,680  17,658   0%
write       4k 134,562 151,422 +12%
write      16k 101,796 107,606  +5%
write      64k  32,364  32,594   0%
write     128k  16,259  16,265   0%

Larger block sizes do not benefit from polling as much but the
improvement is worthwhile for smaller block sizes.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/block/virtio_blk.c | 92 +++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index fc0fb1dcd399..f0243dcd745a 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
 struct virtio_blk_vq {
 	struct virtqueue *vq;
 	spinlock_t lock;
+
+	/* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
+	unsigned int num_lopri;
+
+	/* Number of REQ_HIPRI requests in flight. Protected by lock. */
+	unsigned int num_hipri;
+
+	/* Are vq notifications enabled? Protected by lock. */
+	bool cb_enabled;
+
 	char name[VQ_NAME_LEN];
 } ____cacheline_aligned_in_smp;
 
@@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct request *req)
 	blk_mq_end_request(req, virtblk_result(vbr));
 }
 
-static void virtblk_done(struct virtqueue *vq)
+/* Returns true if one or more requests completed */
+static bool virtblk_complete_requests(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
 	struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
 	bool req_done = false;
+	bool last_hipri_done = false;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
 	spin_lock_irqsave(&vbq->lock, flags);
+
 	do {
-		virtqueue_disable_cb(vq);
+		if (vbq->cb_enabled)
+			virtqueue_disable_cb(vq);
 		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
 			struct request *req = blk_mq_rq_from_pdu(vbr);
 
+			if (req->cmd_flags & REQ_HIPRI) {
+				if (--vbq->num_hipri == 0)
+					last_hipri_done = true;
+			} else
+				vbq->num_lopri--;
+
 			if (likely(!blk_should_fake_timeout(req->q)))
 				blk_mq_complete_request(req);
 			req_done = true;
 		}
 		if (unlikely(virtqueue_is_broken(vq)))
 			break;
-	} while (!virtqueue_enable_cb(vq));
+
+		/* Enable vq notifications if non-polled requests remain */
+		if (last_hipri_done && vbq->num_lopri > 0) {
+			last_hipri_done = false;
+			vbq->cb_enabled = true;
+		}
+	} while (vbq->cb_enabled && !virtqueue_enable_cb(vq));
 
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
 	spin_unlock_irqrestore(&vbq->lock, flags);
+
+	return req_done;
+}
+
+static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
+{
+	struct virtio_blk *vblk = hctx->queue->queuedata;
+	struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
+
+	if (!virtqueue_more_used(vq))
+		return 0;
+
+	return virtblk_complete_requests(vq);
+}
+
+static void virtblk_done(struct virtqueue *vq)
+{
+	virtblk_complete_requests(vq);
 }
 
 static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
@@ -275,6 +319,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 	}
 
 	spin_lock_irqsave(&vbq->lock, flags);
+
+	/* Re-enable vq notifications if first req is non-polling */
+	if (!(req->cmd_flags & REQ_HIPRI) &&
+	    vbq->num_lopri == 0 && vbq->num_hipri == 0 &&
+	    !vbq->cb_enabled) {
+		/* Can't return false since there are no in-flight reqs */
+		virtqueue_enable_cb(vbq->vq);
+		vbq->cb_enabled = true;
+	}
+
 	err = virtblk_add_req(vbq->vq, vbr, vbr->sg, num);
 	if (err) {
 		virtqueue_kick(vbq->vq);
@@ -294,6 +348,21 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		}
 	}
 
+	/*
+	 * Disable vq notifications when polled reqs are submitted.
+	 *
+	 * The virtqueue lock is held so req is still valid here even if the
+	 * device polls the virtqueue and completes the request before we call
+	 * virtqueue_notify().
+	 */
+	if (req->cmd_flags & REQ_HIPRI) {
+		if (vbq->num_hipri++ == 0 && vbq->cb_enabled) {
+			virtqueue_disable_cb(vbq->vq);
+			vbq->cb_enabled = false;
+		}
+	} else
+		vbq->num_lopri++;
+
 	if (bd->last && virtqueue_kick_prepare(vbq->vq))
 		notify = true;
 	spin_unlock_irqrestore(&vbq->lock, flags);
@@ -533,6 +602,9 @@ static int init_vq(struct virtio_blk *vblk)
 	for (i = 0; i < num_vqs; i++) {
 		spin_lock_init(&vblk->vqs[i].lock);
 		vblk->vqs[i].vq = vqs[i];
+		vblk->vqs[i].num_lopri = 0;
+		vblk->vqs[i].num_hipri = 0;
+		vblk->vqs[i].cb_enabled = true;
 	}
 	vblk->num_vqs = num_vqs;
 
@@ -681,8 +753,16 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
 {
 	struct virtio_blk *vblk = set->driver_data;
 
-	return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
-					vblk->vdev, 0);
+	set->map[HCTX_TYPE_DEFAULT].nr_queues = vblk->num_vqs;
+	blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT], vblk->vdev, 0);
+
+	set->map[HCTX_TYPE_READ].nr_queues = 0;
+
+	/* HCTX_TYPE_DEFAULT queues are shared with HCTX_TYPE_POLL */
+	set->map[HCTX_TYPE_POLL].nr_queues = vblk->num_vqs;
+	blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_POLL], vblk->vdev, 0);
+
+	return 0;
 }
 
 static const struct blk_mq_ops virtio_mq_ops = {
@@ -691,6 +771,7 @@ static const struct blk_mq_ops virtio_mq_ops = {
 	.complete	= virtblk_request_done,
 	.init_request	= virtblk_init_request,
 	.map_queues	= virtblk_map_queues,
+	.poll		= virtblk_poll,
 };
 
 static unsigned int virtblk_queue_depth;
@@ -768,6 +849,7 @@ static int virtblk_probe(struct virtio_device *vdev)
 
 	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
 	vblk->tag_set.ops = &virtio_mq_ops;
+	vblk->tag_set.nr_maps = 3; /* default, read, and poll */
 	vblk->tag_set.queue_depth = queue_depth;
 	vblk->tag_set.numa_node = NUMA_NO_NODE;
 	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
-- 
2.31.1



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-20 14:13 ` [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll() Stefan Hajnoczi
@ 2021-05-24 14:59   ` Christoph Hellwig
  2021-05-25  7:22     ` Paolo Bonzini
  2021-05-25 13:19     ` Stefan Hajnoczi
  2021-05-25  3:21   ` Jason Wang
  2021-05-27  2:44   ` Ming Lei
  2 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-05-24 14:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Jason Wang, Paolo Bonzini, Jens Axboe, slp, sgarzare,
	Michael S. Tsirkin

On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> Possible drawbacks of this approach:
> 
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>   expensive since it requires DMA. If such devices become popular then
>   the virtio_blk driver could use a similar approach to NVMe when
>   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> 
> - If a blk_poll() thread is descheduled it not only hurts polling
>   performance but also delays completion of non-REQ_HIPRI requests on
>   that virtqueue since vq notifications are disabled.

Yes, I think this is a dangerous configuration.  What argument exists
against just using dedicated poll queues?


* Re: [PATCH 1/3] virtio: add virtioqueue_more_used()
  2021-05-20 14:13 ` [PATCH 1/3] virtio: add virtioqueue_more_used() Stefan Hajnoczi
@ 2021-05-25  2:23   ` Jason Wang
  2021-05-25  8:48     ` Stefan Hajnoczi
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Wang @ 2021-05-25  2:23 UTC (permalink / raw)
  To: Stefan Hajnoczi, virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Paolo Bonzini,
	Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> Add an API to check whether there are pending used buffers. There is
> already a similar API called virtqueue_poll() but it only works together
> with virtqueue_enable_cb_prepare(). The patches that follow add blk-mq
> ->poll() support to virtio_blk and they need to check for used buffers
> without re-enabling virtqueue callbacks, so introduce an API for it.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>


Typo in the subject.


> ---
>   include/linux/virtio.h       |  2 ++
>   drivers/virtio/virtio_ring.c | 17 +++++++++++++++++
>   2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index b1894e0323fa..c6ad0f25f412 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -63,6 +63,8 @@ bool virtqueue_kick_prepare(struct virtqueue *vq);
>   
>   bool virtqueue_notify(struct virtqueue *vq);
>   
> +bool virtqueue_more_used(const struct virtqueue *vq);
> +
>   void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
>   
>   void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len,
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 71e16b53e9c1..7c3da75da462 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -2032,6 +2032,23 @@ static inline bool more_used(const struct vring_virtqueue *vq)
>   	return vq->packed_ring ? more_used_packed(vq) : more_used_split(vq);
>   }
>   
> +/**
> + * virtqueue_more_used - check if there are used buffers pending
> + * @_vq: the struct virtqueue we're talking about.
> + *
> + * Returns true if there are used buffers, false otherwise. May be called at
> + * the same time as other virtqueue operations, but actually calling
> + * virtqueue_get_buf() requires serialization so be mindful of the race between
> + * calling virtqueue_more_used() and virtqueue_get_buf().
> + */
> +bool virtqueue_more_used(const struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +
> +	return more_used(vq);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_more_used);


It's worth mentioning that the function is not serialized (no barriers).

Thanks


> +
>   irqreturn_t vring_interrupt(int irq, void *_vq)
>   {
>   	struct vring_virtqueue *vq = to_vvq(_vq);



* Re: [PATCH 2/3] virtio_blk: avoid repeating vblk->vqs[qid]
  2021-05-20 14:13 ` [PATCH 2/3] virtio_blk: avoid repeating vblk->vqs[qid] Stefan Hajnoczi
@ 2021-05-25  2:25   ` Jason Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Wang @ 2021-05-25  2:25 UTC (permalink / raw)
  To: Stefan Hajnoczi, virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Paolo Bonzini,
	Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> struct virtio_blk_vq is accessed in many places. Introduce "vbq" local
> variables to avoid repeating vblk->vqs[qid] throughout the code. The
> patches that follow will add more accesses, making the payoff even
> greater.
>
> virtio_commit_rqs() calls its local variable "vq", which is easily
> confused with struct virtqueue. Rename it to "vbq" for clarity.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>


Acked-by: Jason Wang <jasowang@redhat.com>


> ---
>   drivers/block/virtio_blk.c | 34 +++++++++++++++++-----------------
>   1 file changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index b9fa3ef5b57c..fc0fb1dcd399 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -174,16 +174,16 @@ static inline void virtblk_request_done(struct request *req)
>   static void virtblk_done(struct virtqueue *vq)
>   {
>   	struct virtio_blk *vblk = vq->vdev->priv;
> +	struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
>   	bool req_done = false;
> -	int qid = vq->index;
>   	struct virtblk_req *vbr;
>   	unsigned long flags;
>   	unsigned int len;
>   
> -	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
> +	spin_lock_irqsave(&vbq->lock, flags);
>   	do {
>   		virtqueue_disable_cb(vq);
> -		while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != NULL) {
> +		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
>   			struct request *req = blk_mq_rq_from_pdu(vbr);
>   
>   			if (likely(!blk_should_fake_timeout(req->q)))
> @@ -197,32 +197,32 @@ static void virtblk_done(struct virtqueue *vq)
>   	/* In case queue is stopped waiting for more buffers. */
>   	if (req_done)
>   		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
> -	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
> +	spin_unlock_irqrestore(&vbq->lock, flags);
>   }
>   
>   static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
>   {
>   	struct virtio_blk *vblk = hctx->queue->queuedata;
> -	struct virtio_blk_vq *vq = &vblk->vqs[hctx->queue_num];
> +	struct virtio_blk_vq *vbq = &vblk->vqs[hctx->queue_num];
>   	bool kick;
>   
> -	spin_lock_irq(&vq->lock);
> -	kick = virtqueue_kick_prepare(vq->vq);
> -	spin_unlock_irq(&vq->lock);
> +	spin_lock_irq(&vbq->lock);
> +	kick = virtqueue_kick_prepare(vbq->vq);
> +	spin_unlock_irq(&vbq->lock);
>   
>   	if (kick)
> -		virtqueue_notify(vq->vq);
> +		virtqueue_notify(vbq->vq);
>   }
>   
>   static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>   			   const struct blk_mq_queue_data *bd)
>   {
>   	struct virtio_blk *vblk = hctx->queue->queuedata;
> +	struct virtio_blk_vq *vbq = &vblk->vqs[hctx->queue_num];
>   	struct request *req = bd->rq;
>   	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>   	unsigned long flags;
>   	unsigned int num;
> -	int qid = hctx->queue_num;
>   	int err;
>   	bool notify = false;
>   	bool unmap = false;
> @@ -274,16 +274,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>   			vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
>   	}
>   
> -	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
> -	err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
> +	spin_lock_irqsave(&vbq->lock, flags);
> +	err = virtblk_add_req(vbq->vq, vbr, vbr->sg, num);
>   	if (err) {
> -		virtqueue_kick(vblk->vqs[qid].vq);
> +		virtqueue_kick(vbq->vq);
>   		/* Don't stop the queue if -ENOMEM: we may have failed to
>   		 * bounce the buffer due to global resource outage.
>   		 */
>   		if (err == -ENOSPC)
>   			blk_mq_stop_hw_queue(hctx);
> -		spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
> +		spin_unlock_irqrestore(&vbq->lock, flags);
>   		switch (err) {
>   		case -ENOSPC:
>   			return BLK_STS_DEV_RESOURCE;
> @@ -294,12 +294,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>   		}
>   	}
>   
> -	if (bd->last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
> +	if (bd->last && virtqueue_kick_prepare(vbq->vq))
>   		notify = true;
> -	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
> +	spin_unlock_irqrestore(&vbq->lock, flags);
>   
>   	if (notify)
> -		virtqueue_notify(vblk->vqs[qid].vq);
> +		virtqueue_notify(vbq->vq);
>   	return BLK_STS_OK;
>   }
>   



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-20 14:13 ` [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll() Stefan Hajnoczi
  2021-05-24 14:59   ` Christoph Hellwig
@ 2021-05-25  3:21   ` Jason Wang
  2021-05-25  8:59     ` Stefan Hajnoczi
  2021-05-27  2:44   ` Ming Lei
  2 siblings, 1 reply; 21+ messages in thread
From: Jason Wang @ 2021-05-25  3:21 UTC (permalink / raw)
  To: Stefan Hajnoczi, virtualization
  Cc: linux-block, linux-kernel, Christoph Hellwig, Paolo Bonzini,
	Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> Request completion latency can be reduced by using polling instead of
> irqs. Even Posted Interrupts or similar hardware support doesn't beat
> polling. The reason is that disabling virtqueue notifications saves
> critical-path CPU cycles on the host by skipping irq injection and in
> the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> support to virtio_blk.
>
> The approach taken by this patch differs from the NVMe driver's
> approach. NVMe dedicates hardware queues to polling and submits
> REQ_HIPRI requests only on those queues. This patch does not require
> exclusive polling queues for virtio_blk. Instead, it switches between
> irqs and polling when one or more REQ_HIPRI requests are in flight on a
> virtqueue.
>
> This is possible because toggling virtqueue notifications is cheap even
> while the virtqueue is running. NVMe cqs can't do this because irqs are
> only enabled/disabled at queue creation time.
>
> This toggling approach requires no configuration. There is no need to
> dedicate queues ahead of time or to teach users and orchestration tools
> how to set up polling queues.
>
> Possible drawbacks of this approach:
>
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>    expensive since it requires DMA.


Note that it's probably not related to the behavior of the driver but
to the design of the event suppression mechanism.

Device can choose to ignore the suppression flag and keep sending 
interrupts.


>   If such devices become popular then
>    the virtio_blk driver could use a similar approach to NVMe when
>    VIRTIO_F_ACCESS_PLATFORM is detected in the future.
>
> - If a blk_poll() thread is descheduled it not only hurts polling
>    performance but also delays completion of non-REQ_HIPRI requests on
>    that virtqueue since vq notifications are disabled.


Can we poll only when only high-pri requests are pending?

If the backend is a remote one, I think the polling may cost more CPU
cycles.


>
> Performance:
>
> - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
> - Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
> - CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
>
> rw          bs hipri=0 hipri=1
> ------------------------------
> randread    4k 149,426 170,763 +14%
> randread   16k 118,939 134,269 +12%
> randread   64k  34,886  34,906   0%
> randread  128k  17,655  17,667   0%
> randwrite   4k 138,578 163,600 +18%
> randwrite  16k 102,089 120,950 +18%
> randwrite  64k  32,364  32,561   0%
> randwrite 128k  16,154  16,237   0%
> read        4k 146,032 170,620 +16%
> read       16k 117,097 130,437 +11%
> read       64k  34,834  35,037   0%
> read      128k  17,680  17,658   0%
> write       4k 134,562 151,422 +12%
> write      16k 101,796 107,606  +5%
> write      64k  32,364  32,594   0%
> write     128k  16,259  16,265   0%
>
> Larger block sizes do not benefit from polling as much but the
> improvement is worthwhile for smaller block sizes.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   drivers/block/virtio_blk.c | 92 +++++++++++++++++++++++++++++++++++---
>   1 file changed, 87 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index fc0fb1dcd399..f0243dcd745a 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
>   struct virtio_blk_vq {
>   	struct virtqueue *vq;
>   	spinlock_t lock;
> +
> +	/* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
> +	unsigned int num_lopri;
> +
> +	/* Number of REQ_HIPRI requests in flight. Protected by lock. */
> +	unsigned int num_hipri;
> +
> +	/* Are vq notifications enabled? Protected by lock. */
> +	bool cb_enabled;


We had event_flag_shadow. Is it sufficient to introduce a new helper
virtqueue_cb_is_enabled()?
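
Something along these lines, as a rough sketch (the field and flag names
are taken from drivers/virtio/virtio_ring.c but this helper is my
assumption; untested):

	/* Sketch: report whether the driver-side shadow flags say
	 * callbacks (vq notifications) are currently enabled.
	 */
	bool virtqueue_cb_is_enabled(struct virtqueue *_vq)
	{
		struct vring_virtqueue *vq = to_vvq(_vq);

		if (vq->packed_ring)
			return vq->packed.event_flags_shadow !=
			       VRING_PACKED_EVENT_FLAG_DISABLE;

		return !(vq->split.avail_flags_shadow &
			 VRING_AVAIL_F_NO_INTERRUPT);
	}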


> +
>   	char name[VQ_NAME_LEN];
>   } ____cacheline_aligned_in_smp;
>   
> @@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct request *req)
>   	blk_mq_end_request(req, virtblk_result(vbr));
>   }
>   
> -static void virtblk_done(struct virtqueue *vq)
> +/* Returns true if one or more requests completed */
> +static bool virtblk_complete_requests(struct virtqueue *vq)
>   {
>   	struct virtio_blk *vblk = vq->vdev->priv;
>   	struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
>   	bool req_done = false;
> +	bool last_hipri_done = false;
>   	struct virtblk_req *vbr;
>   	unsigned long flags;
>   	unsigned int len;
>   
>   	spin_lock_irqsave(&vbq->lock, flags);
> +
>   	do {
> -		virtqueue_disable_cb(vq);
> +		if (vbq->cb_enabled)
> +			virtqueue_disable_cb(vq);
>   		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
>   			struct request *req = blk_mq_rq_from_pdu(vbr);
>   
> +			if (req->cmd_flags & REQ_HIPRI) {
> +				if (--vbq->num_hipri == 0)
> +					last_hipri_done = true;
> +			} else
> +				vbq->num_lopri--;
> +
>   			if (likely(!blk_should_fake_timeout(req->q)))
>   				blk_mq_complete_request(req);
>   			req_done = true;
>   		}
>   		if (unlikely(virtqueue_is_broken(vq)))
>   			break;
> -	} while (!virtqueue_enable_cb(vq));
> +
> +		/* Enable vq notifications if non-polled requests remain */
> +		if (last_hipri_done && vbq->num_lopri > 0) {
> +			last_hipri_done = false;
> +			vbq->cb_enabled = true;
> +		}
> +	} while (vbq->cb_enabled && !virtqueue_enable_cb(vq));
>   
>   	/* In case queue is stopped waiting for more buffers. */
>   	if (req_done)
>   		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
>   	spin_unlock_irqrestore(&vbq->lock, flags);
> +
> +	return req_done;
> +}
> +
> +static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct virtio_blk *vblk = hctx->queue->queuedata;
> +	struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
> +
> +	if (!virtqueue_more_used(vq))


I'm not familiar with block polling, but what happens if a buffer is made
available after virtqueue_more_used() returns false here?

Thanks


> +		return 0;
> +
> +	return virtblk_complete_requests(vq);
> +}
> +
> +static void virtblk_done(struct virtqueue *vq)
> +{
> +	virtblk_complete_requests(vq);
>   }
>   
>   static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
> @@ -275,6 +319,16 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>   	}
>   
>   	spin_lock_irqsave(&vbq->lock, flags);
> +
> +	/* Re-enable vq notifications if first req is non-polling */
> +	if (!(req->cmd_flags & REQ_HIPRI) &&
> +	    vbq->num_lopri == 0 && vbq->num_hipri == 0 &&
> +	    !vbq->cb_enabled) {
> +		/* Can't return false since there are no in-flight reqs */
> +		virtqueue_enable_cb(vbq->vq);
> +		vbq->cb_enabled = true;
> +	}
> +
>   	err = virtblk_add_req(vbq->vq, vbr, vbr->sg, num);
>   	if (err) {
>   		virtqueue_kick(vbq->vq);
> @@ -294,6 +348,21 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>   		}
>   	}
>   
> +	/*
> +	 * Disable vq notifications when polled reqs are submitted.
> +	 *
> +	 * The virtqueue lock is held so req is still valid here even if the
> +	 * device polls the virtqueue and completes the request before we call
> +	 * virtqueue_notify().
> +	 */
> +	if (req->cmd_flags & REQ_HIPRI) {
> +		if (vbq->num_hipri++ == 0 && vbq->cb_enabled) {
> +			virtqueue_disable_cb(vbq->vq);
> +			vbq->cb_enabled = false;
> +		}
> +	} else
> +		vbq->num_lopri++;
> +
>   	if (bd->last && virtqueue_kick_prepare(vbq->vq))
>   		notify = true;
>   	spin_unlock_irqrestore(&vbq->lock, flags);
> @@ -533,6 +602,9 @@ static int init_vq(struct virtio_blk *vblk)
>   	for (i = 0; i < num_vqs; i++) {
>   		spin_lock_init(&vblk->vqs[i].lock);
>   		vblk->vqs[i].vq = vqs[i];
> +		vblk->vqs[i].num_lopri = 0;
> +		vblk->vqs[i].num_hipri = 0;
> +		vblk->vqs[i].cb_enabled = true;
>   	}
>   	vblk->num_vqs = num_vqs;
>   
> @@ -681,8 +753,16 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
>   {
>   	struct virtio_blk *vblk = set->driver_data;
>   
> -	return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
> -					vblk->vdev, 0);
> +	set->map[HCTX_TYPE_DEFAULT].nr_queues = vblk->num_vqs;
> +	blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT], vblk->vdev, 0);
> +
> +	set->map[HCTX_TYPE_READ].nr_queues = 0;
> +
> +	/* HCTX_TYPE_DEFAULT queues are shared with HCTX_TYPE_POLL */
> +	set->map[HCTX_TYPE_POLL].nr_queues = vblk->num_vqs;
> +	blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_POLL], vblk->vdev, 0);
> +
> +	return 0;
>   }
>   
>   static const struct blk_mq_ops virtio_mq_ops = {
> @@ -691,6 +771,7 @@ static const struct blk_mq_ops virtio_mq_ops = {
>   	.complete	= virtblk_request_done,
>   	.init_request	= virtblk_init_request,
>   	.map_queues	= virtblk_map_queues,
> +	.poll		= virtblk_poll,
>   };
>   
>   static unsigned int virtblk_queue_depth;
> @@ -768,6 +849,7 @@ static int virtblk_probe(struct virtio_device *vdev)
>   
>   	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
>   	vblk->tag_set.ops = &virtio_mq_ops;
> +	vblk->tag_set.nr_maps = 3; /* default, read, and poll */
>   	vblk->tag_set.queue_depth = queue_depth;
>   	vblk->tag_set.numa_node = NUMA_NO_NODE;
>   	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-24 14:59   ` Christoph Hellwig
@ 2021-05-25  7:22     ` Paolo Bonzini
  2021-05-25  7:38       ` Ming Lei
  2021-05-25 13:19     ` Stefan Hajnoczi
  1 sibling, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2021-05-25  7:22 UTC (permalink / raw)
  To: Christoph Hellwig, Stefan Hajnoczi
  Cc: virtualization, linux-block, linux-kernel, Jason Wang,
	Jens Axboe, slp, sgarzare, Michael S. Tsirkin

On 24/05/21 16:59, Christoph Hellwig wrote:
> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
>> Possible drawbacks of this approach:
>>
>> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>>    expensive since it requires DMA. If such devices become popular then
>>    the virtio_blk driver could use a similar approach to NVMe when
>>    VIRTIO_F_ACCESS_PLATFORM is detected in the future.
>>
>> - If a blk_poll() thread is descheduled it not only hurts polling
>>    performance but also delays completion of non-REQ_HIPRI requests on
>>    that virtqueue since vq notifications are disabled.
> 
> Yes, I think this is a dangerous configuration.  What argument exists
> against just using dedicated poll queues?

There isn't an equivalent of the admin queue in virtio-blk, which would 
allow the guest to configure the desired number of poll queues.  The 
number of queues is fixed.

Could the blk_poll() thread use preempt notifiers to enable/disable 
callbacks, for example using two new .poll_start and .end_poll callbacks 
to struct blk_mq_ops?
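
As a rough sketch of what that could look like on the virtio_blk side
(hypothetical callbacks, not part of struct blk_mq_ops today; locking
elided):

	/* Called when a polling thread starts running on this hctx */
	static void virtblk_poll_start(struct blk_mq_hw_ctx *hctx)
	{
		struct virtio_blk *vblk = hctx->queue->queuedata;

		virtqueue_disable_cb(vblk->vqs[hctx->queue_num].vq);
	}

	/* Called when the polling thread is scheduled out or stops */
	static void virtblk_end_poll(struct blk_mq_hw_ctx *hctx)
	{
		struct virtio_blk *vblk = hctx->queue->queuedata;

		/* Re-enable notifications so non-polled requests complete */
		virtqueue_enable_cb(vblk->vqs[hctx->queue_num].vq);
	}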

Paolo



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-25  7:22     ` Paolo Bonzini
@ 2021-05-25  7:38       ` Ming Lei
  2021-05-25  8:06         ` Paolo Bonzini
  2021-05-25 13:20         ` Stefan Hajnoczi
  0 siblings, 2 replies; 21+ messages in thread
From: Ming Lei @ 2021-05-25  7:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Christoph Hellwig, Stefan Hajnoczi, virtualization, linux-block,
	linux-kernel, Jason Wang, Jens Axboe, slp, sgarzare,
	Michael S. Tsirkin

On Tue, May 25, 2021 at 09:22:48AM +0200, Paolo Bonzini wrote:
> On 24/05/21 16:59, Christoph Hellwig wrote:
> > On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > > Possible drawbacks of this approach:
> > > 
> > > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> > >    expensive since it requires DMA. If such devices become popular then
> > >    the virtio_blk driver could use a similar approach to NVMe when
> > >    VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > > 
> > > - If a blk_poll() thread is descheduled it not only hurts polling
> > >    performance but also delays completion of non-REQ_HIPRI requests on
> > >    that virtqueue since vq notifications are disabled.
> > 
> > Yes, I think this is a dangerous configuration.  What argument exists
> > against just using dedicated poll queues?
> 
> There isn't an equivalent of the admin queue in virtio-blk, which would
> allow the guest to configure the desired number of poll queues.  The number
> of queues is fixed.

Dedicated vqs can be used for polling only, and I understand the VM needn't
know whether the vq is polled or driven by IRQ.

I tried that in v5.4, but did not see an obvious IOPS boost, so I gave up.

https://github.com/ming1/linux/commits/my_v5.4-virtio-irq-poll


Thanks,
Ming



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-25  7:38       ` Ming Lei
@ 2021-05-25  8:06         ` Paolo Bonzini
  2021-05-25 13:20         ` Stefan Hajnoczi
  1 sibling, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2021-05-25  8:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Stefan Hajnoczi, virtualization, linux-block,
	linux-kernel, Jason Wang, Jens Axboe, slp, sgarzare,
	Michael S. Tsirkin

On 25/05/21 09:38, Ming Lei wrote:
> On Tue, May 25, 2021 at 09:22:48AM +0200, Paolo Bonzini wrote:
>> On 24/05/21 16:59, Christoph Hellwig wrote:
>>> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
>>>> Possible drawbacks of this approach:
>>>>
>>>> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>>>>     expensive since it requires DMA. If such devices become popular then
>>>>     the virtio_blk driver could use a similar approach to NVMe when
>>>>     VIRTIO_F_ACCESS_PLATFORM is detected in the future.
>>>>
>>>> - If a blk_poll() thread is descheduled it not only hurts polling
>>>>     performance but also delays completion of non-REQ_HIPRI requests on
>>>>     that virtqueue since vq notifications are disabled.
>>>
>>> Yes, I think this is a dangerous configuration.  What argument exists
>>> against just using dedicated poll queues?
>>
>> There isn't an equivalent of the admin queue in virtio-blk, which would
>> allow the guest to configure the desired number of poll queues.  The number
>> of queues is fixed.
> 
> Dedicated vqs can be used for polling only, and I understand the VM needn't
> know whether the vq is polled or driven by IRQ.
> 
> I tried that in v5.4, but did not see an obvious IOPS boost, so I gave up.
> 
> https://github.com/ming1/linux/commits/my_v5.4-virtio-irq-poll

Sure, but polling can be beneficial even for a single queue.  Queues
have a cost on the host side as well, so a 1 vCPU - 1 queue model may
not always be the best.

Paolo



* Re: [PATCH 1/3] virtio: add virtioqueue_more_used()
  2021-05-25  2:23   ` Jason Wang
@ 2021-05-25  8:48     ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-25  8:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On Tue, May 25, 2021 at 10:23:39AM +0800, Jason Wang wrote:
> 
> On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> > Add an API to check whether there are pending used buffers. There is
> > already a similar API called virtqueue_poll() but it only works together
> > with virtqueue_enable_cb_prepare(). The patches that follow add blk-mq
> > ->poll() support to virtio_blk and they need to check for used buffers
> > without re-enabling virtqueue callbacks, so introduce an API for it.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> 
> Typo in the subject.

Thanks, will fix.

> > +/**
> > + * virtqueue_more_used - check if there are used buffers pending
> > + * @_vq: the struct virtqueue we're talking about.
> > + *
> > + * Returns true if there are used buffers, false otherwise. May be called at
> > + * the same time as other virtqueue operations, but actually calling
> > + * virtqueue_get_buf() requires serialization so be mindful of the race between
> > + * calling virtqueue_more_used() and virtqueue_get_buf().
> > + */
> > +bool virtqueue_more_used(const struct virtqueue *_vq)
> > +{
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	return more_used(vq);
> > +}
> > +EXPORT_SYMBOL_GPL(virtqueue_more_used);
> 
> 
> It's worth mentioning that the function is not serialized (no barriers).

Thanks, will fix.

Stefan



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-25  3:21   ` Jason Wang
@ 2021-05-25  8:59     ` Stefan Hajnoczi
  2021-05-27  5:48       ` Jason Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-25  8:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On Tue, May 25, 2021 at 11:21:41AM +0800, Jason Wang wrote:
> 
> On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> > Request completion latency can be reduced by using polling instead of
> > irqs. Even Posted Interrupts or similar hardware support doesn't beat
> > polling. The reason is that disabling virtqueue notifications saves
> > critical-path CPU cycles on the host by skipping irq injection and in
> > the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> > support to virtio_blk.
> > 
> > The approach taken by this patch differs from the NVMe driver's
> > approach. NVMe dedicates hardware queues to polling and submits
> > REQ_HIPRI requests only on those queues. This patch does not require
> > exclusive polling queues for virtio_blk. Instead, it switches between
> > irqs and polling when one or more REQ_HIPRI requests are in flight on a
> > virtqueue.
> > 
> > This is possible because toggling virtqueue notifications is cheap even
> > while the virtqueue is running. NVMe cqs can't do this because irqs are
> > only enabled/disabled at queue creation time.
> > 
> > This toggling approach requires no configuration. There is no need to
> > dedicate queues ahead of time or to teach users and orchestration tools
> > how to set up polling queues.
> > 
> > Possible drawbacks of this approach:
> > 
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >    expensive since it requires DMA.
> 
> 
> Note that it's probably not related to the behavior of the driver but to
> the design of the event suppression mechanism.
> 
> Device can choose to ignore the suppression flag and keep sending
> interrupts.

Yes, it's the design of the event suppression mechanism.

If we use dedicated polling virtqueues then the hardware doesn't need to
check whether interrupts are enabled for each notification. However,
there's no mechanism to tell the device that virtqueue interrupts are
permanently disabled. This means that as of today, even dedicated
virtqueues cannot suppress interrupts without the device checking for
each notification.
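
For context, the flag the device has to check lives in guest memory as
part of the split ring layout (struct vring_avail from
include/uapi/linux/virtio_ring.h), which is why checking it from a
hardware device involves DMA:

	struct vring_avail {
		__virtio16 flags;	/* VRING_AVAIL_F_NO_INTERRUPT is set here */
		__virtio16 idx;
		__virtio16 ring[];
	};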

> >   If such devices become popular then
> >    the virtio_blk driver could use a similar approach to NVMe when
> >    VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > 
> > - If a blk_poll() thread is descheduled it not only hurts polling
> >    performance but also delays completion of non-REQ_HIPRI requests on
> >    that virtqueue since vq notifications are disabled.
> 
> 
> Can we poll only when only high-pri requests are pending?

Yes, that's what this patch does.

> If the backend is a remote one, I think the polling may cost more CPU
> cycles.

Right, but polling is only done when userspace sets the RWF_HIPRI
request flag. Most applications don't support it and for those that do
it's probably an option that the user needs to enable explicitly.
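
For illustration, a minimal polled read from userspace looks like this
(error handling omitted; the device path is a placeholder, and polled I/O
generally requires O_DIRECT with suitably aligned buffers):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/uio.h>

	int main(void)
	{
		struct iovec iov;
		int fd = open("/dev/vda", O_RDONLY | O_DIRECT);

		/* O_DIRECT needs aligned memory */
		posix_memalign(&iov.iov_base, 4096, 4096);
		iov.iov_len = 4096;

		/* RWF_HIPRI asks the kernel to poll for completion */
		return preadv2(fd, &iov, 1, 0, RWF_HIPRI) < 0;
	}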

Stefan

> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index fc0fb1dcd399..f0243dcd745a 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
> >   struct virtio_blk_vq {
> >   	struct virtqueue *vq;
> >   	spinlock_t lock;
> > +
> > +	/* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
> > +	unsigned int num_lopri;
> > +
> > +	/* Number of REQ_HIPRI requests in flight. Protected by lock. */
> > +	unsigned int num_hipri;
> > +
> > +	/* Are vq notifications enabled? Protected by lock. */
> > +	bool cb_enabled;
> 
> 
> We had event_flag_shadow. Is it sufficient to introduce a new helper
> virtqueue_cb_is_enabled()?

Yes, I'll try that in the next revision.

> > +
> >   	char name[VQ_NAME_LEN];
> >   } ____cacheline_aligned_in_smp;
> > @@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct request *req)
> >   	blk_mq_end_request(req, virtblk_result(vbr));
> >   }
> > -static void virtblk_done(struct virtqueue *vq)
> > +/* Returns true if one or more requests completed */
> > +static bool virtblk_complete_requests(struct virtqueue *vq)
> >   {
> >   	struct virtio_blk *vblk = vq->vdev->priv;
> >   	struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
> >   	bool req_done = false;
> > +	bool last_hipri_done = false;
> >   	struct virtblk_req *vbr;
> >   	unsigned long flags;
> >   	unsigned int len;
> >   	spin_lock_irqsave(&vbq->lock, flags);
> > +
> >   	do {
> > -		virtqueue_disable_cb(vq);
> > +		if (vbq->cb_enabled)
> > +			virtqueue_disable_cb(vq);
> >   		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
> >   			struct request *req = blk_mq_rq_from_pdu(vbr);
> > +			if (req->cmd_flags & REQ_HIPRI) {
> > +				if (--vbq->num_hipri == 0)
> > +					last_hipri_done = true;
> > +			} else
> > +				vbq->num_lopri--;
> > +
> >   			if (likely(!blk_should_fake_timeout(req->q)))
> >   				blk_mq_complete_request(req);
> >   			req_done = true;
> >   		}
> >   		if (unlikely(virtqueue_is_broken(vq)))
> >   			break;
> > -	} while (!virtqueue_enable_cb(vq));
> > +
> > +		/* Enable vq notifications if non-polled requests remain */
> > +		if (last_hipri_done && vbq->num_lopri > 0) {
> > +			last_hipri_done = false;
> > +			vbq->cb_enabled = true;
> > +		}
> > +	} while (vbq->cb_enabled && !virtqueue_enable_cb(vq));
> >   	/* In case queue is stopped waiting for more buffers. */
> >   	if (req_done)
> >   		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
> >   	spin_unlock_irqrestore(&vbq->lock, flags);
> > +
> > +	return req_done;
> > +}
> > +
> > +static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct virtio_blk *vblk = hctx->queue->queuedata;
> > +	struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
> > +
> > +	if (!virtqueue_more_used(vq))
> 
> 
> I'm not familiar with block polling, but what happens if a buffer is made
> available after virtqueue_more_used() returns false here?

Can you explain the scenario? I'm not sure I understand. "buffer is made
available" -> are you thinking about additional requests being submitted
by the driver, or an in-flight request being marked used by the device?

Stefan



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-24 14:59   ` Christoph Hellwig
  2021-05-25  7:22     ` Paolo Bonzini
@ 2021-05-25 13:19     ` Stefan Hajnoczi
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-25 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: virtualization, linux-block, linux-kernel, Jason Wang,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On Mon, May 24, 2021 at 04:59:28PM +0200, Christoph Hellwig wrote:
> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > Possible drawbacks of this approach:
> > 
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >   expensive since it requires DMA. If such devices become popular then
> >   the virtio_blk driver could use a similar approach to NVMe when
> >   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > 
> > - If a blk_poll() thread is descheduled it not only hurts polling
> >   performance but also delays completion of non-REQ_HIPRI requests on
> >   that virtqueue since vq notifications are disabled.
> 
> Yes, I think this is a dangerous configuration.  What argument exists
> against just using dedicated poll queues?

Aside from what Paolo described (the lack of a hardware interface to
designate polling queues), the poll_queues=N parameter needs to be added
to the full guest and host software stack. Users also need to learn
about this so they can enable it in all the right places. This is much
more hassle for users to configure. The dynamic polling mode approach
requires no configuration.

Paolo's suggestion is to notify the driver when irqs need to be
re-enabled if the polling thread is descheduled. I actually have a
prototype that achieves something similar here:
https://gitlab.com/stefanha/linux/-/commits/cpuidle-haltpoll-virtqueue

It's a different approach from the current patch series: the cpuidle
driver provides poll_start/stop() callbacks and virtio_blk participates
in cpuidle-haltpoll. That means all virtio-blk devices temporarily use
polling mode while the vCPU is doing cpuidle-haltpoll polling. The neat
thing about the cpuidle approach is:
1. Applications don't need to set the RWF_HIPRI flag!
2. Other drivers besides virtio_blk can participate in polling too.

Maybe we should go with cpuidle polling instead?

Stefan



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-25  7:38       ` Ming Lei
  2021-05-25  8:06         ` Paolo Bonzini
@ 2021-05-25 13:20         ` Stefan Hajnoczi
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-05-25 13:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Paolo Bonzini, Christoph Hellwig, virtualization, linux-block,
	linux-kernel, Jason Wang, Jens Axboe, slp, sgarzare,
	Michael S. Tsirkin


On Tue, May 25, 2021 at 03:38:42PM +0800, Ming Lei wrote:
> On Tue, May 25, 2021 at 09:22:48AM +0200, Paolo Bonzini wrote:
> > On 24/05/21 16:59, Christoph Hellwig wrote:
> > > On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > > > Possible drawbacks of this approach:
> > > > 
> > > > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> > > >    expensive since it requires DMA. If such devices become popular then
> > > >    the virtio_blk driver could use a similar approach to NVMe when
> > > >    VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > > > 
> > > > - If a blk_poll() thread is descheduled it not only hurts polling
> > > >    performance but also delays completion of non-REQ_HIPRI requests on
> > > >    that virtqueue since vq notifications are disabled.
> > > 
> > > Yes, I think this is a dangerous configuration.  What argument exists
> > > against just using dedicated poll queues?
> > 
> > There isn't an equivalent of the admin queue in virtio-blk, which would
> > allow the guest to configure the desired number of poll queues.  The number
> > of queues is fixed.
> 
> Dedicated vqs can be used for polling only, and I understand the VM needn't
> know whether the vq is polled or driven by IRQ.
> 
> I tried that in v5.4, but did not see an obvious IOPS boost, so I gave up.
> 
> https://github.com/ming1/linux/commits/my_v5.4-virtio-irq-poll

Hey, that's cool. I see a lot of similarity between our patches :).

Stefan



* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-20 14:13 ` [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll() Stefan Hajnoczi
  2021-05-24 14:59   ` Christoph Hellwig
  2021-05-25  3:21   ` Jason Wang
@ 2021-05-27  2:44   ` Ming Lei
  2021-06-03 15:12     ` Stefan Hajnoczi
  2 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2021-05-27  2:44 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Jason Wang, Paolo Bonzini, Jens Axboe, slp, sgarzare,
	Michael S. Tsirkin

On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> Request completion latency can be reduced by using polling instead of
> irqs. Even Posted Interrupts or similar hardware support doesn't beat
> polling. The reason is that disabling virtqueue notifications saves
> critical-path CPU cycles on the host by skipping irq injection and in
> the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> support to virtio_blk.
> 
> The approach taken by this patch differs from the NVMe driver's
> approach. NVMe dedicates hardware queues to polling and submits
> REQ_HIPRI requests only on those queues. This patch does not require
> exclusive polling queues for virtio_blk. Instead, it switches between
> irqs and polling when one or more REQ_HIPRI requests are in flight on a
> virtqueue.
> 
> This is possible because toggling virtqueue notifications is cheap even
> while the virtqueue is running. NVMe cqs can't do this because irqs are
> only enabled/disabled at queue creation time.
> 
> This toggling approach requires no configuration. There is no need to
> dedicate queues ahead of time or to teach users and orchestration tools
> how to set up polling queues.

This approach looks good, and it is very neat thanks to the per-vq lock.

BTW, are any vmexits saved by disabling the vq interrupt? I understand
there aren't, since a vmexit may only be involved in remote completion
via sending an IPI.

> 
> Possible drawbacks of this approach:
> 
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>   expensive since it requires DMA. If such devices become popular then

You mean the hardware needs to consider the ordering between DMA completion
and the interrupt notification? But here notifications are disabled: the
guest just calls virtqueue_get_buf() to see whether a buffer is available
and, if not, polls again.

>   the virtio_blk driver could use a similar approach to NVMe when
>   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> 
> - If a blk_poll() thread is descheduled it not only hurts polling
>   performance but also delays completion of non-REQ_HIPRI requests on
>   that virtqueue since vq notifications are disabled.
> 
> Performance:
> 
> - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)

4 jobs can consume all 4 vCPUs. I just ran a quick fio test with
'ioengine=io_uring --numjobs=1' on a single vq, and IOPS improves by
~20% (hipri=1 vs hipri=0) with the 3 patches; the virtio-blk device is
still backed by an NVMe SSD.
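
A rough reconstruction of such a test (the device path, block size, and
queue depth are illustrative assumptions, not the exact settings used):

  fio --name=hipri-test --filename=/dev/vda --direct=1 --rw=randread \
      --bs=4k --ioengine=io_uring --hipri=1 --numjobs=1 --iodepth=32 \
      --runtime=30 --time_based

Run it once with --hipri=1 and once with --hipri=0 to compare.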


Thanks, 
Ming


* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-25  8:59     ` Stefan Hajnoczi
@ 2021-05-27  5:48       ` Jason Wang
  2021-06-03 15:24         ` Stefan Hajnoczi
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Wang @ 2021-05-27  5:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin


On 2021/5/25 4:59 PM, Stefan Hajnoczi wrote:
> On Tue, May 25, 2021 at 11:21:41AM +0800, Jason Wang wrote:
>> On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
>>> Request completion latency can be reduced by using polling instead of
>>> irqs. Even Posted Interrupts or similar hardware support doesn't beat
>>> polling. The reason is that disabling virtqueue notifications saves
>>> critical-path CPU cycles on the host by skipping irq injection and in
>>> the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
>>> support to virtio_blk.
>>>
>>> The approach taken by this patch differs from the NVMe driver's
>>> approach. NVMe dedicates hardware queues to polling and submits
>>> REQ_HIPRI requests only on those queues. This patch does not require
>>> exclusive polling queues for virtio_blk. Instead, it switches between
>>> irqs and polling when one or more REQ_HIPRI requests are in flight on a
>>> virtqueue.
>>>
>>> This is possible because toggling virtqueue notifications is cheap even
>>> while the virtqueue is running. NVMe cqs can't do this because irqs are
>>> only enabled/disabled at queue creation time.
>>>
>>> This toggling approach requires no configuration. There is no need to
>>> dedicate queues ahead of time or to teach users and orchestration tools
>>> how to set up polling queues.
>>>
>>> Possible drawbacks of this approach:
>>>
>>> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>>>     expensive since it requires DMA.
>>
>> Note that it's probably not related to the behavior of the driver but the
>> design of the event suppression mechanism.
>>
>> Device can choose to ignore the suppression flag and keep sending
>> interrupts.
> Yes, it's the design of the event suppression mechanism.
>
> If we use dedicated polling virtqueues then the hardware doesn't need to
> check whether interrupts are enabled for each notification. However,
> there's no mechanism to tell the device that virtqueue interrupts are
> permanently disabled. This means that as of today, even dedicated
> virtqueues cannot suppress interrupts without the device checking for
> each notification.


This can be detected in a transport-specific way.

E.g. in the case of MSI, VIRTIO_MSI_NO_VECTOR could be a hint.
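
A hedged sketch of that hint on virtio-pci (the register names follow the
modern common configuration layout; treating this as a permanent "never
interrupt" signal would be a device-side convention, not something the
spec mandates today):

  /* Driver side: a polled-only vq is given no MSI-X vector at all. */
  vp_iowrite16(vq->index, &cfg->queue_select);
  vp_iowrite16(VIRTIO_MSI_NO_VECTOR, &cfg->queue_msix_vector);

  /* Device side: a vq whose msix_vector is VIRTIO_MSI_NO_VECTOR could
   * skip the per-completion event suppression check entirely.
   */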


>
>>>    If such devices become popular then
>>>     the virtio_blk driver could use a similar approach to NVMe when
>>>     VIRTIO_F_ACCESS_PLATFORM is detected in the future.
>>>
>>> - If a blk_poll() thread is descheduled it not only hurts polling
>>>     performance but also delays completion of non-REQ_HIPRI requests on
>>>     that virtqueue since vq notifications are disabled.
>>
>> Can we poll only when high-pri requests alone are pending?
> Yes, that's what this patch does.
>
>> If the backend is a remote one, I think polling may consume more CPU
>> cycles.
> Right, but polling is only done when userspace sets the RWF_HIPRI
> request flag. Most applications don't support it and for those that do
> it's probably an option that the user needs to enable explicitly.


I see.
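
For reference, a minimal sketch of how an application opts in from
userspace (the device path and sizes are placeholders; O_DIRECT
alignment rules apply):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <sys/uio.h>

  int main(void)
  {
      void *buf;
      struct iovec iov;
      int fd = open("/dev/vda", O_RDONLY | O_DIRECT);

      if (fd < 0 || posix_memalign(&buf, 4096, 4096))
          return 1;

      iov.iov_base = buf;
      iov.iov_len = 4096;

      /* RWF_HIPRI asks the kernel to busy-poll for completion
       * (blk_poll) instead of sleeping until the irq arrives.
       */
      return preadv2(fd, &iov, 1, 0, RWF_HIPRI) == 4096 ? 0 : 1;
  }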


>
> Stefan
>
>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>>> index fc0fb1dcd399..f0243dcd745a 100644
>>> --- a/drivers/block/virtio_blk.c
>>> +++ b/drivers/block/virtio_blk.c
>>> @@ -29,6 +29,16 @@ static struct workqueue_struct *virtblk_wq;
>>>    struct virtio_blk_vq {
>>>    	struct virtqueue *vq;
>>>    	spinlock_t lock;
>>> +
>>> +	/* Number of non-REQ_HIPRI requests in flight. Protected by lock. */
>>> +	unsigned int num_lopri;
>>> +
>>> +	/* Number of REQ_HIPRI requests in flight. Protected by lock. */
>>> +	unsigned int num_hipri;
>>> +
>>> +	/* Are vq notifications enabled? Protected by lock. */
>>> +	bool cb_enabled;
>>
>> We had event_flag_shadow, is it sufficient to introduce a new helper
>> virtqueue_cb_is_enabled()?
> Yes, I'll try that in the next revision.
>
>>> +
>>>    	char name[VQ_NAME_LEN];
>>>    } ____cacheline_aligned_in_smp;
>>> @@ -171,33 +181,67 @@ static inline void virtblk_request_done(struct request *req)
>>>    	blk_mq_end_request(req, virtblk_result(vbr));
>>>    }
>>> -static void virtblk_done(struct virtqueue *vq)
>>> +/* Returns true if one or more requests completed */
>>> +static bool virtblk_complete_requests(struct virtqueue *vq)
>>>    {
>>>    	struct virtio_blk *vblk = vq->vdev->priv;
>>>    	struct virtio_blk_vq *vbq = &vblk->vqs[vq->index];
>>>    	bool req_done = false;
>>> +	bool last_hipri_done = false;
>>>    	struct virtblk_req *vbr;
>>>    	unsigned long flags;
>>>    	unsigned int len;
>>>    	spin_lock_irqsave(&vbq->lock, flags);
>>> +
>>>    	do {
>>> -		virtqueue_disable_cb(vq);
>>> +		if (vbq->cb_enabled)
>>> +			virtqueue_disable_cb(vq);
>>>    		while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
>>>    			struct request *req = blk_mq_rq_from_pdu(vbr);
>>> +			if (req->cmd_flags & REQ_HIPRI) {
>>> +				if (--vbq->num_hipri == 0)
>>> +					last_hipri_done = true;
>>> +			} else
>>> +				vbq->num_lopri--;
>>> +
>>>    			if (likely(!blk_should_fake_timeout(req->q)))
>>>    				blk_mq_complete_request(req);
>>>    			req_done = true;
>>>    		}
>>>    		if (unlikely(virtqueue_is_broken(vq)))
>>>    			break;
>>> -	} while (!virtqueue_enable_cb(vq));
>>> +
>>> +		/* Enable vq notifications if non-polled requests remain */
>>> +		if (last_hipri_done && vbq->num_lopri > 0) {
>>> +			last_hipri_done = false;
>>> +			vbq->cb_enabled = true;
>>> +		}
>>> +	} while (vbq->cb_enabled && !virtqueue_enable_cb(vq));
>>>    	/* In case queue is stopped waiting for more buffers. */
>>>    	if (req_done)
>>>    		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
>>>    	spin_unlock_irqrestore(&vbq->lock, flags);
>>> +
>>> +	return req_done;
>>> +}
>>> +
>>> +static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
>>> +{
>>> +	struct virtio_blk *vblk = hctx->queue->queuedata;
>>> +	struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
>>> +
>>> +	if (!virtqueue_more_used(vq))
>>
>> I'm not familiar with block polling but what happens if a buffer is made
>> available after virtqueue_more_used() returns false here?
> Can you explain the scenario? I'm not sure I understand: by "buffer is made
> available", do you mean additional requests being submitted by the driver
> or an in-flight request being marked used by the device?


Something like that:

1) requests are submitted
2) poll but virtqueue_more_used() return false
3) device make buffer used

In this case, will poll() be triggered again by somebody else? (I think
the interrupt is disabled here.)

Thanks



>
> Stefan


* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-27  2:44   ` Ming Lei
@ 2021-06-03 15:12     ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 15:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Jason Wang, Paolo Bonzini, Jens Axboe, slp, sgarzare,
	Michael S. Tsirkin

On Thu, May 27, 2021 at 10:44:51AM +0800, Ming Lei wrote:
> On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > Request completion latency can be reduced by using polling instead of
> > irqs. Even Posted Interrupts or similar hardware support doesn't beat
> > polling. The reason is that disabling virtqueue notifications saves
> > critical-path CPU cycles on the host by skipping irq injection and in
> > the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> > support to virtio_blk.
> > 
> > The approach taken by this patch differs from the NVMe driver's
> > approach. NVMe dedicates hardware queues to polling and submits
> > REQ_HIPRI requests only on those queues. This patch does not require
> > exclusive polling queues for virtio_blk. Instead, it switches between
> > irqs and polling when one or more REQ_HIPRI requests are in flight on a
> > virtqueue.
> > 
> > This is possible because toggling virtqueue notifications is cheap even
> > while the virtqueue is running. NVMe cqs can't do this because irqs are
> > only enabled/disabled at queue creation time.
> > 
> > This toggling approach requires no configuration. There is no need to
> > dedicate queues ahead of time or to teach users and orchestration tools
> > how to set up polling queues.
> 
> This approach looks good, and it is very neat thanks to the per-vq lock.
> 
> BTW, are any vmexits saved by disabling the vq interrupt? I understand
> there aren't, since a vmexit may only be involved in remote completion
> via sending an IPI.

This patch doesn't eliminate vmexits. QEMU already has virtqueue polling
code that disables the vq notification (the virtio-pci hardware register
write that causes a vmexit).

However, when both the guest driver and the emulated device are polling,
there are no vmexits or interrupt injections with this patch.

> > 
> > Possible drawbacks of this approach:
> > 
> > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> >   expensive since it requires DMA. If such devices become popular then
> 
> You mean the hardware needs to consider the ordering between DMA completion
> and the interrupt notification? But here notifications are disabled: the
> guest just calls virtqueue_get_buf() to see whether a buffer is available
> and, if not, polls again.

Software devices have cheap access to guest RAM for looking at the
virtqueue_disable_cb() state before injecting an irq. Hardware devices
need to perform a DMA transaction to read that state. They have to do
this every time they want to raise an irq because the guest driver may
have changed the value.

I'm not sure whether the DMA overhead is acceptable. This problem is not
introduced by this patch; it's a VIRTIO spec design issue.

I was trying to express that dedicated polling queues would avoid the
DMA since the device knows that irqs are never needed for this virtqueue.
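
For context, a simplified sketch of the per-completion check implied by
VIRTIO_F_EVENT_IDX (the helpers are from include/uapi/linux/virtio_ring.h;
inject_irq() is a hypothetical stand-in for the device's interrupt path):

  /* Device side: decide whether to raise an irq after publishing used
   * entries (old_used_idx/new_used_idx are the used index before and
   * after). used_event lives in guest memory at avail->ring[num], so a
   * hardware device must DMA-read it on every completion batch.
   */
  u16 used_event = vring_used_event(&vring);
  if (vring_need_event(used_event, new_used_idx, old_used_idx))
      inject_irq(); /* hypothetical stand-in for the irq path */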

> 
> >   the virtio_blk driver could use a similar approach to NVMe when
> >   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > 
> > - If a blk_poll() thread is descheduled it not only hurts polling
> >   performance but also delays completion of non-REQ_HIPRI requests on
> >   that virtqueue since vq notifications are disabled.
> > 
> > Performance:
> > 
> > - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> > - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
> 
> 4 jobs can consume all 4 vCPUs. I just ran a quick fio test with
> 'ioengine=io_uring --numjobs=1' on a single vq, and IOPS improves by
> ~20% (hipri=1 vs hipri=0) with the 3 patches; the virtio-blk device is
> still backed by an NVMe SSD.

Nice, thank you for sharing the data!

Stefan

* Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()
  2021-05-27  5:48       ` Jason Wang
@ 2021-06-03 15:24         ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 15:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtualization, linux-block, linux-kernel, Christoph Hellwig,
	Paolo Bonzini, Jens Axboe, slp, sgarzare, Michael S. Tsirkin

On Thu, May 27, 2021 at 01:48:36PM +0800, Jason Wang wrote:
> 
> On 2021/5/25 4:59 PM, Stefan Hajnoczi wrote:
> > On Tue, May 25, 2021 at 11:21:41AM +0800, Jason Wang wrote:
> > > On 2021/5/20 10:13 PM, Stefan Hajnoczi wrote:
> > > > Request completion latency can be reduced by using polling instead of
> > > > irqs. Even Posted Interrupts or similar hardware support doesn't beat
> > > > polling. The reason is that disabling virtqueue notifications saves
> > > > critical-path CPU cycles on the host by skipping irq injection and in
> > > > the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> > > > support to virtio_blk.
> > > > 
> > > > The approach taken by this patch differs from the NVMe driver's
> > > > approach. NVMe dedicates hardware queues to polling and submits
> > > > REQ_HIPRI requests only on those queues. This patch does not require
> > > > exclusive polling queues for virtio_blk. Instead, it switches between
> > > > irqs and polling when one or more REQ_HIPRI requests are in flight on a
> > > > virtqueue.
> > > > 
> > > > This is possible because toggling virtqueue notifications is cheap even
> > > > while the virtqueue is running. NVMe cqs can't do this because irqs are
> > > > only enabled/disabled at queue creation time.
> > > > 
> > > > This toggling approach requires no configuration. There is no need to
> > > > dedicate queues ahead of time or to teach users and orchestration tools
> > > > how to set up polling queues.
> > > > 
> > > > Possible drawbacks of this approach:
> > > > 
> > > > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> > > >     expensive since it requires DMA.
> > > 
> > > Note that it's probably not related to the behavior of the driver but the
> > > design of the event suppression mechanism.
> > > 
> > > Device can choose to ignore the suppression flag and keep sending
> > > interrupts.
> > Yes, it's the design of the event suppression mechanism.
> > 
> > If we use dedicated polling virtqueues then the hardware doesn't need to
> > check whether interrupts are enabled for each notification. However,
> > there's no mechanism to tell the device that virtqueue interrupts are
> > permanently disabled. This means that as of today, even dedicated
> > virtqueues cannot suppress interrupts without the device checking for
> > each notification.
> 
> 
> This can be detected in a transport-specific way.
> 
> E.g. in the case of MSI, VIRTIO_MSI_NO_VECTOR could be a hint.

Nice idea :). Then there would be no need for changes to the hardware
interface. IRQ-less virtqueues could still be mentioned explicitly in
the VIRTIO spec so that driver/device authors are aware of the
VIRTIO_MSI_NO_VECTOR trick.

> > > > +static int virtblk_poll(struct blk_mq_hw_ctx *hctx)
> > > > +{
> > > > +	struct virtio_blk *vblk = hctx->queue->queuedata;
> > > > +	struct virtqueue *vq = vblk->vqs[hctx->queue_num].vq;
> > > > +
> > > > +	if (!virtqueue_more_used(vq))
> > > 
> > > I'm not familiar with block polling but what happens if a buffer is made
> > > available after virtqueue_more_used() returns false here?
> > Can you explain the scenario? I'm not sure I understand: by "buffer is made
> > available", do you mean additional requests being submitted by the driver
> > or an in-flight request being marked used by the device?
> 
> 
> Something like that:
> 
> 1) requests are submitted
> 2) poll but virtqueue_more_used() return false
> 3) device make buffer used
> 
> In this case, will poll() be triggered again by somebody else? (I think
> the interrupt is disabled here.)

Yes. An example blk_poll() user is
fs/block_dev.c:__blkdev_direct_IO_simple():

  qc = submit_bio(&bio);
  for (;;) {
      set_current_state(TASK_UNINTERRUPTIBLE);
      if (!READ_ONCE(bio.bi_private))
          break;
      if (!(iocb->ki_flags & IOCB_HIPRI) ||
          !blk_poll(bdev_get_queue(bdev), qc, true))
          blk_io_schedule();
  }

That's the infinite loop. The block layer implements the generic portion
of blk_poll(). blk_poll() calls mq_ops->poll() (virtblk_poll()).

So in general the polling loop will keep iterating, but there are
exceptions:
1. need_resched() causes blk_poll() to return 0 and blk_io_schedule()
   will be called.
2. blk-mq has a fancier io_poll algorithm that estimates I/O time and
   sleeps until the expected completion time to save CPU cycles. I
   haven't looked at this one in detail.

Both of these cases affect existing mq_ops->poll() implementations (e.g.
NVMe). What's new in this patch series is that virtio-blk can have
non-polling requests on a virtqueue whose irqs are now disabled, so those
requests could be left waiting.

I think there's an easy solution for this: don't disable virtqueue irqs
while non-REQ_HIPRI requests are in flight. The disadvantage is that
we'll keep irqs enabled in more situations, so the performance
improvement may not apply in some configurations.
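
A minimal sketch of that guard on the submission path, reusing the vbq
counters from this patch (the exact placement and the handling of
virtqueue_enable_cb()'s return value are simplifications):

  spin_lock_irqsave(&vbq->lock, flags);
  if (req->cmd_flags & REQ_HIPRI) {
      vbq->num_hipri++;
      /* Only suppress notifications once no non-polled requests
       * remain in flight on this virtqueue.
       */
      if (vbq->cb_enabled && vbq->num_lopri == 0) {
          virtqueue_disable_cb(vbq->vq);
          vbq->cb_enabled = false;
      }
  } else {
      vbq->num_lopri++;
      /* A non-polled request must not depend on the polling thread,
       * so make sure notifications are back on.
       */
      if (!vbq->cb_enabled) {
          virtqueue_enable_cb(vbq->vq);
          vbq->cb_enabled = true;
      }
  }
  spin_unlock_irqrestore(&vbq->lock, flags);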

Stefan

* Re: [PATCH 0/3] virtio_blk: blk-mq io_poll support
  2021-05-20 14:13 [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2021-05-20 14:13 ` [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll() Stefan Hajnoczi
@ 2021-06-03 15:30 ` Stefan Hajnoczi
  2021-06-16  7:43   ` Christoph Hellwig
  3 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-06-03 15:30 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: linux-block, linux-kernel, virtualization, Jason Wang,
	Paolo Bonzini, slp, sgarzare, Michael S. Tsirkin

On Thu, May 20, 2021 at 03:13:02PM +0100, Stefan Hajnoczi wrote:
> This patch series implements blk_mq_ops->poll() so REQ_HIPRI requests can be
> polled. IOPS for 4k and 16k block sizes increases by 5-18% on a virtio-blk
> device with 4 virtqueues backed by an NVMe drive.
> 
> - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)
> - Disk: Intel Corporation NVMe Datacenter SSD [Optane] [8086:2701]
> - CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
> 
> rw          bs hipri=0 hipri=1
> ------------------------------
> randread    4k 149,426 170,763 +14%
> randread   16k 118,939 134,269 +12%
> randread   64k  34,886  34,906   0%
> randread  128k  17,655  17,667   0%
> randwrite   4k 138,578 163,600 +18%
> randwrite  16k 102,089 120,950 +18%
> randwrite  64k  32,364  32,561   0%
> randwrite 128k  16,154  16,237   0%
> read        4k 146,032 170,620 +16%
> read       16k 117,097 130,437 +11%
> read       64k  34,834  35,037   0%
> read      128k  17,680  17,658   0%
> write       4k 134,562 151,422 +12%
> write      16k 101,796 107,606  +5%
> write      64k  32,364  32,594   0%
> write     128k  16,259  16,265   0%
> 
> Larger block sizes do not benefit from polling as much but the
> improvement is worthwhile for smaller block sizes.
> 
> Stefan Hajnoczi (3):
>   virtio: add virtioqueue_more_used()
>   virtio_blk: avoid repeating vblk->vqs[qid]
>   virtio_blk: implement blk_mq_ops->poll()
> 
>  include/linux/virtio.h       |   2 +
>  drivers/block/virtio_blk.c   | 126 +++++++++++++++++++++++++++++------
>  drivers/virtio/virtio_ring.c |  17 +++++
>  3 files changed, 123 insertions(+), 22 deletions(-)

Christoph and Jens: Any more thoughts on this irq toggling approach?

Stefan

* Re: [PATCH 0/3] virtio_blk: blk-mq io_poll support
  2021-06-03 15:30 ` [PATCH 0/3] virtio_blk: blk-mq io_poll support Stefan Hajnoczi
@ 2021-06-16  7:43   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-06-16  7:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	virtualization, Jason Wang, Paolo Bonzini, slp, sgarzare,
	Michael S. Tsirkin

On Thu, Jun 03, 2021 at 04:30:25PM +0100, Stefan Hajnoczi wrote:
> Christoph and Jens: Any more thoughts on this irq toggling approach?

I think it would eventually come back and bite us, and I would much prefer
explicit poll queues.
