linux-nvme.lists.infradead.org archive mirror
* [RFC 0/3] nvme uring passthrough diet
@ 2023-05-01 15:33 ` Keith Busch
  2023-05-01 15:33   ` [RFC 1/3] nvme: skip block cgroups for passthrough commands Keith Busch
                     ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Keith Busch @ 2023-05-01 15:33 UTC (permalink / raw)
  To: linux-nvme, linux-block, joshi.k, axboe, hch, xiaoguang.wang; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

When you disable all the optional features in your kernel config and
request queue, it looks like the normal request dispatching is just as
fast as any attempts to bypass it. So let's do that instead of
reinventing everything.

This doesn't require additional queues or user setup. It continues to
work with multiple threads and processes, and relies on the well tested
queueing mechanisms that track timeouts, handle tag exhaustion, and sync
with controller state needed for reset control, hotplug events, and
other error handling.

Marked RFC right now because I haven't tested out multipath or teardown.

Keith Busch (3):
  nvme: skip block cgroups for passthrough commands
  nvme: fix cdev name leak
  nvme: create special request queue for cdev

 drivers/nvme/host/core.c  | 39 ++++++++++++++++++++++++++++++++++-----
 drivers/nvme/host/ioctl.c |  6 +++---
 drivers/nvme/host/nvme.h  |  1 +
 include/linux/bio.h       |  7 ++++++-
 4 files changed, 44 insertions(+), 9 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC 1/3] nvme: skip block cgroups for passthrough commands
  2023-05-01 15:33 ` [RFC 0/3] nvme uring passthrough diet Keith Busch
@ 2023-05-01 15:33   ` Keith Busch
  2023-05-03  5:04     ` Christoph Hellwig
  2023-05-01 15:33   ` [RFC 2/3] nvme: fix cdev name leak Keith Busch
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2023-05-01 15:33 UTC (permalink / raw)
  To: linux-nvme, linux-block, joshi.k, axboe, hch, xiaoguang.wang; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

Passthrough requests don't go through the submit_bio() path, so all the
overhead of setting up the bio's cgroup is wasted cycles. Provide a path
to skip this setup.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/ioctl.c | 2 +-
 include/linux/bio.h       | 7 ++++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d24ea2e051564..3804e5205b42b 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -192,7 +192,7 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		goto out;
 	bio = req->bio;
 	if (bdev)
-		bio_set_dev(bio, bdev);
+		__bio_set_dev(bio, bdev);
 
 	if (bdev && meta_buffer && meta_len) {
 		meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d766be7152e15..bf003412653dc 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -509,12 +509,17 @@ static inline void bio_clone_blkg_association(struct bio *dst,
 					      struct bio *src) { }
 #endif	/* CONFIG_BLK_CGROUP */
 
-static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
+static inline void __bio_set_dev(struct bio *bio, struct block_device *bdev)
 {
 	bio_clear_flag(bio, BIO_REMAPPED);
 	if (bio->bi_bdev != bdev)
 		bio_clear_flag(bio, BIO_BPS_THROTTLED);
 	bio->bi_bdev = bdev;
+}
+
+static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
+{
+	__bio_set_dev(bio, bdev);
 	bio_associate_blkg(bio);
 }
 
-- 
2.34.1




* [RFC 2/3] nvme: fix cdev name leak
  2023-05-01 15:33 ` [RFC 0/3] nvme uring passthrough diet Keith Busch
  2023-05-01 15:33   ` [RFC 1/3] nvme: skip block cgroups for passthrough commands Keith Busch
@ 2023-05-01 15:33   ` Keith Busch
  2023-05-01 15:33   ` [RFC 3/3] nvme: create special request queue for cdev Keith Busch
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2023-05-01 15:33 UTC (permalink / raw)
  To: linux-nvme, linux-block, joshi.k, axboe, hch, xiaoguang.wang; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

If we fail to add the device, free the name allocated for it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1bfd52eae2eeb..0f1cb6f418182 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4079,8 +4079,16 @@ static int nvme_add_ns_cdev(struct nvme_ns *ns)
 	if (ret)
 		return ret;
 
-	return nvme_cdev_add(&ns->cdev, &ns->cdev_device, &nvme_ns_chr_fops,
+	ret = nvme_cdev_add(&ns->cdev, &ns->cdev_device, &nvme_ns_chr_fops,
 			     ns->ctrl->ops->module);
+	if (ret)
+		goto out_free_name;
+
+	return 0;
+
+out_free_name:
+	kfree_const(ns->cdev_device.kobj.name);
+	return ret;
 }
 
 static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
-- 
2.34.1




* [RFC 3/3] nvme: create special request queue for cdev
  2023-05-01 15:33 ` [RFC 0/3] nvme uring passthrough diet Keith Busch
  2023-05-01 15:33   ` [RFC 1/3] nvme: skip block cgroups for passthrough commands Keith Busch
  2023-05-01 15:33   ` [RFC 2/3] nvme: fix cdev name leak Keith Busch
@ 2023-05-01 15:33   ` Keith Busch
  2023-05-02 12:20     ` Johannes Thumshirn
  2023-05-03  5:04     ` Christoph Hellwig
  2023-05-01 19:01   ` [RFC 0/3] nvme uring passthrough diet Kanchan Joshi
  2023-05-03  7:27   ` Kanchan Joshi
  4 siblings, 2 replies; 15+ messages in thread
From: Keith Busch @ 2023-05-01 15:33 UTC (permalink / raw)
  To: linux-nvme, linux-block, joshi.k, axboe, hch, xiaoguang.wang; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The cdev only services passthrough commands which don't merge, track
stats, or need accounting. Give it a special request queue with all
these options cleared so that we're not adding overhead for it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c  | 29 +++++++++++++++++++++++++----
 drivers/nvme/host/ioctl.c |  4 ++--
 drivers/nvme/host/nvme.h  |  1 +
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0f1cb6f418182..5bb05c91d866d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4071,21 +4071,39 @@ static const struct file_operations nvme_ns_chr_fops = {
 
 static int nvme_add_ns_cdev(struct nvme_ns *ns)
 {
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q;
 	int ret;
 
-	ns->cdev_device.parent = ns->ctrl->device;
+	ns->cdev_device.parent = ctrl->device;
 	ret = dev_set_name(&ns->cdev_device, "ng%dn%d",
-			   ns->ctrl->instance, ns->head->instance);
+			   ctrl->instance, ns->head->instance);
 	if (ret)
 		return ret;
 
 	ret = nvme_cdev_add(&ns->cdev, &ns->cdev_device, &nvme_ns_chr_fops,
-			     ns->ctrl->ops->module);
+			     ctrl->ops->module);
 	if (ret)
 		goto out_free_name;
 
+	q = blk_mq_init_queue(ctrl->tagset);
+	if (IS_ERR(q)) {
+		ret = PTR_ERR(q);
+		goto out_free_cdev;
+	}
+
+	blk_queue_flag_clear(QUEUE_FLAG_IO_STAT, q);
+	blk_queue_flag_clear(QUEUE_FLAG_STATS, q);
+	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
+	q->queuedata = ns;
+	ns->cdev_queue = q;
+
 	return 0;
 
+out_free_cdev:
+	cdev_device_del(&ns->cdev, &ns->cdev_device);
+	return ret;
+
 out_free_name:
 	kfree_const(ns->cdev_device.kobj.name);
 	return ret;
@@ -4399,8 +4417,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	/* guarantee not available in head->list */
 	synchronize_srcu(&ns->head->srcu);
 
-	if (!nvme_ns_head_multipath(ns->head))
+	if (!nvme_ns_head_multipath(ns->head)) {
+		blk_mq_destroy_queue(ns->cdev_queue);
+		blk_put_queue(ns->cdev_queue);
 		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
+	}
 	del_gendisk(ns->disk);
 
 	down_write(&ns->ctrl->namespaces_rwsem);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 3804e5205b42b..bf4fcb5d270e9 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -553,7 +553,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 {
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
 	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
-	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+	struct request_queue *q = ns ? ns->cdev_queue : ctrl->admin_q;
 	struct nvme_uring_data d;
 	struct nvme_command c;
 	struct request *req;
@@ -791,7 +791,7 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
 	bio = READ_ONCE(ioucmd->cookie);
 	ns = container_of(file_inode(ioucmd->file)->i_cdev,
 			struct nvme_ns, cdev);
-	q = ns->queue;
+	q = ns->cdev_queue;
 	if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio && bio->bi_bdev)
 		ret = bio_poll(bio, iob, poll_flags);
 	rcu_read_unlock();
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1e..d837c118f4f18 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,6 +495,7 @@ struct nvme_ns {
 
 	struct cdev		cdev;
 	struct device		cdev_device;
+	struct request_queue	*cdev_queue;
 
 	struct nvme_fault_inject fault_inject;
 
-- 
2.34.1




* Re: [RFC 0/3] nvme uring passthrough diet
  2023-05-01 15:33 ` [RFC 0/3] nvme uring passthrough diet Keith Busch
                     ` (2 preceding siblings ...)
  2023-05-01 15:33   ` [RFC 3/3] nvme: create special request queue for cdev Keith Busch
@ 2023-05-01 19:01   ` Kanchan Joshi
  2023-05-01 19:31     ` Keith Busch
  2023-05-03  7:27   ` Kanchan Joshi
  4 siblings, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2023-05-01 19:01 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, axboe, hch, xiaoguang.wang, Keith Busch


On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
>From: Keith Busch <kbusch@kernel.org>
>
>When you disable all the optional features in your kernel config and
>request queue, it looks like the normal request dispatching is just as
>fast as any attempts to bypass it. 

Did a quick run, and it is not the case.
For that workload [1], this one moves to 3.4M from 3M, while pure
bypass moved to 5M from 3M. Maybe it could have gone higher than 5M
(which was the HW limit); I have not checked that part yet.

[1]
# t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
submitter=0, tid=2935, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=3.31M, BW=1616MiB/s, IOS/call=2/1
IOPS=3.36M, BW=1642MiB/s, IOS/call=2/2
IOPS=3.40M, BW=1661MiB/s, IOS/call=2/1
Exiting on timeout
Maximum IOPS=3.40M





* Re: [RFC 0/3] nvme uring passthrough diet
  2023-05-01 19:01   ` [RFC 0/3] nvme uring passthrough diet Kanchan Joshi
@ 2023-05-01 19:31     ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2023-05-01 19:31 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Keith Busch, linux-nvme, linux-block, axboe, hch, xiaoguang.wang

On Tue, May 02, 2023 at 12:31:28AM +0530, Kanchan Joshi wrote:
> On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > When you disable all the optional features in your kernel config and
> > request queue, it looks like the normal request dispatching is just as
> > fast as any attempts to bypass it.
> 
> Did a quick run, and it is not the case.
> For that workload [1], this one moves to 3.4M from 3M.

There's more to trim, more work to do. I was running not quite as
heavy a kernel config as I needed, but I've got something more
appropriate to experiment with now.



* Re: [RFC 3/3] nvme: create special request queue for cdev
  2023-05-01 15:33   ` [RFC 3/3] nvme: create special request queue for cdev Keith Busch
@ 2023-05-02 12:20     ` Johannes Thumshirn
  2023-05-03  5:04     ` Christoph Hellwig
  1 sibling, 0 replies; 15+ messages in thread
From: Johannes Thumshirn @ 2023-05-02 12:20 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, joshi.k, axboe, hch,
	xiaoguang.wang
  Cc: Keith Busch

On 01.05.23 17:33, Keith Busch wrote:
> +	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);

Shouldn't that be blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);


* Re: [RFC 1/3] nvme: skip block cgroups for passthrough commands
  2023-05-01 15:33   ` [RFC 1/3] nvme: skip block cgroups for passthrough commands Keith Busch
@ 2023-05-03  5:04     ` Christoph Hellwig
  2023-05-03 15:25       ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-05-03  5:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, joshi.k, axboe, hch, xiaoguang.wang,
	Keith Busch

On Mon, May 01, 2023 at 08:33:04AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Passthrough requests don't go through the submit_bio() path, so all the
> overhead of setting up the bio's cgroup is wasted cycles. Provide a path
> to skip this setup.

These days we should not need to set bi_bdev at all for passthrough,
so I think we can just drop the assignment.

But instead of just optimizing for passthrough we really need to optimize
this assignment and get rid of the cost entirely.  What is so expensive
about the cgroup lookup?  Is this even for a device that uses cgroups
at all, or could we come up with a flag to bypass all the lookup unless
cgroups are enabled?



* Re: [RFC 3/3] nvme: create special request queue for cdev
  2023-05-01 15:33   ` [RFC 3/3] nvme: create special request queue for cdev Keith Busch
  2023-05-02 12:20     ` Johannes Thumshirn
@ 2023-05-03  5:04     ` Christoph Hellwig
  2023-05-03 14:56       ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-05-03  5:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, joshi.k, axboe, hch, xiaoguang.wang,
	Keith Busch

On Mon, May 01, 2023 at 08:33:06AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The cdev only services passthrough commands which don't merge, track
> stats, or need accounting. Give it a special request queue with all
> these options cleared so that we're not adding overhead for it.

Why can't we always skip these for passthrough commands on any queue
with a little bit of core code?



* Re: [RFC 0/3] nvme uring passthrough diet
  2023-05-01 15:33 ` [RFC 0/3] nvme uring passthrough diet Keith Busch
                     ` (3 preceding siblings ...)
  2023-05-01 19:01   ` [RFC 0/3] nvme uring passthrough diet Kanchan Joshi
@ 2023-05-03  7:27   ` Kanchan Joshi
  2023-05-03 15:20     ` Keith Busch
  4 siblings, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2023-05-03  7:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, axboe, hch, xiaoguang.wang, Keith Busch


On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
>From: Keith Busch <kbusch@kernel.org>
>
>When you disable all the optional features in your kernel config and
>request queue, it looks like the normal request dispatching is just as
>fast as any attempts to bypass it. So let's do that instead of
>reinventing everything.
>
>This doesn't require additional queues or user setup. It continues to
>work with multiple threads and processes, and relies on the well tested
>queueing mechanisms that track timeouts, handle tag exhaustion, and sync
>with controller state needed for reset control, hotplug events, and
>other error handling.

I agree with your point that there are some functional holes in
the complete-bypass approach. Yet the work needed to be done to
figure out the gain of the approach and see whether the effort to
fill these holes is worth it.

On your specific points
- requiring additional queues: not a showstopper IMO.
  If queues are lying unused in the HW, we can reap more performance by
  giving those to the application. If not, we fall back to the existing
  path. No disruption as such.
- tag exhaustion: that is not missing, a retry will be made. I actually
  wanted to do single command-id management at the io_uring level itself,
  and that would have cleaned things up. But it did not fit in
  because of submission/completion lifetime differences.
- timeout and other bits you mentioned: yes, those need more work.

Now with the alternative proposed in this series, I doubt whether similar
gains are possible. Happy to be wrong if that happens.
Please note that for some non-block command sets, passthrough is the only
usable interface. So these users would want some of the functionality
bits too (e.g. cgroups). Cgroups are broken for passthrough at the
moment, and I wanted to do something about that too.

Overall, the usage model that I imagine with multiple paths is this -

1. existing block IO path: for block-friendly command-sets
2. existing passthrough IO path: for non-block command sets
3. new pure-bypass variant: for both; and this one deliberately trims all
the fat at the expense of some features/functionality.

#2 will not have all the features of #1, but it is good to have all that
are necessary and do not have semantic troubles fitting in. And these may
grow over time, leading to a kernel that has improved parity between
block and non-block IO.
Do you think this makes sense?





* Re: [RFC 3/3] nvme: create special request queue for cdev
  2023-05-03  5:04     ` Christoph Hellwig
@ 2023-05-03 14:56       ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2023-05-03 14:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, joshi.k, axboe, xiaoguang.wang

On Wed, May 03, 2023 at 07:04:57AM +0200, Christoph Hellwig wrote:
> On Mon, May 01, 2023 at 08:33:06AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > The cdev only services passthrough commands which don't merge, track
> > stats, or need accounting. Give it a special request queue with all
> > these options cleared so that we're not adding overhead for it.
> 
> Why can't we always skip these for passthrough commands on any queue
> with a little bit of core code?

A special queue without a gendisk made it a little easier to ensure
it's not subscribed to any rq_qos features: iolatency, wbt,
blk-throttle. We might be able to add more blk_rq_is_passthrough()
checks for things we want to skip and get the same benefit. I'll look
into it.



* Re: [RFC 0/3] nvme uring passthrough diet
  2023-05-03  7:27   ` Kanchan Joshi
@ 2023-05-03 15:20     ` Keith Busch
  2023-05-05  8:14       ` Kanchan Joshi
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2023-05-03 15:20 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Keith Busch, linux-nvme, linux-block, axboe, hch, xiaoguang.wang

On Wed, May 03, 2023 at 12:57:17PM +0530, Kanchan Joshi wrote:
> On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > When you disable all the optional features in your kernel config and
> > request queue, it looks like the normal request dispatching is just as
> > fast as any attempts to bypass it. So let's do that instead of
> > reinventing everything.
> > 
> > This doesn't require additional queues or user setup. It continues to
> > work with multiple threads and processes, and relies on the well tested
> queueing mechanisms that track timeouts, handle tag exhaustion, and sync
> > with controller state needed for reset control, hotplug events, and
> > other error handling.
> 
> I agree with your point that there are some functional holes in
> the complete-bypass approach. Yet the work was needed to be done
> to figure out the gain (of approach) and see whether the effort to fill
> these holes is worth.
> 
> On your specific points
> - requiring additional queues: not a showstopper IMO.
>  If queues are lying unused with HW, we can reap more performance by
>  giving those to application. If not, we fall back to the existing path.
>  No disruption as such.

The current way we're reserving special queues is bad, and we should
try not to extend it further. It applies to the whole module and
would steal resources from devices that don't want poll queues.
If you have a mix of device types in your system, the low end ones
don't want to split their resources this way.

NVMe has no problem creating new queues on the fly. Queue allocation
doesn't have to be an initialization thing, but you would need to
reserve the QIDs ahead of time.

> - tag exhaustion: that is not missing, a retry will be made. I actually
>  wanted to do single command-id management at the io_uring level itself,
>  and that would have cleaned things up. But it did not fit in
>  because of submission/completion lifetime differences.
> - timeout and other bits you mentioned: yes, those need more work.
> 
> Now with the alternate proposed in this series, I doubt whether similar
> gains are possible. Happy to be wrong if that happens.

One other thing: pure bypass does appear better at low queue depths,
but utilizing the plug for aggregated sq doorbell writes is a real win
for this series at higher queue depths. Batching submissions at 4 deep
is the tipping point on my test box; this series outperforms pure
bypass at any higher batch count.

Part of the problem with the low queue depth for this mode is the
xarray lookup on each bio_poll, but I think that we can fix that.

> Please note that for some non-block command sets, passthrough is the only
> usable interface. So these users would want some of the functionality
> bits too (e.g. cgroups). Cgroups is broken for the passthrough at the
> moment, and I wanted to do something about that too.
> 
> Overall, the usage model that I imagine with multiple paths is this -
> 
> 1. existing block IO path: for block-friendly command-sets
> 2. existing passthrough IO path: for non-block command sets
> 3. new pure-bypass variant: for both; and this one deliberately trims all
> the fat at the expense of some features/functionality.
> 
> #2 will not have all the features of #1, but good to have all that are
> necessary and do not have semantic troubles to fit in. And these may
> grow over time, leading to a kernel that has improved parity between block
> and non-block io.
> Do you think this makes sense?





* Re: [RFC 1/3] nvme: skip block cgroups for passthrough commands
  2023-05-03  5:04     ` Christoph Hellwig
@ 2023-05-03 15:25       ` Keith Busch
  2023-05-15 15:47         ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2023-05-03 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, joshi.k, axboe, xiaoguang.wang

On Wed, May 03, 2023 at 07:04:14AM +0200, Christoph Hellwig wrote:
> On Mon, May 01, 2023 at 08:33:04AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > Passthrough requests don't go through the submit_bio() path, so all the
> > overhead of setting up the bio's cgroup is wasted cycles. Provide a path
> > to skip this setup.
> 
> These days we should not need to set bi_bdev at all for passthrough,
> so I think we can just drop the assingment.

We can't really skip it for polling since that needs a bio with a
bdev. I'll take another shot at untangling that requirement.
 
> But instead of just optimizing for passthrough we really need to optimize
> this assignment and get rid of the cost entirely.  What is so expensive
> about the cgroup lookup?  Is this even for a device that uses cgroups
> at all, or could we come up with a flag to bypass all the lookup unless
> cgroup are anbled?

It's not super expensive. It's just unnecessary for this usage and
this seemed like the easiest way to avoid it.



* Re: [RFC 0/3] nvme uring passthrough diet
  2023-05-03 15:20     ` Keith Busch
@ 2023-05-05  8:14       ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-05-05  8:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, linux-nvme, linux-block, axboe, hch, xiaoguang.wang


On Wed, May 03, 2023 at 09:20:04AM -0600, Keith Busch wrote:
>On Wed, May 03, 2023 at 12:57:17PM +0530, Kanchan Joshi wrote:
>> On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
>> > From: Keith Busch <kbusch@kernel.org>
>> >
>> > When you disable all the optional features in your kernel config and
>> > request queue, it looks like the normal request dispatching is just as
>> > fast as any attempts to bypass it. So let's do that instead of
>> > reinventing everything.
>> >
>> > This doesn't require additional queues or user setup. It continues to
>> > work with multiple threads and processes, and relies on the well tested
>> queueing mechanisms that track timeouts, handle tag exhaustion, and sync
>> > with controller state needed for reset control, hotplug events, and
>> > other error handling.
>>
>> I agree with your point that there are some functional holes in
>> the complete-bypass approach. Yet the work was needed to be done
>> to figure out the gain (of approach) and see whether the effort to fill
>> these holes is worth.
>>
>> On your specific points
>> - requiring additional queues: not a showstopper IMO.
>>  If queues are lying unused with HW, we can reap more performance by
>>  giving those to application. If not, we fall back to the existing path.
>>  No disruption as such.
>
>The current way we're reserving special queues is bad, and we should
>try not to extend it further. It applies to the whole module and
>would steal resources from devices that don't want poll queues.
>If you have a mix of device types in your system, the low end ones
>don't want to split their resources this way.
>
>NVMe has no problem creating new queues on the fly. Queue allocation
>doesn't have to be an initialization thing, but you would need to
>reserve the QIDs ahead of time.

Totally in agreement with that. Jens also mentioned this point.
And I had added preallocation in my to-be-killed list. Thanks for
expanding.
Related to that, I think the one-QID-per-ring limit also needs to be
lifted. That should allow doing IO on two or more devices with a
single ring and seeing how well that scales.

>> - tag exhaustion: that is not missing, a retry will be made. I actually
>>  wanted to do single command-id management at the io_uring level itself,
>>  and that would have cleaned things up. But it did not fit in
>>  because of submission/completion lifetime differences.
>> - timeout and other bits you mentioned: yes, those need more work.
>>
>> Now with the alternate proposed in this series, I doubt whether similar
>> gains are possible. Happy to be wrong if that happens.
>
>One other thing: the pure-bypass does appear better at low queue
>depths, but utilizing the plug for aggregated sq doorbell writes
>is a real win at higher queue depths from this series. Batching
>submissions at 4 deep is the tipping point on my test box; this
>series outperforms pure bypass at any higher batch count.

I see.
I hit the 5M cliff without plug/batching primarily because pure bypass
reduces the amount of code needed to do the IO. But plug/batching is
needed to get better at this.
If we create space for a pointer in io_uring_cmd, it can be added to
the plug list (in place of struct request). That would be one way to
sort out the plugging.





* Re: [RFC 1/3] nvme: skip block cgroups for passthrough commands
  2023-05-03 15:25       ` Keith Busch
@ 2023-05-15 15:47         ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2023-05-15 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, joshi.k, axboe, xiaoguang.wang

On Wed, May 03, 2023 at 09:25:21AM -0600, Keith Busch wrote:
> On Wed, May 03, 2023 at 07:04:14AM +0200, Christoph Hellwig wrote:
> > On Mon, May 01, 2023 at 08:33:04AM -0700, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > > 
> > > Passthrough requests don't go through the submit_bio() path, so all the
> > > overhead of setting up the bio's cgroup is wasted cycles. Provide a path
> > > to skip this setup.
> > 
> > These days we should not need to set bi_bdev at all for passthrough,
> > so I think we can just drop the assingment.
> 
> We can't really skip it for polling since that needs a bio with a
> bdev. I'll take another shot at untangling that requirement.

I've got a new version ready to go that removes the bio and bi_bdev
requirement for polling uring commands, but passthrough still needs to
set bi_bdev for metadata since it's used in bio_integrity_add_page(). I
suppose we could have that just take the queue limits directly instead
of extracting them from an assumed bi_bdev.

