* [PATCH rfc 0/1] nvme-mpath: Add IO stats support
@ 2022-09-28 19:55 Sagi Grimberg
  2022-09-28 19:55 ` [PATCH rfc] nvme: support io stats on the mpath device Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-28 19:55 UTC (permalink / raw)
  To: linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Jens Axboe, Hannes Reinecke

I've been hearing complaints about the fact that the nvme mpath stack
device does not expose IO stats like any normal block device; instead,
people need to check the bottom namespaces' hidden devices and map them
back to the mpath device node.

This really sucks, especially for the observability hooks/plugins that
I've seen people build.

This is an attempt to make the nvme mpath device expose normal IO stats.
Given that nvme-mpath doesn't have any context after submitting the bio,
we use the core completion path to start/end stats accounting on its
behalf, a similar practice that we use for other multipath related stuff.

Given that it's not too invasive, I decided to keep it
in a single patch, but I can certainly split it into
smaller patches if anyone thinks I should.

Feedback is welcome.

Sagi Grimberg (1):
  nvme: support io stats on the mpath device

 drivers/nvme/host/apple.c     |  2 +-
 drivers/nvme/host/core.c      | 10 ++++++++++
 drivers/nvme/host/fc.c        |  2 +-
 drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
 drivers/nvme/host/nvme.h      | 13 +++++++++++++
 drivers/nvme/host/pci.c       |  2 +-
 drivers/nvme/host/rdma.c      |  2 +-
 drivers/nvme/host/tcp.c       |  2 +-
 drivers/nvme/target/loop.c    |  2 +-
 9 files changed, 47 insertions(+), 6 deletions(-)

-- 
2.34.1




* [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-28 19:55 [PATCH rfc 0/1] nvme-mpath: Add IO stats support Sagi Grimberg
@ 2022-09-28 19:55 ` Sagi Grimberg
  2022-09-29  9:42   ` Max Gurtovoy
  2022-09-29 10:04   ` Sagi Grimberg
  0 siblings, 2 replies; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-28 19:55 UTC (permalink / raw)
  To: linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Jens Axboe, Hannes Reinecke

Our mpath stack device is just a shim that selects a bottom namespace
and submits the bio to it without any fancy splitting. This also means
that we don't clone the bio or have any context to the bio beyond
submission. However, it really sucks that we don't see the mpath device
io stats.

Given that the mpath device can't do that without adding some context
to it, we let the bottom device do it on its behalf (somewhat similar
to the approach taken in nvme_trace_bio_complete).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/apple.c     |  2 +-
 drivers/nvme/host/core.c      | 10 ++++++++++
 drivers/nvme/host/fc.c        |  2 +-
 drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
 drivers/nvme/host/nvme.h      | 12 ++++++++++++
 drivers/nvme/host/pci.c       |  2 +-
 drivers/nvme/host/rdma.c      |  2 +-
 drivers/nvme/host/tcp.c       |  2 +-
 drivers/nvme/target/loop.c    |  2 +-
 9 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index 5fc5ea196b40..6df4b8a5d8ab 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -763,7 +763,7 @@ static blk_status_t apple_nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 			goto out_free_cmd;
 	}
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	apple_nvme_submit_cmd(q, cmnd);
 	return BLK_STS_OK;
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9bacfd014e3d..f42e6e40d84b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
 	nvme_end_req_zoned(req);
 	nvme_trace_bio_complete(req);
 	blk_mq_end_request(req, status);
+	if (req->cmd_flags & REQ_NVME_MPATH)
+		nvme_mpath_end_request(req);
 }
 
 void nvme_complete_rq(struct request *req)
@@ -419,6 +421,14 @@ void nvme_complete_rq(struct request *req)
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
+void nvme_start_request(struct request *rq)
+{
+	if (rq->cmd_flags & REQ_NVME_MPATH)
+		nvme_mpath_start_request(rq);
+	blk_mq_start_request(rq);
+}
+EXPORT_SYMBOL_GPL(nvme_start_request);
+
 void nvme_complete_batch_req(struct request *req)
 {
 	trace_nvme_complete_rq(req);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 127abaf9ba5d..2cdcc7f5d0a9 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2744,7 +2744,7 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 	atomic_set(&op->state, FCPOP_STATE_ACTIVE);
 
 	if (!(op->flags & FCOP_FLAGS_AEN))
-		blk_mq_start_request(op->rq);
+		nvme_start_request(op->rq);
 
 	cmdiu->csn = cpu_to_be32(atomic_inc_return(&queue->csn));
 	ret = ctrl->lport->ops->fcp_io(&ctrl->lport->localport,
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6ef497c75a16..9d2ff9ed8c6c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -114,6 +114,23 @@ void nvme_failover_req(struct request *req)
 	kblockd_schedule_work(&ns->head->requeue_work);
 }
 
+void nvme_mpath_start_request(struct request *rq)
+{
+	struct nvme_ns *ns = rq->q->queuedata;
+
+	nvme_req(rq)->start_time = bdev_start_io_acct(ns->head->disk->part0,
+					blk_rq_bytes(rq) >> SECTOR_SHIFT,
+					req_op(rq), jiffies);
+}
+
+void nvme_mpath_end_request(struct request *rq)
+{
+	struct nvme_ns *ns = rq->q->queuedata;
+
+	bdev_end_io_acct(ns->head->disk->part0,
+		req_op(rq), nvme_req(rq)->start_time);
+}
+
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
@@ -501,6 +518,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, head->disk->queue);
 	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_IO_STAT, head->disk->queue);
 	/*
 	 * This assumes all controllers that refer to a namespace either
 	 * support poll queues or not.  That is not a strict guarantee,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2d5d44a73f26..08f94c404cd8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -162,6 +162,9 @@ struct nvme_request {
 	u8			retries;
 	u8			flags;
 	u16			status;
+#ifdef CONFIG_NVME_MULTIPATH
+	unsigned long		start_time;
+#endif
 	struct nvme_ctrl	*ctrl;
 };
 
@@ -753,6 +756,7 @@ static inline enum req_op nvme_req_op(struct nvme_command *cmd)
 }
 
 #define NVME_QID_ANY -1
+void nvme_start_request(struct request *rq);
 void nvme_init_request(struct request *req, struct nvme_command *cmd);
 void nvme_cleanup_cmd(struct request *req);
 blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req);
@@ -861,6 +865,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
+void nvme_mpath_start_request(struct request *rq);
+void nvme_mpath_end_request(struct request *rq);
 
 static inline void nvme_trace_bio_complete(struct request *req)
 {
@@ -946,6 +952,12 @@ static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
 static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
 }
+static inline void nvme_mpath_start_request(struct request *rq)
+{
+}
+static inline void nvme_mpath_end_request(struct request *rq)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 int nvme_revalidate_zones(struct nvme_ns *ns);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3bdb97205699..e898b9e4e6e0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -929,7 +929,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 			goto out_unmap_data;
 	}
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 8e52d2362fa1..ab9d5a17704b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2089,7 +2089,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		goto unmap_qe;
 
-	blk_mq_start_request(rq);
+	nvme_start_request(rq);
 
 	if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
 	    queue->pi_support &&
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 2524b5304bfb..a1df405de7f1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2461,7 +2461,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (unlikely(ret))
 		return ret;
 
-	blk_mq_start_request(rq);
+	nvme_start_request(rq);
 
 	nvme_tcp_queue_request(req, true, bd->last);
 
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9750a7fca268..c327615decc2 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -145,7 +145,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
-	blk_mq_start_request(req);
+	nvme_start_request(req);
 	iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
 	iod->req.port = queue->ctrl->port;
 	if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
-- 
2.34.1




* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-28 19:55 ` [PATCH rfc] nvme: support io stats on the mpath device Sagi Grimberg
@ 2022-09-29  9:42   ` Max Gurtovoy
  2022-09-29  9:59     ` Sagi Grimberg
  2022-09-29 10:04   ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread
From: Max Gurtovoy @ 2022-09-29  9:42 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Jens Axboe, Hannes Reinecke

Hi Sagi,

On 9/28/2022 10:55 PM, Sagi Grimberg wrote:
> Our mpath stack device is just a shim that selects a bottom namespace
> and submits the bio to it without any fancy splitting. This also means
> that we don't clone the bio or have any context to the bio beyond
> submission. However it really sucks that we don't see the mpath device
> io stats.
>
> Given that the mpath device can't do that without adding some context
> to it, we let the bottom device do it on its behalf (somewhat similar
> to the approach taken in nvme_trace_bio_complete);

Can you please paste the output of the application that shows the 
benefit of this commit ?

>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/apple.c     |  2 +-
>   drivers/nvme/host/core.c      | 10 ++++++++++
>   drivers/nvme/host/fc.c        |  2 +-
>   drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>   drivers/nvme/host/nvme.h      | 12 ++++++++++++
>   drivers/nvme/host/pci.c       |  2 +-
>   drivers/nvme/host/rdma.c      |  2 +-
>   drivers/nvme/host/tcp.c       |  2 +-
>   drivers/nvme/target/loop.c    |  2 +-
>   9 files changed, 46 insertions(+), 6 deletions(-)

Several questions:

1. I guess that for the non-mpath case we get this for free from the 
block layer for each bio ?

2. Now we have doubled the accounting, haven't we ?

3. Do you have some performance numbers (we're touching the fast path 
here) ?

4. Should we enable this by default ?


The implementation looks good.




* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29  9:42   ` Max Gurtovoy
@ 2022-09-29  9:59     ` Sagi Grimberg
  2022-09-29 10:25       ` Max Gurtovoy
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-29  9:59 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Jens Axboe, Hannes Reinecke


> Hi Sagi,
> 
> On 9/28/2022 10:55 PM, Sagi Grimberg wrote:
>> Our mpath stack device is just a shim that selects a bottom namespace
>> and submits the bio to it without any fancy splitting. This also means
>> that we don't clone the bio or have any context to the bio beyond
>> submission. However it really sucks that we don't see the mpath device
>> io stats.
>>
>> Given that the mpath device can't do that without adding some context
>> to it, we let the bottom device do it on its behalf (somewhat similar
>> to the approach taken in nvme_trace_bio_complete);
> 
> Can you please paste the output of the application that shows the 
> benefit of this commit ?

What do you mean? There is no noticeable effect on the application here.
With this patch applied, /sys/block/nvmeXnY/stat is not zeroed out, so
sysstat and friends, as well as other observability tools, can monitor
IO stats.

> 
>>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>>   drivers/nvme/host/apple.c     |  2 +-
>>   drivers/nvme/host/core.c      | 10 ++++++++++
>>   drivers/nvme/host/fc.c        |  2 +-
>>   drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>>   drivers/nvme/host/nvme.h      | 12 ++++++++++++
>>   drivers/nvme/host/pci.c       |  2 +-
>>   drivers/nvme/host/rdma.c      |  2 +-
>>   drivers/nvme/host/tcp.c       |  2 +-
>>   drivers/nvme/target/loop.c    |  2 +-
>>   9 files changed, 46 insertions(+), 6 deletions(-)
> 
> Several questions:
> 
> 1. I guess that for the non-mpath case we get this for free from the 
> block layer for each bio ?

blk-mq provides all IO stat accounting, hence it is on by default.

> 2. Now we have doubled the accounting, haven't we ?

Yes. But as I listed in the cover letter, I've been getting complaints
about how IO stats appear only for the hidden devices (blk-mq devices)
and there is non-trivial logic to map them back to the mpath device,
which can also depend on the path selection logic...

I think that this is very much justified; the observability experience
sucks. IMO we should have done it since introducing nvme-multipath.

> 3. Do you have some performance numbers (we're touching the fast path 
> here) ?

This is pretty light-weight: accounting is per-cpu and only wrapped by
preemption disable. This is a very small price to pay for what we gain.

I don't have any performance numbers, other than from my laptop VM that
did not record any noticeable difference, which I don't expect to have.
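
To put the "light-weight" claim in context, this is roughly what
bdev_start_io_acct() does per I/O (a simplified sketch from memory, not a
verbatim copy of block/blk-core.c; part_stat_lock() boils down to
preempt_disable()):

unsigned long bdev_start_io_acct(struct block_device *bdev,
                                 unsigned int sectors, enum req_op op,
                                 unsigned long start_time)
{
        const int sgrp = op_stat_group(op);

        part_stat_lock();                       /* preempt_disable() */
        update_io_ticks(bdev, start_time, false);
        part_stat_inc(bdev, ios[sgrp]);         /* per-cpu counters */
        part_stat_add(bdev, sectors[sgrp], sectors);
        part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
        part_stat_unlock();

        return start_time;
}

The completion side (bdev_end_io_acct) is symmetric: it adds the I/O time
to the per-cpu nsecs counter and drops in_flight.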

> 4. Should we enable this by default ?

Yes. There is no reason why nvme-mpath should be the only block device
that does not account for and expose IO stats.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-28 19:55 ` [PATCH rfc] nvme: support io stats on the mpath device Sagi Grimberg
  2022-09-29  9:42   ` Max Gurtovoy
@ 2022-09-29 10:04   ` Sagi Grimberg
  2022-09-29 15:07     ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-29 10:04 UTC (permalink / raw)
  To: linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Jens Axboe, Hannes Reinecke



On 9/28/22 22:55, Sagi Grimberg wrote:
> Our mpath stack device is just a shim that selects a bottom namespace
> and submits the bio to it without any fancy splitting. This also means
> that we don't clone the bio or have any context to the bio beyond
> submission. However it really sucks that we don't see the mpath device
> io stats.
> 
> Given that the mpath device can't do that without adding some context
> to it, we let the bottom device do it on its behalf (somewhat similar
> to the approach taken in nvme_trace_bio_complete);
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/apple.c     |  2 +-
>   drivers/nvme/host/core.c      | 10 ++++++++++
>   drivers/nvme/host/fc.c        |  2 +-
>   drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>   drivers/nvme/host/nvme.h      | 12 ++++++++++++
>   drivers/nvme/host/pci.c       |  2 +-
>   drivers/nvme/host/rdma.c      |  2 +-
>   drivers/nvme/host/tcp.c       |  2 +-
>   drivers/nvme/target/loop.c    |  2 +-
>   9 files changed, 46 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
> index 5fc5ea196b40..6df4b8a5d8ab 100644
> --- a/drivers/nvme/host/apple.c
> +++ b/drivers/nvme/host/apple.c
> @@ -763,7 +763,7 @@ static blk_status_t apple_nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>   			goto out_free_cmd;
>   	}
>   
> -	blk_mq_start_request(req);
> +	nvme_start_request(req);
>   	apple_nvme_submit_cmd(q, cmnd);
>   	return BLK_STS_OK;
>   
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 9bacfd014e3d..f42e6e40d84b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
>   	nvme_end_req_zoned(req);
>   	nvme_trace_bio_complete(req);
>   	blk_mq_end_request(req, status);
> +	if (req->cmd_flags & REQ_NVME_MPATH)
> +		nvme_mpath_end_request(req);

I guess the order should probably be reversed, because after
blk_mq_end_request req may become invalid and create UAF?
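
I.e. something like this (just a sketch of the reordered hunk, assuming
nothing else needs to touch the request after the accounting):

static inline void nvme_end_req(struct request *req)
{
	blk_status_t status = nvme_error_status(nvme_req(req)->status);

	/* ... error logging and zone handling unchanged ... */
	nvme_end_req_zoned(req);
	nvme_trace_bio_complete(req);
	if (req->cmd_flags & REQ_NVME_MPATH)
		nvme_mpath_end_request(req);
	blk_mq_end_request(req, status);	/* may release the tag */
}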



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29  9:59     ` Sagi Grimberg
@ 2022-09-29 10:25       ` Max Gurtovoy
  2022-09-29 15:03       ` Keith Busch
  2022-09-29 15:05       ` Jens Axboe
  2 siblings, 0 replies; 23+ messages in thread
From: Max Gurtovoy @ 2022-09-29 10:25 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Jens Axboe, Hannes Reinecke


On 9/29/2022 12:59 PM, Sagi Grimberg wrote:
>
>> Hi Sagi,
>>
>> On 9/28/2022 10:55 PM, Sagi Grimberg wrote:
>>> Our mpath stack device is just a shim that selects a bottom namespace
>>> and submits the bio to it without any fancy splitting. This also means
>>> that we don't clone the bio or have any context to the bio beyond
>>> submission. However it really sucks that we don't see the mpath device
>>> io stats.
>>>
>>> Given that the mpath device can't do that without adding some context
>>> to it, we let the bottom device do it on its behalf (somewhat similar
>>> to the approach taken in nvme_trace_bio_complete);
>>
>> Can you please paste the output of the application that shows the 
>> benefit of this commit ?
>
> What do you mean? there is no noticeable effect on the application here.
> With this patch applied, /sys/block/nvmeXnY/stat is not zeroed out,
> sysstat and friends can monitor IO stats, as well as other observability
> tools.
>
I meant the output of the iostat application/tool.

This will show us the double accounting I mentioned below.

I guess it's the same situation we have today with /dev/dm-0 and its 
underlying devices /dev/sdb and /dev/sdc, for example.

This should be explained IMO.

>>
>>>
>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>> ---
>>>   drivers/nvme/host/apple.c     |  2 +-
>>>   drivers/nvme/host/core.c      | 10 ++++++++++
>>>   drivers/nvme/host/fc.c        |  2 +-
>>>   drivers/nvme/host/multipath.c | 18 ++++++++++++++++++
>>>   drivers/nvme/host/nvme.h      | 12 ++++++++++++
>>>   drivers/nvme/host/pci.c       |  2 +-
>>>   drivers/nvme/host/rdma.c      |  2 +-
>>>   drivers/nvme/host/tcp.c       |  2 +-
>>>   drivers/nvme/target/loop.c    |  2 +-
>>>   9 files changed, 46 insertions(+), 6 deletions(-)
>>
>> Several questions:
>>
>> 1. I guess that for the non-mpath case we get this for free from the 
>> block layer for each bio ?
>
> blk-mq provides all IO stat accounting, hence it is on by default.
>
>> 2. Now we have doubled the accounting, haven't we ?
>
> Yes. But as I listed in the cover-letter, I've been getting complaints
> about how IO stats appear only for the hidden devices (blk-mq devices)
> and there is an non-trivial logic to map that back to the mpath device,
> which can also depend on the path selection logic...
>
> I think that this is very much justified, the observability experience
> sucks. IMO we should have done it since introducing nvme-multipath.
>
>> 3. Do you have some performance numbers (we're touching the fast path 
>> here) ?
>
> This is pretty light-weight, accounting is per-cpu and only wrapped by
> preemption disable. This is a very small price to pay for what we gain.
>
> I don't have any performance numbers, other than on my laptop VM that
> did not record any noticeable difference, which I don't expect to have.
>
>> 4. Should we enable this by default ?
>
> Yes. there is no reason why nvme-mpath should be the only block device
> that does not account and expose IO stats.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29  9:59     ` Sagi Grimberg
  2022-09-29 10:25       ` Max Gurtovoy
@ 2022-09-29 15:03       ` Keith Busch
  2022-09-29 16:14         ` Sagi Grimberg
  2022-09-29 16:32         ` Sagi Grimberg
  2022-09-29 15:05       ` Jens Axboe
  2 siblings, 2 replies; 23+ messages in thread
From: Keith Busch @ 2022-09-29 15:03 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke

On Thu, Sep 29, 2022 at 12:59:46PM +0300, Sagi Grimberg wrote:
> > 3. Do you have some performance numbers (we're touching the fast path
> > here) ?
> 
> This is pretty light-weight, accounting is per-cpu and only wrapped by
> preemption disable. This is a very small price to pay for what we gain.

It does add up, though, and some environments disable stats to skip the
overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
assuming you need to account for stats.
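
For example, something along these lines in the mpath start/end hooks
(just a sketch; the NVME_MPATH_IO_STATS request flag is made up here to
latch the decision until completion):

void nvme_mpath_start_request(struct request *rq)
{
	struct nvme_ns *ns = rq->q->queuedata;
	struct gendisk *disk = ns->head->disk;

	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
		return;

	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;	/* hypothetical flag */
	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0,
					blk_rq_bytes(rq) >> SECTOR_SHIFT,
					req_op(rq), jiffies);
}

void nvme_mpath_end_request(struct request *rq)
{
	struct nvme_ns *ns = rq->q->queuedata;

	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
		return;
	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
			 nvme_req(rq)->start_time);
}

That way a request started while stats were enabled is still accounted
on completion, even if the sysfs knob is flipped in between.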

Instead of duplicating the accounting, could you just have the stats file report
the sum of its hidden devices?



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29  9:59     ` Sagi Grimberg
  2022-09-29 10:25       ` Max Gurtovoy
  2022-09-29 15:03       ` Keith Busch
@ 2022-09-29 15:05       ` Jens Axboe
  2022-09-29 16:25         ` Sagi Grimberg
  2 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2022-09-29 15:05 UTC (permalink / raw)
  To: Sagi Grimberg, Max Gurtovoy, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Hannes Reinecke

On 9/29/22 3:59 AM, Sagi Grimberg wrote:
>> 3. Do you have some performance numbers (we're touching the fast path here) ?
> 
> This is pretty light-weight, accounting is per-cpu and only wrapped by
> preemption disable. This is a very small price to pay for what we gain.

Is it? Enabling IO stats for normal devices has a very noticeable impact
on performance at the higher end of the scale. So much so that I've
contemplated how we can make this less expensive than it currently is.

-- 
Jens Axboe



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 10:04   ` Sagi Grimberg
@ 2022-09-29 15:07     ` Jens Axboe
  2022-10-03  8:38       ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2022-09-29 15:07 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Hannes Reinecke

On 9/29/22 4:04 AM, Sagi Grimberg wrote:
> index 9bacfd014e3d..f42e6e40d84b 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
>>      nvme_end_req_zoned(req);
>>      nvme_trace_bio_complete(req);
>>      blk_mq_end_request(req, status);
>> +    if (req->cmd_flags & REQ_NVME_MPATH)
>> +        nvme_mpath_end_request(req);
> 
> I guess the order should probably be reversed, because after
> blk_mq_end_request req may become invalid and create UAF?

Yes - blk_mq_end_request() will put the tag, it could be reused by the
time you call nvme_mpath_end_request(). It won't be a UAF as the
requests are allocated upfront and never freed, but the state will be
uncertain at that point.

-- 
Jens Axboe



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 15:03       ` Keith Busch
@ 2022-09-29 16:14         ` Sagi Grimberg
  2022-09-30 15:21           ` Keith Busch
  2022-09-29 16:32         ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-29 16:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke


>>> 3. Do you have some performance numbers (we're touching the fast path
>>> here) ?
>>
>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>> preemption disable. This is a very small price to pay for what we gain.
> 
> It does add up, though, and some environments disable stats to skip the
> overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
> assuming you need to account for stats.
> 
> Instead of duplicating the accounting, could you just have the stats file report
> the sum of its hidden devices?

Interesting...

How do you suggest we do that? .collect_stats() callout in fops?



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 15:05       ` Jens Axboe
@ 2022-09-29 16:25         ` Sagi Grimberg
  2022-09-30  0:08           ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-29 16:25 UTC (permalink / raw)
  To: Jens Axboe, Max Gurtovoy, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Hannes Reinecke


>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>
>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>> preemption disable. This is a very small price to pay for what we gain.
> 
> Is it? Enabling IO stats for normal devices has a very noticeable impact
> on performance at the higher end of the scale.

Interesting, I didn't think this would be that noticeable. How would
you quantify the impact in terms of %?

I don't have any insight on this for blk-mq, probably because I've never
seen any user turn IO stats off (or at least don't remember).

My (very limited) testing did not show any noticeable differences for
nvme-loop. All I'm saying is that we need to have IO stats for the mpath
device node. If there is a clever way to collect this from the hidden
devices just for nvme, great, but we need to expose these stats.

> So much so that I've contemplated how we can make this less expensive than it currently is.

Then nvme-mpath would benefit from that as well.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 15:03       ` Keith Busch
  2022-09-29 16:14         ` Sagi Grimberg
@ 2022-09-29 16:32         ` Sagi Grimberg
  2022-09-30 15:16           ` Keith Busch
  1 sibling, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-09-29 16:32 UTC (permalink / raw)
  To: Keith Busch
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke


>>> 3. Do you have some performance numbers (we're touching the fast path
>>> here) ?
>>
>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>> preemption disable. This is a very small price to pay for what we gain.
> 
> It does add up, though, and some environments disable stats to skip the
> overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
> assuming you need to account for stats.

But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself?
You mean disabling IO stats at runtime?



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 16:25         ` Sagi Grimberg
@ 2022-09-30  0:08           ` Jens Axboe
  2022-10-03  8:35             ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2022-09-30  0:08 UTC (permalink / raw)
  To: Sagi Grimberg, Max Gurtovoy, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Hannes Reinecke

On 9/29/22 10:25 AM, Sagi Grimberg wrote:
> 
>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>
>>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>>> preemption disable. This is a very small price to pay for what we gain.
>>
>> Is it? Enabling IO stats for normal devices has a very noticeable impact
>> on performance at the higher end of the scale.
> 
> Interesting, I didn't think this would be that noticeable. How much
> would you quantify the impact in terms of %?

If we take it to the extreme - my usual peak benchmark, which is drive
limited at 122M IOPS, runs at 113M IOPS if I have iostats enabled. If I
lower the queue depth (128 -> 16), then peak goes from 46M to 44M. Not
as dramatic, but still quite noticeable. This is just using a single
thread on a single CPU core per drive, so not throwing tons of CPU at
it.

Now, I have no idea how well nvme multipath currently scales or works.
Would be interesting to test that separately. But if you were to double
(or more, I guess 3x if you're doing the exposed device and then adding
stats to at least two below?) the overhead, that'd certainly not be
free.

> I don't have any insight on this for blk-mq, probably because I've never
> seen any user turn IO stats off (or at least don't remember).

Most people don't care, but some certainly do. As per the above, it's
noticeable enough that it makes a difference if you're chasing latencies
or peak performance.

> My (very limited) testing did not show any noticeable differences for
> nvme-loop. All I'm saying that we need to have IO stats for the mpath
> device node. If there is a clever way to collect this from the hidden
> devices just for nvme, great, but we need to expose these stats.

From a previous message, it sounds like that's just some qemu setup? Hard
to measure anything there with precision in my experience, and it's not
really peak performance territory either.

>> So much so that I've contemplated how we can make this less expensive
>> than it currently is.
> 
> Then nvme-mpath would benefit from that as well.

Yeah, it'd be a win all around for sure...

-- 
Jens Axboe



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 16:32         ` Sagi Grimberg
@ 2022-09-30 15:16           ` Keith Busch
  2022-10-03  8:02             ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2022-09-30 15:16 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke

On Thu, Sep 29, 2022 at 07:32:39PM +0300, Sagi Grimberg wrote:
> 
> > > > 3. Do you have some performance numbers (we're touching the fast path
> > > > here) ?
> > > 
> > > This is pretty light-weight, accounting is per-cpu and only wrapped by
> > > preemption disable. This is a very small price to pay for what we gain.
> > 
> > It does add up, though, and some environments disable stats to skip the
> > overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
> > assuming you need to account for stats.
> 
> But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself?
> You mean disable IO stats in runtime?

Yes, the user can disable it at any time. That actually makes things a bit
tricky since it can be enabled at the start of an IO and disabled by the time
it completes.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 16:14         ` Sagi Grimberg
@ 2022-09-30 15:21           ` Keith Busch
  2022-10-03  8:09             ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2022-09-30 15:21 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke

On Thu, Sep 29, 2022 at 07:14:23PM +0300, Sagi Grimberg wrote:
> 
> > > > 3. Do you have some performance numbers (we're touching the fast path
> > > > here) ?
> > > 
> > > This is pretty light-weight, accounting is per-cpu and only wrapped by
> > > preemption disable. This is a very small price to pay for what we gain.
> > 
> > It does add up, though, and some environments disable stats to skip the
> > overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
> > assuming you need to account for stats.
> > 
> > Instead of duplicating the accounting, could you just have the stats file report
> > the sum of its hidden devices?
> 
> Interesting...
> 
> How do you suggest we do that? .collect_stats() callout in fops?

Maybe, yeah. I think we'd need something to enumerate the HIDDEN disks that
make up the multipath device. Only the low-level driver can do that right now,
so perhaps either call into the driver to get all the block_device parts, or
the gendisk needs to maintain a list of those parts itself.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-30 15:16           ` Keith Busch
@ 2022-10-03  8:02             ` Sagi Grimberg
  2022-10-03  9:32               ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-10-03  8:02 UTC (permalink / raw)
  To: Keith Busch
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke


>>>>> 3. Do you have some performance numbers (we're touching the fast path
>>>>> here) ?
>>>>
>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>>>> preemption disable. This is a very small price to pay for what we gain.
>>>
>>> It does add up, though, and some environments disable stats to skip the
>>> overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
>>> assuming you need to account for stats.
>>
>> But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself?
>> You mean disable IO stats in runtime?
> 
> Yes, the user can disable it at any time. That actually makes things a bit
> tricky since it can be enabled at the start of an IO and disabled by the time
> it completes.

That is what blk_do_io_stat is for, we can definitely export that.
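
For reference, blk_do_io_stat() is roughly this (paraphrased from memory,
from block/blk.h, which is private to the block layer today):

static inline bool blk_do_io_stat(struct request *rq)
{
	return (rq->rq_flags & RQF_IO_STAT) && !blk_rq_is_passthrough(rq);
}

RQF_IO_STAT is latched from the queue's io_stat flag when the request is
set up, so the completion side stays consistent even if the sysfs knob
flips while the IO is in flight.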



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-30 15:21           ` Keith Busch
@ 2022-10-03  8:09             ` Sagi Grimberg
  2022-10-25 15:30               ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-10-03  8:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke


>>>>> 3. Do you have some performance numbers (we're touching the fast path
>>>>> here) ?
>>>>
>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>>>> preemption disable. This is a very small price to pay for what we gain.
>>>
>>> It does add up, though, and some environments disable stats to skip the
>>> overhead. At a minimum, you need to add a check for blk_queue_io_stat() before
>>> assuming you need to account for stats.
>>>
>>> Instead of duplicating the accounting, could you just have the stats file report
>>> the sum of its hidden devices?
>>
>> Interesting...
>>
>> How do you suggest we do that? .collect_stats() callout in fops?
> 
> Maybe, yeah. I think we'd need something to enumerate the HIDDEN disks that
> make up the multipath device. Only the low-level driver can do that right now,
> so perhaps either call into the driver to get all the block_device parts, or
> the gendisk needs to maintain a list of those parts itself.

I definitely don't think we want to propagate the device relationship to
blk-mq. But a callback to the driver also seems very niche to nvme
multipath, and it is also kinda messy to combine calculations like
iops/bw/latency accurately, since that depends on how submissions are
distributed across the bottom devices, which we would now need to track.

I'm leaning towards just moving forward with this and taking the
relatively small hit; if people absolutely care about the extra latency,
they can disable it altogether (upper and/or bottom devices).



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-30  0:08           ` Jens Axboe
@ 2022-10-03  8:35             ` Sagi Grimberg
  0 siblings, 0 replies; 23+ messages in thread
From: Sagi Grimberg @ 2022-10-03  8:35 UTC (permalink / raw)
  To: Jens Axboe, Max Gurtovoy, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Hannes Reinecke



On 9/30/22 03:08, Jens Axboe wrote:
> On 9/29/22 10:25 AM, Sagi Grimberg wrote:
>>
>>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>>
>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>>>> preemption disable. This is a very small price to pay for what we gain.
>>>
>>> Is it? Enabling IO stats for normal devices has a very noticeable impact
>>> on performance at the higher end of the scale.
>>
>> Interesting, I didn't think this would be that noticeable. How much
>> would you quantify the impact in terms of %?
> 
> If we take it to the extreme - my usual peak benchmark, which is drive
> limited at 122M IOPS, run at 113M IOPS if I have iostats enabled. If I
> lower the queue depth (128 -> 16), then peak goes from 46M to 44M. Not
> as dramatic, but still quite noticeable. This is just using a single
> thread on a single CPU core per drive, so not throwing tons of CPU at
> it.
> 
> Now, I have no idea how well nvme multipath currently scales or works.

Should be pretty scalable and efficient. There is no bio cloning and
the only shared state is an srcu wrapping the submission path and path
lookup.

> Would be interesting to test that separately. But if you were to double
> (or more, I guess 3x if you're doing the exposed device and then adding
> stats to at least two below?) the overhead, that'd certainly not be
> free.

It is not 3x; in the patch, nvme-multipath is accounting separately from
the bottom devices, so each request is accounted once for the bottom
device and once for the upper device.

But again, my working assumption is that IO stats must be exposed for
an nvme-multipath device (unless the user disabled them). So it is a
matter of whether we take a simple approach, where nvme-multipath does
"double" accounting, or we come up with a scheme that allows the driver
to collect stats on behalf of the block layer, and then add non-trivial
logic to combine stats like iops/bw/latency accurately from the bottom
devices.

My vote would be to go with the former.

>> I don't have any insight on this for blk-mq, probably because I've never
>> seen any user turn IO stats off (or at least don't remember).
> 
> Most people don't care, but some certainly do. As per the above, it's
> noticeable enough that it makes a difference if you're chasing latencies
> or peak performance.
> 
>> My (very limited) testing did not show any noticeable differences for
>> nvme-loop. All I'm saying that we need to have IO stats for the mpath
>> device node. If there is a clever way to collect this from the hidden
>> devices just for nvme, great, but we need to expose these stats.
> 
>   From a previous message, sounds like that's just some qemu setup? Hard
> to measure anything there with precision in my experience, and it's not
> really peak performance territory either.

It's not qemu, it is null_blk exported over nvme-loop (nvmet loop
device), so it is faster, but definitely not something that can provide
insight into the behavior of real HW.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-09-29 15:07     ` Jens Axboe
@ 2022-10-03  8:38       ` Sagi Grimberg
  0 siblings, 0 replies; 23+ messages in thread
From: Sagi Grimberg @ 2022-10-03  8:38 UTC (permalink / raw)
  To: Jens Axboe, linux-nvme
  Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-block,
	Hannes Reinecke


>> index 9bacfd014e3d..f42e6e40d84b 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -385,6 +385,8 @@ static inline void nvme_end_req(struct request *req)
>>>      nvme_end_req_zoned(req);
>>>      nvme_trace_bio_complete(req);
>>>      blk_mq_end_request(req, status);
>>> +    if (req->cmd_flags & REQ_NVME_MPATH)
>>> +        nvme_mpath_end_request(req);
>>
>> I guess the order should probably be reversed, because after
>> blk_mq_end_request req may become invalid and create UAF?
> 
> Yes - blk_mq_end_request() will put the tag, it could be reused by the
> time you call nvme_mpath_end_request(). It won't be a UAF as the
> requests are allocated upfront and never freed, but the state will be
> uncertain at that point.

Will reverse that...



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-10-03  8:02             ` Sagi Grimberg
@ 2022-10-03  9:32               ` Sagi Grimberg
  0 siblings, 0 replies; 23+ messages in thread
From: Sagi Grimberg @ 2022-10-03  9:32 UTC (permalink / raw)
  To: Keith Busch
  Cc: Max Gurtovoy, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke



On 10/3/22 11:02, Sagi Grimberg wrote:
> 
>>>>>> 3. Do you have some performance numbers (we're touching the fast path
>>>>>> here) ?
>>>>>
>>>>> This is pretty light-weight, accounting is per-cpu and only wrapped by
>>>>> preemption disable. This is a very small price to pay for what we 
>>>>> gain.
>>>>
>>>> It does add up, though, and some environments disable stats to skip the
>>>> overhead. At a minimum, you need to add a check for 
>>>> blk_queue_io_stat() before
>>>> assuming you need to account for stats.
>>>
>>> But QUEUE_FLAG_IO_STAT is set by nvme-mpath itself?
>>> You mean disable IO stats in runtime?
>>
>> Yes, the user can disable it at any time. That actually makes things a 
>> bit
>> tricky since it can be enabled at the start of an IO and disabled by 
>> the time
>> it completes.
> 
> That is what blk_do_io_stat is for, we can definitely export that.

Actually, blk_do_io_stat refers to the bottom request queue. You're
right, we can do the same trick for nvme.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-10-03  8:09             ` Sagi Grimberg
@ 2022-10-25 15:30               ` Christoph Hellwig
  2022-10-25 15:58                 ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-10-25 15:30 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Max Gurtovoy, linux-nvme, Christoph Hellwig,
	Chaitanya Kulkarni, linux-block, Jens Axboe, Hannes Reinecke

On Mon, Oct 03, 2022 at 11:09:06AM +0300, Sagi Grimberg wrote:
>> make up the multipath device. Only the low-level driver can do that right now,
>> so perhaps either call into the driver to get all the block_device parts, or
>> the gendisk needs to maintain a list of those parts itself.
>
> I definitely don't think we want to propagate the device relationship to
> blk-mq. But a callback to the driver also seems very niche to nvme 
> multipath and is also kinda messy to combine calculations like
> iops/bw/latency accurately which depends on the submission distribution
> to the bottom devices which we would need to track now.
>
> I'm leaning towards just moving forward with this, take the relatively
> small hit, and if people absolutely care about the extra latency, then
> they can disable it altogether (upper and/or bottom devices).

So looking at the patches, I'm really not a big fan of the extra
accounting calls, and especially the start_time field in the
nvme_request, and even more so the special start/end calls in all
the transport drivers.

The stats sysfs attributes already have the entirely separate
blk-mq vs bio based code paths.  So I think having a block_device
operation that replaces part_stat_read_all, which allows nvme to
iterate over all paths and collect the numbers, would seem
a lot nicer.  There might be some caveats, like having to stash
away the numbers for disappearing paths, though.
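
On the nvme side that could look roughly like the sketch below (the
callout itself is hypothetical, and part_stat_read_all is currently
private to block/genhd.c, so it would need to be made available):

/* hypothetical per-gendisk stats callout implemented by nvme-multipath */
static void nvme_ns_head_collect_io_stats(struct nvme_ns_head *head,
					  struct disk_stats *stat)
{
	struct nvme_ns *ns;
	int srcu_idx, i;

	memset(stat, 0, sizeof(*stat));
	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(ns, &head->list, siblings) {
		struct disk_stats path_stat;

		part_stat_read_all(ns->disk->part0, &path_stat);
		for (i = 0; i < NR_STAT_GROUPS; i++) {
			stat->ios[i]     += path_stat.ios[i];
			stat->sectors[i] += path_stat.sectors[i];
			stat->nsecs[i]   += path_stat.nsecs[i];
		}
		/* merges, io_ticks and in_flight omitted for brevity */
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
}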



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-10-25 15:30               ` Christoph Hellwig
@ 2022-10-25 15:58                 ` Sagi Grimberg
  2022-10-30 16:22                   ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2022-10-25 15:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Max Gurtovoy, linux-nvme, Chaitanya Kulkarni,
	linux-block, Jens Axboe, Hannes Reinecke


>>> make up the multipath device. Only the low-level driver can do that right now,
>>> so perhaps either call into the driver to get all the block_device parts, or
>>> the gendisk needs to maintain a list of those parts itself.
>>
>> I definitely don't think we want to propagate the device relationship to
>> blk-mq. But a callback to the driver also seems very niche to nvme
>> multipath and is also kinda messy to combine calculations like
>> iops/bw/latency accurately which depends on the submission distribution
>> to the bottom devices which we would need to track now.
>>
>> I'm leaning towards just moving forward with this, take the relatively
>> small hit, and if people absolutely care about the extra latency, then
>> they can disable it altogether (upper and/or bottom devices).
> 
> So looking at the patches I'm really not a big fan of the extra
> accounting calls, and especially the start_time field in the
> nvme_request

Don't love it either.

> and even more so the special start/end calls in all
> the transport drivers.

The end is centralized, and the start part is not really sprinkled into
the drivers beyond a one-line substitution. I don't think it's bad.

> the stats sysfs attributes already have the entirely separate
> blk-mq vs bio based code pathes.  So I think having a block_device
> operation that replaces part_stat_read_all which allows nvme to
> iterate over all pathes and collect the numbers would seem
> a lot nicer.  There might be some caveats like having to stash
> away the numbers for disappearing paths, though.

You think this is better? Really? I don't agree with you; I think it's
better to pay a small cost than to do this very specialized thing that
will only ever be used for nvme-mpath.



* Re: [PATCH rfc] nvme: support io stats on the mpath device
  2022-10-25 15:58                 ` Sagi Grimberg
@ 2022-10-30 16:22                   ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-10-30 16:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, Max Gurtovoy, linux-nvme,
	Chaitanya Kulkarni, linux-block, Jens Axboe, Hannes Reinecke

On Tue, Oct 25, 2022 at 06:58:19PM +0300, Sagi Grimberg wrote:
>> and even more so the special start/end calls in all
>> the transport drivers.
>
> The end is centralized and the start part has not sprinkled to
> the drivers. I don't think its bad.

Well.  We need a new magic helper instead of blk_mq_start_request,
and a new call to nvme_mpath_end_request in the lower driver to
support functionality in the multipath driver that sits above them.
This is because of the hack of storing the start_time in the
nvme_request, which is really owned by the lower driver, and quite
a bit of a layering violation.

If the multipath driver simply did the start and end itself, things
would be a lot better.  The upside of that would be that it also
accounts for the (tiny) overhead of the mpath driver.  The big
downside would be that we'd have to allocate memory just for the
start_time, as nvme-multipath has no per-I/O data structure of its
own.  In a way it would be nice to just have a start_time in
the bio, which would clean up the interface a lot, and
also massively simplify the I/O accounting in md.  But Jens might
not be willing to grow the bio for this special case, even if some
things in the bio seem even more obscure.

>> the stats sysfs attributes already have the entirely separate
>> blk-mq vs bio based code pathes.  So I think having a block_device
>> operation that replaces part_stat_read_all which allows nvme to
>> iterate over all pathes and collect the numbers would seem
>> a lot nicer.  There might be some caveats like having to stash
>> away the numbers for disappearing paths, though.
>
> You think this is better? really? I don't agree with you, I think its
> better to pay a small cost than doing this very specialized thing that
> will only ever be used for nvme-mpath.

Yes, I think a callout at least conceptually would be much better.


