All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nvme-multipath: add an 'ana_groups_only' module option
@ 2022-02-07 10:00 Hannes Reinecke
  2022-02-07 12:37 ` Sagi Grimberg
  2022-02-09  8:07 ` Christoph Hellwig
  0 siblings, 2 replies; 9+ messages in thread
From: Hannes Reinecke @ 2022-02-07 10:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

On large installations the ANA log buffer can be exceedingly large;
we've come across a controller with 49 ANA Group Descriptors and
65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
And this is just to validate that the namespace ID is _really_listed
in the log page.
So to avoid an overly large memory allocation we can leverage the
'RGO' bit when retrieving the ANA log page, and check whether the
ANA group ID from the namespace is found in the ANA descriptors.
That cuts down the memory allocation, and provides the same result.
But to be on the safe side I've added a module option 'ana_groups_only'
to switch between modes.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/multipath.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 7f2071f2460c..bffa56c4fc83 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -13,6 +13,11 @@ module_param(multipath, bool, 0444);
 MODULE_PARM_DESC(multipath,
 	"turn on native support for multiple controllers per subsystem");
 
+static bool ana_groups_only = false;
+module_param(ana_groups_only, bool, 0644);
+MODULE_PARM_DESC(ana_groups_only,
+		 "Retrieve ANA Log page with groups only (RGO bit set)");
+
 void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
 {
 	struct nvme_ns_head *h;
@@ -556,13 +561,14 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
 	for (i = 0; i < le16_to_cpu(ctrl->ana_log_buf->ngrps); i++) {
 		struct nvme_ana_group_desc *desc = base + offset;
 		u32 nr_nsids;
-		size_t nsid_buf_size;
+		size_t nsid_buf_size = 0;
 
 		if (WARN_ON_ONCE(offset > ctrl->ana_log_size - sizeof(*desc)))
 			return -EINVAL;
 
 		nr_nsids = le32_to_cpu(desc->nnsids);
-		nsid_buf_size = flex_array_size(desc, nsids, nr_nsids);
+		if (nr_nsids)
+			nsid_buf_size = flex_array_size(desc, nsids, nr_nsids);
 
 		if (WARN_ON_ONCE(desc->grpid == 0))
 			return -EINVAL;
@@ -617,8 +623,17 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
 	if (desc->state == NVME_ANA_CHANGE)
 		(*nr_change_groups)++;
 
-	if (!nr_nsids)
+	if (!nr_nsids) {
+		if (!ana_groups_only)
+			return 0;
+		down_read(&ctrl->namespaces_rwsem);
+		list_for_each_entry(ns, &ctrl->namespaces, list) {
+			if (ns->ana_grpid == le32_to_cpu(desc->grpid))
+				nvme_update_ns_ana_state(desc, ns);
+		}
+		up_read(&ctrl->namespaces_rwsem);
 		return 0;
+	}
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
@@ -644,7 +659,8 @@ static int nvme_read_ana_log(struct nvme_ctrl *ctrl)
 	int error;
 
 	mutex_lock(&ctrl->ana_lock);
-	error = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_ANA, 0, NVME_CSI_NVM,
+	error = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_ANA,
+			ana_groups_only ? NVME_ANA_LOG_RGO : 0, NVME_CSI_NVM,
 			ctrl->ana_log_buf, ctrl->ana_log_size, 0);
 	if (error) {
 		dev_warn(ctrl->device, "Failed to get ANA log: %d\n", error);
@@ -855,8 +871,10 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	ctrl->anagrpmax = le32_to_cpu(id->anagrpmax);
 
 	ana_log_size = sizeof(struct nvme_ana_rsp_hdr) +
-		ctrl->nanagrpid * sizeof(struct nvme_ana_group_desc) +
-		ctrl->max_namespaces * sizeof(__le32);
+		ctrl->nanagrpid * sizeof(struct nvme_ana_group_desc);
+	if (!ana_groups_only)
+		ana_log_size += ctrl->max_namespaces * sizeof(__le32);
+
 	if (ana_log_size > max_transfer_size) {
 		dev_err(ctrl->device,
 			"ANA log page size (%zd) larger than MDTS (%zd).\n",
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-07 10:00 [PATCH] nvme-multipath: add an 'ana_groups_only' module option Hannes Reinecke
@ 2022-02-07 12:37 ` Sagi Grimberg
  2022-02-07 13:00   ` Hannes Reinecke
  2022-02-09  8:07 ` Christoph Hellwig
  1 sibling, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2022-02-07 12:37 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


> On large installations the ANA log buffer can be exceedingly large;
> we've come across a controller with 49 ANA Group Descriptors and
> 65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
> And this is just to validate that the namespace ID is _really_listed
> in the log page.
> So to avoid an overly large memory allocation we can leverage the
> 'RGO' bit when retrieving the ANA log page, and check whether the
> ANA group ID from the namespace is found in the ANA descriptors.
> That cuts down the memory allocation, and provides the same result.
> But to be on the safe side I've added a module option 'ana_groups_only'
> to switch between modes.

Hannes, why not a sysfs writable attribute? One can add a udev rule if
it wants a one-shot all controllers...

Although maybe that will bring us to yet another attribute that is
unstable until the controller identifies...

I think we are getting more indication that the controller device
creation is sitting in the wrong place...


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-07 12:37 ` Sagi Grimberg
@ 2022-02-07 13:00   ` Hannes Reinecke
  0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2022-02-07 13:00 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 2/7/22 13:37, Sagi Grimberg wrote:
> 
>> On large installations the ANA log buffer can be exceedingly large;
>> we've come across a controller with 49 ANA Group Descriptors and
>> 65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
>> And this is just to validate that the namespace ID is _really_listed
>> in the log page.
>> So to avoid an overly large memory allocation we can leverage the
>> 'RGO' bit when retrieving the ANA log page, and check whether the
>> ANA group ID from the namespace is found in the ANA descriptors.
>> That cuts down the memory allocation, and provides the same result.
>> But to be on the safe side I've added a module option 'ana_groups_only'
>> to switch between modes.
> 
> Hannes, why not a sysfs writable attribute? One can add a udev rule if
> it wants a one-shot all controllers...
> 
'tis a bit late, innit?
By the time we manage to write to the sysfs attribute (which, out of 
necessity, will be a controller attribute) the controller is already 
present, and the ANA log will already be read ...

> Although maybe that will bring us to yet another attribute that is
> unstable until the controller identifies...
> 
Yep.

> I think we are getting more indication that the controller device
> creation is sitting in the wrong place...

As mentioned earlier: main problem is that we need the controller device 
to initiate a controller removal. And as there is a possibility that the 
device initialisation _itself_ goes wrong we might need to tear down the
controller even before initialisation is done.
Which currently requires the sysfs attribute.
If we manage to figure out how to do that without the sysfs attribute
we should be fine.

Not that it'll help us in this case, though.
The only way out of _this_ issue is to pass in the configuration for
controller setup, ie adding a new argument to the fabrics configuration 
(eg 'ana_groups_only=1' or somesuch).
But I'm not sure if we want to go that way ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-07 10:00 [PATCH] nvme-multipath: add an 'ana_groups_only' module option Hannes Reinecke
  2022-02-07 12:37 ` Sagi Grimberg
@ 2022-02-09  8:07 ` Christoph Hellwig
  2022-02-09  8:49   ` Hannes Reinecke
  2022-02-10  2:52   ` John Meneghini
  1 sibling, 2 replies; 9+ messages in thread
From: Christoph Hellwig @ 2022-02-09  8:07 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme

On Mon, Feb 07, 2022 at 11:00:05AM +0100, Hannes Reinecke wrote:
> On large installations the ANA log buffer can be exceedingly large;
> we've come across a controller with 49 ANA Group Descriptors and
> 65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
> And this is just to validate that the namespace ID is _really_listed
> in the log page.
> So to avoid an overly large memory allocation we can leverage the
> 'RGO' bit when retrieving the ANA log page, and check whether the
> ANA group ID from the namespace is found in the ANA descriptors.
> That cuts down the memory allocation, and provides the same result.
> But to be on the safe side I've added a module option 'ana_groups_only'
> to switch between modes.

How is this supposed to work?  We'll fail to see what namespaces
the change applies to.

So in doubt fix the controller config to be less broken (and say hello
to NetApp and explain them they do not need more namespace for more
performance), and if that fails switch to a vmalloc allocation for
the buffer.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-09  8:07 ` Christoph Hellwig
@ 2022-02-09  8:49   ` Hannes Reinecke
  2022-02-09 13:57     ` Christoph Hellwig
  2022-02-10  2:52   ` John Meneghini
  1 sibling, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2022-02-09  8:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme

On 2/9/22 09:07, Christoph Hellwig wrote:
> On Mon, Feb 07, 2022 at 11:00:05AM +0100, Hannes Reinecke wrote:
>> On large installations the ANA log buffer can be exceedingly large;
>> we've come across a controller with 49 ANA Group Descriptors and
>> 65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
>> And this is just to validate that the namespace ID is _really_listed
>> in the log page.
>> So to avoid an overly large memory allocation we can leverage the
>> 'RGO' bit when retrieving the ANA log page, and check whether the
>> ANA group ID from the namespace is found in the ANA descriptors.
>> That cuts down the memory allocation, and provides the same result.
>> But to be on the safe side I've added a module option 'ana_groups_only'
>> to switch between modes.
> 
> How is this supposed to work?  We'll fail to see what namespaces
> the change applies to.
> 
How so?
I'm just reverting the check; instead of having the nsid listed in the 
ana log page and check for a match between the nsids we now check for a 
match between the group id in the ana log page and the ana grpid from 
the 'identify namespace' output.
So I'm still able to identify which namespace this change belongs to.

> So in doubt fix the controller config to be less broken (and say hello
> to NetApp and explain them they do not need more namespace for more
> performance), and if that fails switch to a vmalloc allocation for
> the buffer.

Curiously it was _not_ NetApp, but I'll happily explain this to the 
reporters :-)

But I'll have a look a vmalloc anyway.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-09  8:49   ` Hannes Reinecke
@ 2022-02-09 13:57     ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2022-02-09 13:57 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme

On Wed, Feb 09, 2022 at 09:49:16AM +0100, Hannes Reinecke wrote:
> How so?
> I'm just reverting the check; instead of having the nsid listed in the ana 
> log page and check for a match between the nsids we now check for a match 
> between the group id in the ana log page and the ana grpid from the 
> 'identify namespace' output.
> So I'm still able to identify which namespace this change belongs to.

And how this this catching namespaces that move between ANA groups?

How does it deal with the hang fixed by Anton in commit af8fd0424713?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-09  8:07 ` Christoph Hellwig
  2022-02-09  8:49   ` Hannes Reinecke
@ 2022-02-10  2:52   ` John Meneghini
  2022-02-10  5:43     ` Christoph Hellwig
  2022-02-10  8:17     ` Hannes Reinecke
  1 sibling, 2 replies; 9+ messages in thread
From: John Meneghini @ 2022-02-10  2:52 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke; +Cc: Sagi Grimberg, Keith Busch, linux-nvme

On 2/9/22 03:07, Christoph Hellwig wrote:
> On Mon, Feb 07, 2022 at 11:00:05AM +0100, Hannes Reinecke wrote:
>> On large installations the ANA log buffer can be exceedingly large;
>> we've come across a controller with 49 ANA Group Descriptors and
>> 65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
>> And this is just to validate that the namespace ID is _really_listed
>> in the log page.
>> So to avoid an overly large memory allocation we can leverage the
>> 'RGO' bit when retrieving the ANA log page, and check whether the
>> ANA group ID from the namespace is found in the ANA descriptors.
>> That cuts down the memory allocation, and provides the same result.
>> But to be on the safe side I've added a module option 'ana_groups_only'
>> to switch between modes.
> 
> How is this supposed to work?  We'll fail to see what namespaces
> the change applies to.
> 
> So in doubt fix the controller config to be less broken (and say hello
> to NetApp and explain them they do not need more namespace for more
> performance), and if that fails switch to a vmalloc allocation for
> the buffer.

I agree with Christoph.  I don't see the point in supporting 65536 namespaces across 49 ana groups or controllers. The problem 
here is: the vendor is trying to turn NVMe into SCSI.

Moreover, I don't understand how implementing this as a MODULE_PARM is supposed to work.  If you configure this module parameter 
on it assumes all NVMe-oF arrays connected to the host support RGO. What's really needed here is some kind of protocol mechanism 
that will allow the host to dynamically discovery if RGO is supported on a controller by controller basis.

And this isn't an NetApp array. Just look at who's asking for this change if you want a clue as to what NVMe-oF array is asking 
for this.  I know for a fact this isn't a NetApp array.

/John



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-10  2:52   ` John Meneghini
@ 2022-02-10  5:43     ` Christoph Hellwig
  2022-02-10  8:17     ` Hannes Reinecke
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2022-02-10  5:43 UTC (permalink / raw)
  To: John Meneghini
  Cc: Christoph Hellwig, Hannes Reinecke, Sagi Grimberg, Keith Busch,
	linux-nvme

On Wed, Feb 09, 2022 at 09:52:45PM -0500, John Meneghini wrote:
> I agree with Christoph.  I don't see the point in supporting 65536 
> namespaces across 49 ana groups or controllers. The problem here is: the 
> vendor is trying to turn NVMe into SCSI.

I'm not saying we should not support it.  We can trivially do that by
using a vmalloc buffer.  I don't think we should optimize for it, and
we certainly should not break behaviour which I think this does.

>
> Moreover, I don't understand how implementing this as a MODULE_PARM is 
> supposed to work.  If you configure this module parameter on it assumes all 
> NVMe-oF arrays connected to the host support RGO. What's really needed here 
> is some kind of protocol mechanism that will allow the host to dynamically 
> discovery if RGO is supported on a controller by controller basis.

RGO support is mandatory and has been there since the first released
ANA TP.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] nvme-multipath: add an 'ana_groups_only' module option
  2022-02-10  2:52   ` John Meneghini
  2022-02-10  5:43     ` Christoph Hellwig
@ 2022-02-10  8:17     ` Hannes Reinecke
  1 sibling, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2022-02-10  8:17 UTC (permalink / raw)
  To: John Meneghini, Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme

On 2/10/22 03:52, John Meneghini wrote:
> On 2/9/22 03:07, Christoph Hellwig wrote:
>> On Mon, Feb 07, 2022 at 11:00:05AM +0100, Hannes Reinecke wrote:
>>> On large installations the ANA log buffer can be exceedingly large;
>>> we've come across a controller with 49 ANA Group Descriptors and
>>> 65536 namespaces, resulting in an ANA buffer with an order-7 allocation.
>>> And this is just to validate that the namespace ID is _really_listed
>>> in the log page.
>>> So to avoid an overly large memory allocation we can leverage the
>>> 'RGO' bit when retrieving the ANA log page, and check whether the
>>> ANA group ID from the namespace is found in the ANA descriptors.
>>> That cuts down the memory allocation, and provides the same result.
>>> But to be on the safe side I've added a module option 'ana_groups_only'
>>> to switch between modes.
>>
>> How is this supposed to work?  We'll fail to see what namespaces
>> the change applies to.
>>
>> So in doubt fix the controller config to be less broken (and say hello
>> to NetApp and explain them they do not need more namespace for more
>> performance), and if that fails switch to a vmalloc allocation for
>> the buffer.
> 
> I agree with Christoph.  I don't see the point in supporting 65536 
> namespaces across 49 ana groups or controllers. The problem here is: the 
> vendor is trying to turn NVMe into SCSI.
> 
> Moreover, I don't understand how implementing this as a MODULE_PARM is 
> supposed to work.  If you configure this module parameter on it assumes 
> all NVMe-oF arrays connected to the host support RGO. What's really 
> needed here is some kind of protocol mechanism that will allow the host 
> to dynamically discovery if RGO is supported on a controller by 
> controller basis.
> 
> And this isn't an NetApp array. Just look at who's asking for this 
> change if you want a clue as to what NVMe-oF array is asking for this.  
> I know for a fact this isn't a NetApp array.
> 
Indeed, you are correct. Surprisingly we have more customers/partners 
implementing NVMe :-)

All things considered I guess I'll have to go with the kvmalloc approach.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-02-10  8:18 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-07 10:00 [PATCH] nvme-multipath: add an 'ana_groups_only' module option Hannes Reinecke
2022-02-07 12:37 ` Sagi Grimberg
2022-02-07 13:00   ` Hannes Reinecke
2022-02-09  8:07 ` Christoph Hellwig
2022-02-09  8:49   ` Hannes Reinecke
2022-02-09 13:57     ` Christoph Hellwig
2022-02-10  2:52   ` John Meneghini
2022-02-10  5:43     ` Christoph Hellwig
2022-02-10  8:17     ` Hannes Reinecke

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.