* NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
@ 2022-02-06 13:59 Alex Talker
  2022-02-07  9:46 ` Hannes Reinecke
  2022-02-07 11:07 ` Sagi Grimberg
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Talker @ 2022-02-06 13:59 UTC (permalink / raw)
  To: linux-nvme

Recently I noticed a peculiar error after connecting from a host
(CentOS 8 Stream at the time, more on that below)
via TCP (which likely doesn't matter) to an NVMe target subsystem
exported by the nvmet module:

 > ...
 > nvme nvme1: ANATT timeout, resetting controller.
 > nvme nvme1: creating 8 I/O queues.
 > nvme nvme1: mapped 8/0/0 default/read/poll queues.
 > ...
 > nvme nvme1: ANATT timeout, resetting controller.
 > ...(and it continues like that over and over; on some configurations it
even gets worse with each reconnect iteration)

I discovered that this behavior is caused by code in
drivers/nvme/host/multipath.c,
in particular by the way nvme_update_ana_state() increments the
nr_change_groups counter whenever any ANA Group is in the "change" state,
regardless of whether any namespace belongs to that group or not.
After figuring out that ANATT stands for ANA Transition Time and reading
some more of the NVMe 2.0 standard, I understood that the problem is
caused by how I happen to use ANA Groups.
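For reference, the relevant host logic boils down to roughly the following
(a paraphrased sketch of nvme_update_ana_state(), simplified from memory,
not the verbatim upstream code):

/*
 * Paraphrased sketch of nvme_update_ana_state() from
 * drivers/nvme/host/multipath.c (simplified, not the exact upstream code).
 * The caller walks the ANA log page, invokes this once per group
 * descriptor, and arms the ANATT timer if nr_change_groups ends up
 * non-zero.
 */
static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
		struct nvme_ana_group_desc *desc, void *data)
{
	u32 nr_nsids = le32_to_cpu(desc->nnsids);
	unsigned int *nr_change_groups = data;

	/* counted for every group in "change", even one with no namespaces... */
	if (desc->state == NVME_ANA_CHANGE)
		(*nr_change_groups)++;

	/* ...and only after that are empty groups skipped */
	if (!nr_nsids)
		return 0;

	/* per-namespace state updates for the group's NSIDs follow here */
	return 0;
}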

As far as I remember, the permitted number of ANA Groups in the nvmet
module is 128, while the maximum number of namespaces is 1024 (8 times
more).
Thus, mapping 1 namespace to 1 ANA Group works only up to a point.
It is nice to have logically related namespaces belong to the same ANA
Group,
and the final scheme of how namespaces map to ANA Groups is often
vendor-specific
(or rather lies in the decision domain of the end user of the target).
However, rather than changing the state of a namespace on a specific port,
for example for maintenance reasons,
I find it particularly useful to use ANA Groups to change the state of a
single namespace, since it is more likely that its block device enters an
unusable state or becomes part of some transition.
Thus, the simplest scheme for me is to create a few ANA Groups on each
port, one per possible ANA state, and change the ANA Group of a namespace
rather than changing the state of the group the namespace currently
belongs to.
And here's the catch.

If one creates a subsystem (no namespaces needed) on a port, connects to
it and then sets the state of ANA Group #1 to "change", the issue
described at the beginning is reproduced on practically every major
distro and even on upstream code.
Sometimes it can be mitigated by disabling "native multipath"
(setting /sys/module/nvme_core/parameters/multipath to N), but sometimes
it cannot, which is what makes this issue quite annoying for my setup.
I just checked it on 5.15.16 from Manjaro (basically Arch Linux) and on
ELRepo's kernel-ml and kernel-lt (basically vanilla builds of the mainline
and LTS kernels, respectively, for CentOS).

The standard says that:

 > An ANA Group may contain zero or more namespaces

which makes perfect sense, since one has to create a group prior to 
assigning it to a namespace, and then:

 > While ANA Change state is reported by a controller for the namespace, 
the host should: ...(part regarding ANATT)

So on one hand I think my setup might be questionable (I could allocate
an ANAGRPID for "change" only during actual transitions, though that
might over-complicate usage of the module);
on the other hand, I think the host code misinterprets the standard here,
and the wording might need some additional clarification.

That's why I decided to compose this message first prior to proposing 
any patches.

Also, while digging through the code, I noticed that ANATT is currently
reported as an arbitrary constant (10 seconds), and since the transition
time often differs depending on the block devices in use underneath the
namespaces,
it might be worthwhile to let the end user change this value via configfs.
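Roughly, what I have in mind is something along these lines (a rough
sketch only, loosely modeled on the existing nvmet subsystem configfs
attributes; the anatt field on nvmet_subsys does not exist today, so all
of the names and wiring below are illustrative):

/*
 * Hypothetical per-subsystem "anatt" attribute for nvmet's configfs.
 * Sketch only: nvmet_subsys->anatt is made up for illustration, and the
 * identify-controller handler would have to report it instead of the
 * current hard-coded value.
 */
static ssize_t nvmet_subsys_attr_anatt_show(struct config_item *item,
		char *page)
{
	return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->anatt);
}

static ssize_t nvmet_subsys_attr_anatt_store(struct config_item *item,
		const char *page, size_t count)
{
	u8 anatt;

	if (kstrtou8(page, 0, &anatt))
		return -EINVAL;
	to_subsys(item)->anatt = anatt;
	return count;
}
CONFIGFS_ATTR(nvmet_subsys_, attr_anatt);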

Considering everything I wrote, I'd like to hear opinions on the
following questions:
1. Is my use of ANA Groups a viable approach?
2. Which ANA Group assignment schemes are used in production, in your
experience?
3. Should changing the ANATT value be allowed via configfs (in particular,
at the per-subsystem level, I think)?

Thanks for reading till the very end! Hope I didn't ramble too much; I
just wanted to lay out all of the details.

Best regards,
Alex



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-06 13:59 NVMe over Fabrics host: behavior on presence of ANA Group in "change" state Alex Talker
@ 2022-02-07  9:46 ` Hannes Reinecke
  2022-02-07 15:04   ` Alex Talker
  2022-02-07 11:07 ` Sagi Grimberg
  1 sibling, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2022-02-07  9:46 UTC (permalink / raw)
  To: Alex Talker, linux-nvme

On 2/6/22 14:59, Alex Talker wrote:
> Recently I noticed a peculiar error after connecting from the host
> (CentOS 8 Stream at the time, more on that below)
> via TCP(unlikely matters) to the NVMe target subsystem shared using 
> nvmet module:
> 
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > nvme nvme1: creating 8 I/O queues.
>  > nvme nvme1: mapped 8/0/0 default/read/poll queues.
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > ...(and it continues like that over and over and over again, on some 
> configuration even getting worse with greater iterations of reconnect)
> 
> I discovered that this behavior is caused by code in 
> drivers/nvme/host/multipath.c,
> in particular when function nvme_update_ana_state increments value of 
> variable nr_change_groups whenever any ANA Group is in "change",
> indifference of whether any namespace belongs to the group or not.
> Now, after figuring out that ANATT stands for ANA Transition Time and 
> reading some more of the NVMe 2.0 standards, I understood that the 
> problem caused by how I managed to utilize ANA Groups.
> 
> As far as I remember, permitted number of ANA Groups in nvmet module is 
> 128, while maximum number of namespaces is 1024(8 times more).
> Thus, mapping 1 namespace to 1 ANA Group works only up to a point.
> It is nice to have some logically-related namespaces belong to the same 
> ANA Group,
> and the final scheme of how namespaces belong to ANA groups is often 
> vendor-specific
> (or rather lies in decision domain of the end user of target-related 
> stuff),
> However, rather than changing state of a namespace on specific port, for 
> example for maintenance reasons,
> I find it particularly useful to utilize ANA Groups to change the state 
> of a certain namespace, since it is more likely that block device might 
> enter unusable state or be a part of some transitioning process.
> Thus, the simplest scheme for me on each port is to assign few ANA 
> Groups, one per each possible ANA state, and change ANA Group on a 
> namespace rather than changing state of the group the namespace belongs 
> to at the moment.
> And here's the catch.
> 
> If one creates a subsystem(no namespaces needed) on a port, connects to 
> it and then sets state of ANA Group #1 to "change", the issue introduced 
> in the beginning would be reproduced practically on many major distros 
> and even upstream code without and issue,
> tho sometimes it can be mitigated by disabling the "native 
> multipath"(when /sys/module/nvme_core/parameters/multipath set to N) but 
> sometimes that's not the case which is why this issue quite annoying for 
> my setup.
> I just checked it on 5.15.16 from Manjaro(basically Arch Linux) and 
> ELRepo's kernel-ml and kernel-lt(basically vanilla versions of the 
> mainline and LTS kernels respectively for CentOSs).
> 
> The standard tells that:
> 
>  > An ANA Group may contain zero or more namespaces
> 
> which makes perfect sense, since one has to create a group prior to 
> assigning it to a namespace, and then:
> 
>  > While ANA Change state is reported by a controller for the namespace, 
> the host should: ...(part regarding ANATT)
> 
> So on one hand I think my setup might be questionable(I might allocate 
> ANAGRPID for "change" only in times of actual transitions, while that 
> might over-complicate usage of the module),
> on the other I think it happens to be a misinterpretation of the 
> standard and might need some additional clarification.
> 
That's actually a misinterpretation.
The above sentence refers to a device reporting to be in ANA 'change',
i.e. after reading the ANA log and detecting that a given namespace is in
a group whose ANA status is 'change'.

In your case it might be feasible to not report 'change' at all, but
rather do a direct transition from one group to the other.
If it's just a single namespace the transition should be atomic, and
hence there won't be any synchronisation issues which might warrant a
'change' state.

> That's why I decided to compose this message first prior to proposing 
> any patches.
> 
> Also, while digging the code, I noticed that ANATT at the moment 
> presented by a random constant(of 10 seconds), and since often 
> transition time differs depending on block devices being in-use 
> underneath namespaces,
> it might be viable to allow end-user to change this value via configfs.
> 
> Considering everything I wrote, I'd like to hear opinions on the 
> following issues:
> 1. Whether my utilization of ANA Groups is viable approach?

Well, it certainly is an odd one, but it should be doable.
But note, there have been some fixes to the ANA group ID handling;
most recently commit 79f528afa939 ("nvme-multipath: fix ANA state
updates when a namespace is not present").
So do make sure you have the latest fixes to get the 'best' possible
user experience.

> 2. Which ANA Group assignment schemes utilized in production, from your 
> experience?

Typically it's the NVMe controller port which holds the ANA state; for
most implementations I'm aware of you have one or more (physical) NVMe
controller ports, which host the interfaces etc.
They connect to the actual storage, and failover is done by switching
I/O between those controller ports.
Hence the ANA state is really a property of the controller port in those
implementations.

> 3. Whether changing ANATT value change should be allowed via configfs(in 
> particular, on per-subsystem level I think)?
> 
The ANATT value is useful if you have an implementation which takes some
time to perform the switch-over. As there is always a chance of the
switch-over going wrong, the ANATT serves as an upper boundary after
which an ANA state of 'change' can be considered stale, and a re-read is
in order.
So for the Linux implementation it's a bit moot; one would have to
present a use-case where changing the ANATT value would make a difference.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-06 13:59 NVMe over Fabrics host: behavior on presence of ANA Group in "change" state Alex Talker
  2022-02-07  9:46 ` Hannes Reinecke
@ 2022-02-07 11:07 ` Sagi Grimberg
  2022-02-07 15:04   ` Alex Talker
  1 sibling, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2022-02-07 11:07 UTC (permalink / raw)
  To: Alex Talker, linux-nvme



On 2/6/22 15:59, Alex Talker wrote:
> Recently I noticed a peculiar error after connecting from the host
> (CentOS 8 Stream at the time, more on that below)
> via TCP(unlikely matters) to the NVMe target subsystem shared using 
> nvmet module:
> 
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > nvme nvme1: creating 8 I/O queues.
>  > nvme nvme1: mapped 8/0/0 default/read/poll queues.
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > ...(and it continues like that over and over and over again, on some 
> configuration even getting worse with greater iterations of reconnect)
> 
> I discovered that this behavior is caused by code in 
> drivers/nvme/host/multipath.c,
> in particular when function nvme_update_ana_state increments value of 
> variable nr_change_groups whenever any ANA Group is in "change",
> indifference of whether any namespace belongs to the group or not.
> Now, after figuring out that ANATT stands for ANA Transition Time and 
> reading some more of the NVMe 2.0 standards, I understood that the 
> problem caused by how I managed to utilize ANA Groups.
> 
> As far as I remember, permitted number of ANA Groups in nvmet module is 
> 128, while maximum number of namespaces is 1024(8 times more).
> Thus, mapping 1 namespace to 1 ANA Group works only up to a point.
> It is nice to have some logically-related namespaces belong to the same 
> ANA Group,
> and the final scheme of how namespaces belong to ANA groups is often 
> vendor-specific
> (or rather lies in decision domain of the end user of target-related 
> stuff),
> However, rather than changing state of a namespace on specific port, for 
> example for maintenance reasons,
> I find it particularly useful to utilize ANA Groups to change the state 
> of a certain namespace, since it is more likely that block device might 
> enter unusable state or be a part of some transitioning process.

I'm not exactly sure what you are trying to do, but it sounds wrong...
ANA groups are supposed to be a logical unit that expresses a controller's
access state to the namespaces that belong to the group.

> Thus, the simplest scheme for me on each port is to assign few ANA 
> Groups, one per each possible ANA state, and change ANA Group on a 
> namespace rather than changing state of the group the namespace belongs 
> to at the moment.

That is an abuse of ANA groups IMO. But OK...

> And here's the catch.
> 
> If one creates a subsystem(no namespaces needed) on a port, connects to 
> it and then sets state of ANA Group #1 to "change", the issue introduced 
> in the beginning would be reproduced practically on many major distros 
> and even upstream code without and issue,

This state is not a permanent state, it is transient by definition,
which is why the host is treating it as such.

The host is expecting the controller to send another ANA AEN that
notifies the new state within ANATT (i.e. stateA -> change -> stateB).

> tho sometimes it can be mitigated by disabling the "native 
> multipath"(when /sys/module/nvme_core/parameters/multipath set to N) but 
> sometimes that's not the case which is why this issue quite annoying for 
> my setup.

That is simply removing support for multipathing altogether.

> I just checked it on 5.15.16 from Manjaro(basically Arch Linux) and 
> ELRepo's kernel-ml and kernel-lt(basically vanilla versions of the 
> mainline and LTS kernels respectively for CentOSs).
> 
> The standard tells that:
> 
>  > An ANA Group may contain zero or more namespaces
> 
> which makes perfect sense, since one has to create a group prior to 
> assigning it to a namespace, and then:
> 
>  > While ANA Change state is reported by a controller for the namespace, 
> the host should: ...(part regarding ANATT)
> 
> So on one hand I think my setup might be questionable(I might allocate 
> ANAGRPID for "change" only in times of actual transitions, while that 
> might over-complicate usage of the module),

I still don't fully understand what you are trying to do, but creating
a transient ANA group for a change state sounds backwards to me.

> on the other I think it happens to be a misinterpretation of the 
> standard and might need some additional clarification.
> 
> That's why I decided to compose this message first prior to proposing 
> any patches.
> 
> Also, while digging the code, I noticed that ANATT at the moment 
> presented by a random constant(of 10 seconds), and since often 
> transition time differs depending on block devices being in-use 
> underneath namespaces,
> it might be viable to allow end-user to change this value via configfs.

How would you expose it via configfs? ANA groups may be shared across
different ports IIRC. You would need to prevent conflicting settings...

> Considering everything I wrote, I'd like to hear opinions on the 
> following issues:
> 1. Whether my utilization of ANA Groups is viable approach?

I don't think so, but I don't know if I understood what you are trying
to do.

> 2. Which ANA Group assignment schemes utilized in production, from your 
> experience?

ANA groups will usually be used for what they are supposed to express:
a group of zero or more namespaces where each controller may have a
different access state to it (or to the namespaces assigned to it).

> 3. Whether changing ANATT value change should be allowed via configfs(in 
> particular, on per-subsystem level I think)?

Could be... We'll need to see patches.



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-07  9:46 ` Hannes Reinecke
@ 2022-02-07 15:04   ` Alex Talker
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Talker @ 2022-02-07 15:04 UTC (permalink / raw)
  To: linux-nvme

Thanks for the quick reply!

 > That's actually a misinterpretation. The above sentence refers to a
 > device reporting to be in ANA 'change', ie after reading the ANA log
 > and detecting that a given namespace is in a group whose ANA status
 > is 'change'.

Thus, if I understand you correctly, you agree that the ANATT timer
shouldn't be started upon seeing an empty ANA Group (in the "change"
state), but only when at least one namespace belongs to a group in that
state?

 > In your case it might be feasible to not report 'change' at all, but
 > rather do an direct transition from one group to the other. If it's
 > just a single namespace the transition should be atomic, and hence
 > there won't be any synchronisation issues which might warrant a
 > 'change' state.

While that might be possible, I think the "change" state can sometimes
work as a sort of "I/O barrier"
while some changes to the underlying block device are in progress,
and in that case the final state might not be easy to determine before
the operation ends.
Please do feel free to correct me if that's an inappropriate usage!
I'm still interested to hear more details about the synchronization
issues you mentioned, though.

 > Well, it certainly is an odd one, but should be doable. But note,
 > there had been some fixes to the ANA group ID handling; most recently
 > commit 79f528afa939 ("nvme-multipath: fix ANA state updates when a
 > namespace is not present"). So do ensure to have the latest fixes to
 > get the 'best' possible user-experience.

Well, I do agree that my approach is a bit unusual, given that I keep
catching such bugs,
but I checked nvme.git on this platform and the code my setup has issues
with is still there.
The commit you mentioned, as I understand it, addresses some race
condition or the like,
which, abstractly speaking, might as well turn into an infinite loop if
something goes awry,
but I hope I never meet such an issue in production.

 > Typically it's the NVMe controller port which holds the ANA state;
 > for most implementation I'm aware of you have one or more (physical)
 > NVMe controller ports, which hosts the interfaces etc. They connect
 > to the actual storage, and failover is done by switching I/O between
 > those controller ports. Hence the ANA state is really property of
 > the controller port in those implementations.

So, in short, each port has just one group, all namespaces belong to it,
and thus all resources on the port change state at once?
I agree that this is quite a simple setup in its own right; however, I'm
thinking more along the lines of
"if one namespace becomes unavailable, how do I efficiently change only
its state without an over-complicated ANA Group configuration?",
to stay in step with the High Availability concept.
Of course, I could have a personalized ANA Group setup for each
installation/deployment, but I preferred to start with something simple
yet flexible,
since the standard doesn't exactly oppose the idea that changing a
namespace's ANA Group also changes its ANA state.

 > The ANATT value is useful if you have an implementation which takes
 > some time to facilitate the switch-over. As there is always a chance
 > of the switch-over going wrong the ANATT serves as an upper boundary
 > after which an ANA state of 'change' can be considered stale, and a
 > re-read is in order. So for the linux implementation it's a bit
 > moot; one would have to present a use-case where changing the ANATT
 > value would make a difference.

I can't think of a particular example at this moment, but from my
personal experience working with custom storage solutions, some
operations might take quite a while,
and even the standard gives an example with ANATT=30, which I think is a
sort of "magic constant" from the SCSI world, where it needs to be
customized from the initiator's side so that things can hold on a little
longer.
In this regard, I'm thinking more from the perspective of giving an
administrator the ability to customize the value if the need arises,
since I had no idea such a value existed before catching my (admittedly
self-inflicted) error.
For example, I think some operations on DRBD devices might take a fair
amount of time,
and while we may argue about how useful DRBD is in this setup anyway, I
mentioned it purely as an example.
I'm sure that eventually, especially with NVMe HDDs coming to market,
real use cases will become more common, and I find it quite useful to
implement the customization before that.

Best regards,
Alex



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-07 11:07 ` Sagi Grimberg
@ 2022-02-07 15:04   ` Alex Talker
  2022-02-07 22:16     ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Talker @ 2022-02-07 15:04 UTC (permalink / raw)
  To: linux-nvme

 > I'm not exactly sure what you are trying to do, but it sounds
 > wrong... ANA groups are supposed to be a logical unit that expresses
 > controllers access state to the associated namespaces that belong to
 > the group.

I do agree that my setup might seem odd, but I doubt it contradicts your
statement much,
since each group still represents the state of the namespaces belonging
to it;
the difference is just that instead of having a complex (or should I say
deployment-specific)
relationship between a namespace and an ANA group, I opted for a balance
between the flexibility of assigning a state per namespace
and having a constant set of ANA groups on each system.
In my view, it is a rather common situation that one namespace has
trouble while the others don't, and thus it had better become unavailable
on all ports at once,
rather than a certain port needing to deny access to certain namespaces
for, say, maintenance reasons.

 > That is an abuse of ANA groups IMO. But OK...

I do not disagree, but the standard seems to allow it.
Let me try to explain my perspective with an analogy that may be more
familiar to you.
As you are probably aware, with ALUA in SCSI one can, via the Target Port
Groups mechanism, specify a certain (ALUA) state for a LUN across a set
of targets without any worry (at least in the SCST implementation).
I'm not sure about the exact limits, but I think it's quite easy to keep
up a 1 LUN = 1 group ratio there for flexible control.
However, as I highlighted in an earlier message, the nvmet implementation
allows only 128 ANA Groups, while each(!) subsystem may hold up to 1024
namespaces.
So while I would have no issue with, say, assigning a group to each
namespace (assuming NSIDs are globally unique on my target), that is
currently not possible,
and I'm trying to do my best within these restrictions while keeping the
ANA Group setup as straightforward as possible.
One may argue that I should dump everything into one ANA Group, but that
would contradict my expectation of High Availability for namespaces that
are still (mostly?) working while others aren't.
One may also argue that it's rare to have more than 128 namespaces in
total in production, but I would still prefer to support 1024 anyway.
Hope that clears it up; do feel free to correct me if I have a flaw
somewhere.

 > This state is not a permanent state, it is transient by definition,
 > which is why the host is treating it as such.
 >
 > The host is expecting the controller to send another ANA AEN that
 > notifies the new state within ANATT (i.e. stateA -> change ->
 > stateB).

As Hannes mentioned, and I agree, the state is indeed transient, but only
in relation to a namespace,
so I find it to be zero issue to have a group in the change state with 0
namespaces as its members.
I understand that it would be nice and dandy to change the state of
multiple namespaces at once (if one takes the time to configure such a
dependency between them),
but for the moment I opt for a simpler yet flexible solution, maybe at
the cost of a greater number of ANA log changes in the worst case.
Thus, the cycle "namespace in state A" => "namespace in change state" =>
"namespace in state B" is still preserved, just by a different method
(changing the namespace's group rather than the state of the group).

 > That is simply removing support for multipathing altogether.

You're not wrong on that one, though, no offense, for certain
configurations or certain initiators that's the way to go,
especially when it might just be a matter of switching from one
implementation to another (i.e. good old dm-multipath).
I mainly mentioned this because it fixes the issue on some kernels
(including mainline/LTS) while not on others,
which is why I think it's important that this misinterpretation of the
standard is addressed in the mainline code,
since I can't possibly patch every single back-ported kernel out there
(I'm personally looking at the CentOS world right now),
while those might be the end users of my target setups.
My territory is mainly the target, and this is not an issue I can fix on
my side.
Besides, the handling of my case differs from the standard path anyway
right now.

 > I'm still don't fully understand what you are trying to do, but
 > creating a transient ANA group for a change state sounds backwards to
 > me.

As I stated, I'm just trying to work within the present limitations,
which I suppose were chosen with performance or something similar in mind.

 > Could be... We'll need to see patches.

On that note, I have seen plenty of git-related mails around here,
so would it be possible to publish patches as a few commits based on the
mainline or infradead git repo, on GitHub or the like?
Or is it mandatory to go, no offense, the old-fashioned way of sending
patch files as attachments or text?
I work with git 99.9% of the time and the former would be easier for me.

Best regards,
Alex



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-07 15:04   ` Alex Talker
@ 2022-02-07 22:16     ` Sagi Grimberg
  2022-02-10 13:46       ` Alex Talker
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2022-02-07 22:16 UTC (permalink / raw)
  To: Alex Talker, linux-nvme

Alex,

Can you please stop top-posting? It's difficult to follow this
copy-pasting-top-posting chain that you are generating.

>  > I'm not exactly sure what you are trying to do, but it sounds
>  > wrong... ANA groups are supposed to be a logical unit that expresses
>  > controllers access state to the associated namespaces that belong to
>  > the group.
> 
> I do agree that my setup might seem odd but I doubt it contradicts your 
> statement much,
> since each group would represent state of namespaces belonging to it,
> the difference is just that instate of having a complex(or should I say 
> one depending on installation/deployment)
> relationship between a namespace and an ANA group, I opted for the 
> balancing act between flexibility of assigning state for a namespace
> and having a constant set of ANA groups on each system.
> In my view, it is rather often situation when one namespace has troubles 
> while others aren't and thus it better be unavailable on all ports at once,
> rather than when certain port needs to deny access to certain namespaces 
> for, say, maintenance issues.

Not exactly sure what you are alluding to. I didn't suggest anything
about any static configuration. I was just explaining what ANA groups
are expressing. They're there, you are free to use them however you
like.

>  > That is an abuse of ANA groups IMO. But OK...
> 
> I do not disagree but so seems to do the standard.

The standard abuses ANA groups?

> But let me try to explain my perspective in possibly more familiar 
> analogy to you.
> As you probably aware, with ALUA in SCSI, via Target Port Groups 
> mechanism, one can with zero worry specify certain LUN (ALUA) state on a 
> set of targets(at least in SCST implementation).
> I ain't sure about certain limitations but I think it's quite easy to 
> keep up with 1 LUN = 1 group ratio for flexible control.
> However, as I highlighter in earlier message, in nvmet implementation 
> there's allowed only 128 ANA Groups, while (each!) subsystem may keep up 
> to 1024 namespaces.

If your use-case needs more than 128 groups, you can send a patch.

> Thus, if I had no issue of say assigning a group per each 
> namespace(assuming that NSIDs are globally unique on my target), this is 
> currently not the case,
> so I'm trying my best of out in these restrictions, while keeping ANA 
> Group setup as straightforward, as possible.
> One may argue that I shall dump everything into one ANA Group but it 
> will contradict my expectations of High Availability of namespaces that 
> are still (mostly?) working while others aren't.

I don't understand where HA/multipathing come into play here at all, let
alone asymmetric access. But it doesn't really matter.

> One also may argue that it's rare to have in production greater number 
> of namespaces than 128 in total but I still would prefer to go for 
> support of 1024 anyway.
> Hope I cleared that one out, do feel free to correct me if I have a flaw 
> somewhere.
> 
>  > This state is not a permanent state, it is transient by definition,
>  > which is why the host is treating it as such.
>  >
>  > The host is expecting the controller to send another ANA AEN that
>  > notifies the new state within ANATT (i.e. stateA -> change ->
>  > stateB).
> 
> As mentioned by Hannes, and I agree, state is indeed transient but only 
> in relation to a namespace,

Where did you get that from the spec? How can a state be transient or
persistent depending on whether the ANA group has zero namespaces or not?
It is completely orthogonal. Any relationship between the ANA group state
lifetime and the number of namespaces that belong to the group makes no
sense to me at all tbh.

> so I find it to be zero issue of having a group in change state with 0 
> namespaces as its members.

What is "zero issue"? you mean a non-issue?
The spec defines this state as a state that represents transition
between states, and hence its not surprising that the host expects it to
be as such. IMO the current host behavior is correct.

> I understand that it would be nice and dandy to change state of multiple 
> namespaces at once(if one can take time to configure such dependency 
> between them),
> but I at the moment opt for simpler but flexible solution, maybe at the 
> cost of greater number of ANA log changes in worst-case scenario.
> Thus, the cycle "namespace in state A" => "namespace in state of change" 
> => "namespace in state B" is still preserved, tho with different 
> methods(change of a group rather than a state of the group).

nvmet is actually violating the spec right now because it doesn't set
bit 6 in the ctrl identify anacap, and it clearly exposes anagrpid as a
config knob, so either we need to block it, or set that bit.

In any event, you can move namespaces between ANA groups as much as
you like; you don't need the change state at all, just don't use it,
especially if you keep it permanently, which is not what the host or the
spec expects.

>  > That is simply removing support for multipathing altogether.
> 
> You're not wrong on that one, tho, no offense,

None taken :)

> in certain configurations or certain initiators that's a way to go.
> Especially when it might be a matter of changing one implementation to 
> another(i.e. old good dm-multipath).

Whatever works for you...

> I mainly mentioned this because it fixes the issue on some 
> kernels(including mainline/LTS) while not on others,
> which is why I think it's important that misinterpretation of the 
> standard will be accounted for on the mainstream code
> since I can't possibly patch every single thing that lives on back-ports 
> for it(I personally look at CentOS world rn),
> while it might be the end user of my target setups.
> My territory is mainly the target and this is not the issue I can fix on 
> my side.

Again, there is no issue here that I see. The only issue I see right now
is that nvmet is allowing namespaces to change anagrpid while telling
the host that it won't change.

>  > Could be... We'll need to see patches.
> 
> On that regard, I have seen plenty of git-related mails around here,
> so would it be possible to publish patches as a few commits based on 
> mainline or infradead git repo on GitHub or something?

Not really.

> Or is it mandatory to go, no offense, the old-fashioned way of sending 
> patch files as attachments or text?

No attachments please, follow the instructions in:
Documentation/process/submitting-patches.rst



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-07 22:16     ` Sagi Grimberg
@ 2022-02-10 13:46       ` Alex Talker
  2022-02-10 22:24         ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Talker @ 2022-02-10 13:46 UTC (permalink / raw)
  To: linux-nvme

 > Can you please stop top-posting, its difficult to follow this
 > copy-pasting-top-posting chain that you are generating..

My bad, I just had a hard time taming my mail client with all these
plain-text requirements.
Also, I was a bit confused about how to handle branching-off replies :)


In short, after consulting the NVMe base specification 2.0b, my reading
is the following:

a) (8.1.2, page 340) "Namespaces that are members of the same ANA Group
perform identical asymmetric namespace access state transitions.
The ANA Group maintains the same asymmetric namespace access state for
all namespaces that are members of that ANA Group
[...] The method for assigning namespaces to ANA Groups is outside the
scope of this specification."
b) (8.1.2, page 340) "An ANA Group may contain zero or more
namespaces [...] The mapping of namespaces, [...] to ANA Groups is vendor
specific."
c) (Figure 280, page 270) "ANAGRPID [...]
If the value in this field changes and Asymmetric Namespace Access
Change Notices are supported and enabled,
then the controller shall issue an Asymmetric Namespace Access Change
Notice."
d) (8.1.3.5, page 343) "While ANA Change state is reported by a
controller for the namespace, the host should: [...part about ANATT...]"

Thus, as I see it, the ANATT-based timer should be started only on the
condition that a namespace belongs to a group in this state,
while the relation between a namespace and its ANA state can change
either because the ANA Group's state changes (which affects all of the
group's members)
or because the ANAGRPID changes (which, if the new group's ANA state
differs from the old one, affects only one namespace at a time).
From that, I find it logical to treat empty ANA Groups in the "change"
state as a no-op, since I don't see the standard explicitly disallowing
that behavior in any way whatsoever.

Now, moving to the rest...

 > If your use-case needs more than 128 groups, you can send a patch.

I thought these constants were chosen for optimization reasons or
something, but I'm no expert on the code base :)
In that case, what is your opinion on adding the number of groups and
namespaces as module parameters to nvmet, similar to "multipath" (i.e.
settable only at module load)?
As I saw in an earlier discussion here on a different matter, there are
production systems that provide a huge number of namespaces (and possibly
groups too),
and these extra parameters would allow easy mock-ups of such setups, to
look for bottlenecks and the like, without re-patching the code for each
try.
I would appreciate your advice on possible disadvantages of such an
approach!
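Just to make the idea concrete, a rough sketch of what I mean
(illustrative only: the parameter names are made up, and a real patch
would have to audit everything that is currently sized off the
compile-time limits in nvmet.h):

/* Sketch only: load-time knobs instead of compile-time limits. */
static unsigned int max_ana_groups = 128;
module_param(max_ana_groups, uint, 0444);
MODULE_PARM_DESC(max_ana_groups,
	"Maximum number of ANA Groups per port (default: 128)");

static unsigned int max_namespaces = 1024;
module_param(max_namespaces, uint, 0444);
MODULE_PARM_DESC(max_namespaces,
	"Maximum number of namespaces per subsystem (default: 1024)");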

 > nvmet is actually violating the spec right now because it doesn't
 > set bit 6 in ctrl identify anacap, and it clearly exposes anagrpid as
 > a config knob, so either we need to block it, or set that bit.

So, my "expertise" is lacking on this one but if I get what you saying 
correctly,
it means that for past hell knows how much time I could change the ANA 
group on target, AEN(I think? the notification functionality I mean) 
would be successfully sent
and hosts were quite okay with what's going on but somebody forgot to 
advertise such capability in the controller's info?
I literally used my workaround on these pre-"native multipath" host 
times(I'm looking on CentOS 7 for example) and it seem to went okay, so 
I'd vote for the bit.
So...this is like +1-2 patch(-es)?
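If I read the code right, the "set that bit" option would be something on
the order of the following (rough sketch, not a tested patch; anacap is
the field name in struct nvme_id_ctrl, and bit 6 is the "ANAGRPID may
change" capability):

/* in nvmet's identify-controller handler: advertise that ANAGRPID may
 * change, since configfs already allows changing it
 */
id->anacap |= 1 << 6;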

 > No attachments please, follow the instructions in:
 > Documentation/process/submitting-patches.rst

Gotcha, already working on it.
I tested the straightforward patch on the latest release and it seems to
work as expected
(i.e. the ANATT timer triggers only for non-empty "change" state groups),
so I'm now at the stage of trying out 'git send-email' :)

Thanks for your time anyway!

Best regards,
Alex



* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-10 13:46       ` Alex Talker
@ 2022-02-10 22:24         ` Sagi Grimberg
  2022-02-11  2:08           ` Knight, Frederick
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2022-02-10 22:24 UTC (permalink / raw)
  To: Alex Talker, linux-nvme, Knight, Frederick


>  > Can you please stop top-posting, its difficult to follow this
>  > copy-pasting-top-posting chain that you are generating..
> 
> My bad, I just had a hard time taming my mail client in all this 
> plain-text mode requirements.
> Also, I was a bit confused on how to handle branching-off replies :)
> 
> 
> In short, my opinion, after consulting with NVMe base specification 2.0b 
> stating is that:
> 
> a) (8.1.2, page 340) "Namespaces that are members of the same ANA Group 
> perform identical asymmetric namespace access state transitions.
> The ANA Group maintains the same asymmetric namespace access state for 
> all namespaces that are members of that ANA Group
> [...] The method for assigning namespaces to ANA Groups is outside the 
> scope of this specification."
> b) (8.1.2, page 340) "An ANA Group may contain zero or more 
> namespaces [...] The mapping of namespaces, [...] to ANA Groups is vendor
> specific."
> c) (Figure 280, page 270) "ANAGRPID [...]
> If the value in this field changes and Asymmetric Namespace Access 
> Change Notices are supported and enabled,
> then the controller shall issue an Asymmetric Namespace Access Change 
> Notice."
> d) (8.1.3.5, page 343) "While ANA Change state is reported by a 
> controller for the namespace, the host should: [...part about ANATT...]"
> 
> Thus I see it that ANATT-based timer should be started only upon 
> condition that a namespace belongs to a group in this state
> but change of relation between a namespace and it's ANA state can occur 
> either because ANA Group state has changes(and this would affect all of 
> the group members)
> or when ANAGRPID is changed(and this, if the new group's ANA state 
> differs from the old one, affects only one namespace at a time).
>  From that, I find it is logical to no-op on the empty ANA Groups in 
> "change" state since I don't see the standard explicitly disallowing 
> that behavior in any way whatsoever.

Sorry, at least I and the original implementer (Christoph) disagree with
your interpretation. Not to say that we are both wrong.

Adding Fred for some more clarity here.




* RE: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-10 22:24         ` Sagi Grimberg
@ 2022-02-11  2:08           ` Knight, Frederick
  2022-02-11  9:21             ` Alex Talker
  0 siblings, 1 reply; 15+ messages in thread
From: Knight, Frederick @ 2022-02-11  2:08 UTC (permalink / raw)
  To: Sagi Grimberg, Alex Talker, linux-nvme


>  > Can you please stop top-posting, its difficult to follow this  > 
> copy-pasting-top-posting chain that you are generating..
>
> My bad, I just had a hard time taming my mail client in all this 
> plain-text mode requirements.
> Also, I was a bit confused on how to handle branching-off replies :)
>
>
> In short, my opinion, after consulting with NVMe base specification 
> 2.0b stating is that:
>
> a) (8.1.2, page 340) "Namespaces that are members of the same ANA 
> Group perform identical asymmetric namespace access state transitions.
> The ANA Group maintains the same asymmetric namespace access state for 
> all namespaces that are members of that ANA Group [...] The method for 
> assigning namespaces to ANA Groups is outside the scope 
> of this specification."
> b) (8.1.2, page 340) "An ANA Group may contain zero or more 
> namespaces [...] The mapping of namespaces, [...] to ANA Groups is vendor 
> specific."
> c) (Figure 280, page 270) "ANAGRPID [...] If the value in this field 
> changes and Asymmetric Namespace Access Change Notices are supported 
> and enabled, then the controller shall issue an Asymmetric Namespace 
> Access Change Notice."
> d) (8.1.3.5, page 343) "While ANA Change state is reported by a 
> controller for the namespace, the host should: [...part about ANATT...]"
>
> Thus I see it that ANATT-based timer should be started only upon 
> condition that a namespace belongs to a group in this state but change 
> of relation between a namespace and it's ANA state can occur either 
> because ANA Group state has changes(and this would affect all of the 
> group members) or when ANAGRPID is changed(and this, if the new 
> group's ANA state differs from the old one, affects only one namespace 
> at a time).
>  From that, I find it is logical to no-op on the empty ANA Groups in 
> "change" state since I don't see the standard explicitly disallowing 
> that behavior in any way whatsoever.

Sorry, at least I and the original implementer (Christoph) disagree with your interpretation. Not to say that we are both wrong.

Adding Fred for some more clarity here.

[FK> ] I think I'm missing a bunch of context here. What is the original
question? I'll take a stab at some assumptions:

What is an empty ANA group? That is an ANA Group with NO NSIDs associated
with that group, meaning the "Number of NSID Values" field is cleared to
0h in the ANA Group Descriptor. That descriptor can be used to update
some host-internal state information related to that ANA group, but it
has no impact on any I/O, because there can be no I/O (since there are no
NSID values). So I'm not sure where that is going (because RGO=1 can also
return ANA Groups that have state but no attached namespaces; it's a way
to get group state without any NSID inventory requirements).

> ... or when ANAGRPID is changed(and this, if the new 
> group's ANA state differs from the old one, affects only one namespace 
> at a time).

Now this treads into the TP 4108 space. There is currently no way to
report anything that impacts "only one namespace at a time". ANY report
of a change (AEN) for any namespace is always reporting a state change
for the entire group that contains the namespace where the event
occurred. That is the WHOLE POINT of ANA Groups. AND, that is the whole
point of TP4108: to address that kind of situation (where a change
impacts only 1 namespace). Until TP4108 addresses this situation, a
single namespace changing its ANAGRPID is ugly. Maybe we should get to
work on that TP.

Also remember that the "Change" state does NOT have to be visible to the
host; the controller can transition through the Change state before the
host ever notices. So, to the host, it can look like all the namespaces
in the Group transition directly from optimized to non-optimized (as one
example).

Not sure if that helps any. What was the original question?

	Fred


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-11  2:08           ` Knight, Frederick
@ 2022-02-11  9:21             ` Alex Talker
  2022-02-11 16:58               ` Knight, Frederick
  2022-02-14 13:16               ` Hannes Reinecke
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Talker @ 2022-02-11  9:21 UTC (permalink / raw)
  To: linux-nvme, Sagi Grimberg, Knight, Frederick

 > [FK> ] I think I'm missing a bunch of context here. What is the
 > original question? I take a stab at some assumptions: What is an
 > empty ANA group? That is an ANA Group with NO NSIDs associated with
 > that group. Meaning the "Number of NSID Values" field is cleared to
 > '0h' in the ANA Group Descriptor. That descriptor can be used to
 > update some host internal state information related to that ANA
 > group, but it has no impact on any I/O because there can be no I/O
 > (since there are no NSID values). So I'm not sure where that is
 > going (because RGO=1 also can return ANA Groups that have state, but
 > no attached namespaces (it's a way to get group state without any
 > NSID inventory requirements)).

That's exactly right, the "nnsids=0" case. I/O is not a problem for such a 
group, for sure.
I suppose the main argument we're having here is that when such a group 
has the "change" ANA state,
the host (the "nvme-core" module) starts an ANATT timer which, upon 
expiration, resets the controller.
Now, I do not disagree that having such a group is "ugly", but rather 
argue that the ANATT-related functionality could be invoked only for the 
"nnsids>0" case,
since only then is there a relation between the "change" state and a 
namespace via the "ANAGRPID".

My approach for assigning ANA groups to namespaces involves the idea 
that on one node (i.e. "system") a namespace usually has the same state 
on every port,
since it's more likely that the access state of the namespace would 
change rather than what it is accessed through (the port),
so I simply pre-allocate 5 ANA groups, one per each of the 5 currently 
possible ANA states, on each port and then change the "ANAGRPID" of a 
namespace to transition it from one state to another.
While it is perfectly possible, as highlighted earlier, to transition 
bypassing the "change" state,
it is still preferable in my opinion in situations where the final state 
is not known a priori,
and thus works as a graceful guard against the host's I/O. This is why I 
opt to pre-allocate a group for this state too,
however on modern versions of popular distributions that causes the 
reset issue described before,
which might have an undetermined impact on my I/O in progress.
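
To make that concrete, here is a purely illustrative model of my mapping 
(the group numbers are arbitrary and nothing mandates them; the state 
values follow my reading of the spec):

    #include <stdio.h>

    /* ANA states as defined by the spec (values 01h-04h plus 0Fh for Change). */
    enum ana_state {
            ANA_OPTIMIZED       = 0x01,
            ANA_NONOPTIMIZED    = 0x02,
            ANA_INACCESSIBLE    = 0x03,
            ANA_PERSISTENT_LOSS = 0x04,
            ANA_CHANGE          = 0x0f,
    };

    /* One pre-allocated ANA group per state on each port (hypothetical IDs). */
    static unsigned int anagrpid_for(enum ana_state state)
    {
            switch (state) {
            case ANA_OPTIMIZED:       return 1;
            case ANA_NONOPTIMIZED:    return 2;
            case ANA_INACCESSIBLE:    return 3;
            case ANA_PERSISTENT_LOSS: return 4;
            case ANA_CHANGE:          return 5;
            }
            return 0;
    }

    int main(void)
    {
            /* "Transitioning" a namespace means re-assigning its ANAGRPID. */
            printf("to signal 'change', move the namespace to group %u\n",
                   anagrpid_for(ANA_CHANGE));
            return 0;
    }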

Thus, I find starting the ANATT timer redundant when "nnsids=0".
I think the only users such a change might affect are those who use this 
as a dirty hack to reset the controller on the host (though when would 
that be helpful?).
Otherwise, I have prepared & checked on mainline a simple (+2 lines, 
-2 lines) patch that fixes this behavior,
so I might send it if it's preferable to have this discussion around an 
actual change.

 > Now this treads into the TP 4108 space. There is currently no way to
 > report anything that impacts "only one namespace at a time". ANY
 > report of a change (AEN) for any namespace is always reporting a
 > state change for the entire group that contains the namespace where
 > the event occurred. That is the WHOLE POINT of ANA Groups. AND,
 > that is the whole point of TP4108 - to address that kind of situation
 > (where a change impacts only 1 namespace). Until TP4108 address this
 > situation, a single namespace changing the ANAGRPID is ugly. Maybe
 > we should get to work on that TP.

I'm not a member of the committee or anything (unfortunately), so I have 
no idea what TP 4108 is about or where to find it.
But my main point in that passage was not about how little data would be 
exchanged between target & hosts, but rather about how many namespaces 
would have the relation between them and their associated ANA state 
changed, so as to highlight the contrast between changing the ANA state 
of a group and changing the ANAGRPID of a namespace.
Again, I do not disagree that it's ugly, but as to why I can't just go 
and assign each namespace (assuming NSID is global on my target system 
rather than per subsystem) a separate ANA Group: there is the 8-fold 
difference between the allowed numbers of the former and the latter. I 
proposed parametrizing that in a previous message but unfortunately got 
no reply in that regard.

Hope that more or less clears things up.

Thanks for your time!

Best regards,
Alex



^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-11  9:21             ` Alex Talker
@ 2022-02-11 16:58               ` Knight, Frederick
  2022-02-11 20:53                 ` Alex Talker
  2022-02-14 13:16               ` Hannes Reinecke
  1 sibling, 1 reply; 15+ messages in thread
From: Knight, Frederick @ 2022-02-11 16:58 UTC (permalink / raw)
  To: Alex Talker, linux-nvme, Sagi Grimberg

> 
> 
>  > [FK> ] I think I'm missing a bunch of context here. What is the  > original
> question? I take a stab at some assumptions: What is an  > empty ANA
> group? That is an ANA Group with NO NSIDs associated with  > that group.
> Meaning the "Number of NSID Values" field is cleared to  > '0h' in the ANA
> Group Descriptor. That descriptor can be used to  > update some host
> internal state information related to that ANA  > group, but it has no impact
> on any I/O because there can be no I/O  > (since there are no NSID values).
> So I'm not sure where that is  > going (because RGO=1 also can return ANA
> Groups that have state, but  > no attached namespaces (it's a way to get
> group state without any  > NSID inventory requirements)).
> 
> That's exactly right, "nnsids=0" case. I/O is not a problem for such a group, for
> sure.
> I suppose the main argument we're having here is that when such a group
> has a "change" ANA state, the host("nvme-core" module) starts a timer for
> ANATT which upon expiration resets the controller.
> Now, I do not disagree that having such a group is "ugly" but rather argue
> that ANATT-related functionality could be only invoked for "nnsids>0" case,
> since only then there's a relation between "change" state and a namespace
> via "ANAGRPID".
> 
> My approach for assigning ANA groups to namespaces involves and idea that
> on one node(i.e. "system") casually a namespace has the same state on
> every port, since it's more likely that access state of the namespace would
> change, rather than what's it accessed through (the port), so I simply pre-
> allocate 5 ANA groups per 5 possible at the moment ANA states on each port
> and then change "ANAGRPID" of a namespace to transition it from one state
> to another.

[FK> ] I'm not sure I understand that.  Access state is always based on the port, and ANA is totally about different access states on different ports.  If it was always the same on every port, then it would be symmetric and there would be no need for ANA.  The point of the ANAGRPID is so the host can use a change of state reported for one namespace to also recognize that an equivalent change has occurred for all other namespaces that have the same ANAGRPID.

> While it is perfectly possible as highlighter earlier to transition bypassing
> "change" state, it is still preferable in my opinion in situations when the final
> state is not known "a priori", and thus works as a graceful guard from host's
> I/O. This is why I opt to pre-allocate one for this state too, however on
> modern versions of popular distributions that causes the reset issue
> described before, which might have undetermined impact on my I/O in
> progress.
> 
> Thus, I find starting the ANATT timer redundant when "nnsids=0".
> I think the only users such a change might affect if someone uses this as a
> dirty hack to reset controller on host(when that would be helpful tho?).
> Otherwise, I have prepared & checked on the mainline a simple(+2 lines,
> -2 lines) patch that fixes this behavior, so I might sent it if it's preferable to
> have this discussion around an actual change.
> 
>  > Now this treads into the TP 4108 space. There is currently no way to  >
> report anything that impacts "only one namespace at a time". ANY  > report
> of a change (AEN) for any namespace is always reporting a  > state change for
> the entire group that contains the namespace where  > the event occurred.
> That is the WHOLE POINT of ANA Groups. AND,  > that is the whole point of
> TP4108 - to address that kind of situation  > (where a change impacts only 1
> namespace). Until TP4108 address this  > situation, a single namespace
> changing the ANAGRPID is ugly. Maybe  > we should get to work on that TP.
> 
> I ain't no member of a committee or something(unfortunately), so I have no
> idea what TP 4108 is about or where to find it.
> But my main message on this passage was not in a sense how little data
> would be exchanged between target & hosts but rather for how many
> namespace relation between them and associated with them ANA state
> would change, as to highlight the contrast between changing ANA state of a
> group and changing ANAGRPID of a namespace.
> Again, I do not disagree that it's ugly but on the matter why I can't just go an
> assign each namespace(assuming NSID is global on my target system rather
> than one of the subsystems) a separate ANA Group due to 8 times
> difference between allowed number of the first and the latter, I proposed to
> parametrize that in previous message but got no reply in that regard
> unfortunately.

[FK> ] It would be fine for a host to track each NSID individually, but they are unique only to a single NVM subsystem (if your host is connected to an NVM subsystem from vendor 1, and also to an NVM subsystem from vendor 2, then an NSID on the first subsystem is a DIFFERENT namespace than the same NSID on the other NVM subsystem).  Dispersed namespaces are a different topic for a different thread.  And how a host does groupings of namespaces and how the ANAGRPID is defined in the spec are independent.

Right now if a namespace changes its ANAGRPID, there is 1 AEN required - for the ANA Log page contents changed (the NAMESPACE data changed AEN is prohibited for this case). But, if the ANA changes in the log page cause any groups to enter CHANGE state, then all namespaces in that ANA Group are in the CHANGE state - not just the 1 namespace for which the ANAGRPID value changed. So for storage that can instantaneously change the ANAGRPID, the change is just about inventory.  But, for storage that takes time to move things around, the whole "source" ANA group may enter CHANGE state (AEN), so the one NSID can be removed (maybe another AEN), then the "destination" ANA group enters CHANGE state (maybe another AEN), the "source" ANA group can go out of CHANGE state (maybe another AEN), the "destination" ANA group has the NSID added (maybe another AEN), and that "destination" ANA group can go out of CHANGE state (maybe another AEN) - that means stopping all commands to all the namespaces in both groups at some point during that "move" process.  How many changes happen (vs. how many steps are combined), and how many AENs happen depends on how long it takes, how many steps are merged vs. independent, and how the host responds during that process.  But no matter how it progresses, that process is ugly, and something we wanted to optimize (via TP4108).  We hoped to create a way to optimize that transition.

As for a group with zero attached namespaces - a host that uses RGO=0 will not get any state information about that group (it will simply NOT be returned in the log page).  If however, the host uses RGO=1, then the host gets back a list of all groups and their states (and there aren't ANY NSID values returned at all); meaning, there is no way to determine from that data alone if there are any attached namespaces or not.  The point of RGO=1 is to be able to update the state of the groups without having to parse all the NSID information (just so it can be ignored).

SO, what should happen for an ANA GROUP that has no namespaces when that group enters CHANGE state?  I don't see why it should be any different from any other group.  I'm not convinced a group with 0 namespaces is allowed to have any different behavior from a group with 1 namespace attached. No group should remain in the CHANGE state any longer than the ANATT timer value.  However, when I read section 8.10.4 Host ANA Change Notice operation (NVMe Base Spec 2.0), all the recovery actions are described in the context of sending commands to a namespace in the ANA Group, or the retries of commands being sent to a namespace in the ANA Group.  Obviously, that will never happen for an ANA Group with no namespaces.  EVEN the worst case scenario says: "If the ANATT time interval expires, then the host should use a different controller for sending commands to the namespaces in that ANA Group."  It's still about commands sent to namespaces.  NOWHERE does that text suggest a reset.  If an ANATT timeout occurs - it says pick a different path for sending commands to the namespaces in that group (which is obviously a no-op when the group has no namespaces).

So if the timer is not started (because there are 0 namespaces attached) - and a namespace does come along (added to an ANA group that is still in the CHANGE state), would the timer start when the first command is sent to that namespace (and it fails with the Asymmetric Access Transition)?  That seems fine.

> 
> Hope that more or less cleared things out.
> 
> Thanks for your time!
> 
> Best regards,
> Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-11 16:58               ` Knight, Frederick
@ 2022-02-11 20:53                 ` Alex Talker
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Talker @ 2022-02-11 20:53 UTC (permalink / raw)
  To: Knight, Frederick, linux-nvme, Sagi Grimberg

Thanks for taking the time to give such an in-depth explanation! Now...

 > [FK> ] I'm not sure I understand that. Access state is always based
 > on the port, and ANA is totally about different access states on
 > different ports. If it was always the same on every port, then it
 > would be symmetric and there would be no need for ANA. The point of
 > the ANAGRPID is so the host can use a change of state reported for
 > one namespace to also recognize that an equivalent change has also
 > occurred for all other namespaces that have the same ANAGRPID.

I just meant that my setup is a little bit dumb.
In all previous messages I was talking in the context of only one 
node ("installation"), but in the bigger picture it's actually more of a 
cluster-like configuration.
Thus it is often the case (in my experience) that one namespace (i.e. the 
underlying block device) needs separate attention at a given moment while 
the others don't,
and ports tend to disappear as a whole (for example due to a broken 
cable) rather than only part of the namespaces becoming unavailable on 
one of them.
Hence I opted for such a group configuration.
I do understand that the standard aims for a more flexible approach, and 
that's okay; it's just too advanced for my application of this functionality.
One can also assume that a namespace's NSID is global on such a system, 
again purely for simplicity.
And I do set the NGUID to the same value between nodes (when it's 
possible to have a shared block device present on all of them), so it's 
all fine and dandy in that regard.

 > [FK> ] It would be fine for a host to track each NSID individually,
 > but they are unique only to a single NVM subsystem (if your host is
 > connected to an NVM subsystem from vendor 1, and also to an NVM
 > subsystem from vendor 2, then an NSID on the first subsystem is a
 > DIFFERENT namespace than the same NSID on the other NVM subsystem).
 > Dispersed namespaces are a different topic for a different thread.
 > And how a host does groupings of namespaces and how the ANAGRPID is
 > defined in the spec are independent.

The last statement precisely explains all the rest, since again, it's 
just my own setup and my own choice of how to map things,
so as I highlighted above, in my case an equal NGUID would likely yield 
an equal NSID between different subsystems
(which might be set up in order to give different sets of resources to 
different hosts, since the list of allowed hosts is configured per 
subsystem in the nvmet implementation).
I probably should have written a clearer explanation earlier, sorry for 
the distraction.


 > Right now if a namespace changes its ANAGRPID, there is 1 AEN
 > required - for the ANA Log page contents changed (the NAMESPACE data
 > changed AEN is prohibited for this case). But, if the ANA changes in
 > the log page cause any groups to enter CHANGE state, then all
 > namespaces in that ANA Group are in the CHANGE state - not just the 1
 > namespace for which the ANAGRPID value changed. So storage that can
 > instantaneously change the ANAGRPID, the change is just about
 > inventory. But, for storage that takes time to move things around,
 > the whole "source" ANA group may enter CHANGE state (AEN), so the one
 > NSID can be removed (maybe another AEN), then the "destination" ANA
 > group enters CHANGE state (maybe another AEN), the "source" ANA group
 > can go out of CHANGE state (maybe another AEN), the "destination" ANA
 > group has the NSID added (maybe another AEN), and that "destination"
 > ANA group can go out of CHANGE state (maybe another AEN) - that means
 > stopping all commands to all the namespaces in both groups at some
 > point during that "move" process. How many changes happen (vs. how
 > many steps are combined), and how many AENs happen depends on how
 > long it takes, how many steps are merged vs. independent, and how the
 > host responds during that process. But no matter how it progresses,
 > that process is ugly, and something we wanted to optimize (via
 > TP4108). We hoped to create a way to optimize that transition.

So, did I get it right that it is advised to put ANA groups in the 
"change" state when changing the ANAGRPID (in the sense of a namespace 
attribute)?
Or did I completely lose the plot?
In any case, I sincerely hope that whatever is going on in that document, 
which I definitely have no access to, is for the best!
I suppose I do get the basics of ANA groups, though (in the sense that 
the state changes for all group members at once), but thanks for the 
explanation anyway.

 > As for a group with zero attached namespaces - a host that uses RGO=0
 > will not get any state information about that group (it will simply
 > NOT be returned in the log page). If however, the host uses RGO=1,
 > then the host gets back a list of all groups and their states (and
 > there aren't ANY NSID values returned at all); meaning, there is no
 > way to determine from that data alone if there are any attached
 > namespaces or not. The point of RGO=1 is to be able to update the
 > state of the groups without having to parse all the NSID information
 > (just so it can be ignored).

Now I once again learned something new! So I get that RGO is an 
optimization, which is nice.
However, the piece of code I'm having problems with in this 
implementation (nvmet.ko) seems to opt for RGO=0,
but I'm not completely sure. I came to this conclusion based on the fact 
that nnsids is checked
within the function I'm trying to patch (nvme_update_ana_state) and it 
clearly comes from the log.
Someone with more familiarity with the code base might give an idea of 
whether RGO=1 is the case or whether it depends.

 > SO, what should happen for an ANA GROUP that has no namespaces when
 > that group enters CHANGE state. I don't see why it should be any
 > different than any other group. I'm not convinced a group with 0
 > namespaces is allowed to have any different behavior than a group
 > with 1 namespace attached. No group should remain in the CHANGE state
 > any longer than the ANATT timer value. However, when I read section
 > 8.10.4 Host ANA Change Notice operation (NVMe Base Spec 2.0), all the
 > recovery actions are described in the context of sending commands to
 > a namespace in the ANA Group, or the retries of commands being sent
 > to a namespace in the ANA Group. Obviously, that will never happen
 > for an ANA Group with no namespaces. EVEN the worst case scenario
 > says: "If the ANATT time interval expires, then the host should use a
 > different controller for sending commands to the namespaces in that
 > ANA Group." It's still about commands sent to namespaces. NOWHERE
 > does that text suggest a reset. If an ANATT timeout occurs - it says
 > pick a different path for sending commands to the namespaces in that
 > group (which is obviously a no-op when the group has no namespaces).

Why exactly the ANATT timer's callback (nvme_anatt_timeout) opts for a 
reset is unclear to me from the commit description, to be honest.
The rest matches my observations too.
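
For context, the callback in question looks roughly like this in recent 
mainline (quoted from memory of drivers/nvme/host/multipath.c, so treat 
it as a sketch rather than an authoritative copy): on expiration it 
simply logs and schedules a controller reset.

    static void nvme_anatt_timeout(struct timer_list *t)
    {
            struct nvme_ctrl *ctrl = from_timer(ctrl, t, anatt_timer);

            dev_info(ctrl->device, "ANATT timeout, resetting controller.\n");
            nvme_reset_ctrl(ctrl);
    }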


 > So if the timer is not started (because there are 0 namespaces
 > attached) - and a namespace does come along (added to an ANA group
 > that is still in the CHANGE state), would the timer start when the
 > first command is sent to that namespace (and it fails with the
 > Asymmetric Access Transition)? That seems fine.

This is precisely what I'm aiming at with my in-progress patch; in this 
thread I just wanted to discuss its sanity prior to publishing, describe 
the situation that causes the problem in the first place, and gather 
other ideas along the way.
I'll double-check, but as far as I remember it worked fine with the patch.
So, just to be sure, you do agree then with my proposal that there's no 
point in starting the timer before at least one namespace becomes a 
member of such a group?

Your knowledge is much appreciated!


Best regards,
Alex



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-11  9:21             ` Alex Talker
  2022-02-11 16:58               ` Knight, Frederick
@ 2022-02-14 13:16               ` Hannes Reinecke
  2022-02-14 16:16                 ` Knight, Frederick
  1 sibling, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2022-02-14 13:16 UTC (permalink / raw)
  To: Alex Talker, linux-nvme, Sagi Grimberg, Knight, Frederick

On 2/11/22 10:21, Alex Talker wrote:
>  > [FK> ] I think I'm missing a bunch of context here. What is the
>  > original question? I take a stab at some assumptions: What is an
>  > empty ANA group? That is an ANA Group with NO NSIDs associated with
>  > that group. Meaning the "Number of NSID Values" field is cleared to
>  > '0h' in the ANA Group Descriptor. That descriptor can be used to
>  > update some host internal state information related to that ANA
>  > group, but it has no impact on any I/O because there can be no I/O
>  > (since there are no NSID values). So I'm not sure where that is
>  > going (because RGO=1 also can return ANA Groups that have state, but
>  > no attached namespaces (it's a way to get group state without any
>  > NSID inventory requirements)).
> 
> That's exactly right, "nnsids=0" case. I/O is not a problem for such a 
> group, for sure.
> I suppose the main argument we're having here is that when such a group 
> has a "change" ANA state, the host("nvme-core" module) starts a timer
> for ANATT which upon expiration resets the controller.
> Now, I do not disagree that having such a group is "ugly" but rather 
> argue that ANATT-related functionality could be only invoked for 
> "nnsids>0" case, since only then there's a relation between "change"
> state and a namespace via "ANAGRPID".
> 
Ah. So now I get where you are coming from.

The problem seems to be around a static ANA group with status 'change'.
The whole idea behind the 'change' ANA status (and ANA groups in 
general) was that ANA groups could _change_ the ANA status, and such a 
change would affect all namespaces in that group.
From that angle, having a 'change' state makes sense, as one could use 
it to signal "hey, I'm busy transitioning to another state" to the host.

But when using _static_ ANA groups the whole concept becomes shaky.
While it's perfectly okay to have static ANA groups with 'normal' states 
(ie excluding the 'change' state), things become iffy if you have a 
static 'change' state.
Thing is, having a 'change' state indicates to the host that a 
controller will need time to transition between states. And this 
transitioning out of necessity will always be between 'normal' ANA 
states (ie excluding the 'change' state).
The spec doesn't allow the controller to specify "I need time to 
transition _to_ the 'change' state"; the transition to and from 'change' 
is always assumed to be atomic.
(Because the 'change' state is used to signal precisely that condition.)

Exec summary: having static ANA groups is okay; having a static ANA 
group in the 'change' state is questionable and not something I would recommend.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-14 13:16               ` Hannes Reinecke
@ 2022-02-14 16:16                 ` Knight, Frederick
  2022-02-22  9:04                   ` Alex Talker
  0 siblings, 1 reply; 15+ messages in thread
From: Knight, Frederick @ 2022-02-14 16:16 UTC (permalink / raw)
  To: Hannes Reinecke, Alex Talker, linux-nvme, Sagi Grimberg

> 
> On 2/11/22 10:21, Alex Talker wrote:
> >  > [FK> ] I think I'm missing a bunch of context here. What is the  >
> > original question? I take a stab at some assumptions: What is an  >
> > empty ANA group? That is an ANA Group with NO NSIDs associated with  >
> > that group. Meaning the "Number of NSID Values" field is cleared to  >
> > '0h' in the ANA Group Descriptor. That descriptor can be used to  >
> > update some host internal state information related to that ANA  >
> > group, but it has no impact on any I/O because there can be no I/O  >
> > (since there are no NSID values). So I'm not sure where that is  >
> > going (because RGO=1 also can return ANA Groups that have state, but
> > > no attached namespaces (it's a way to get group state without any  >
> > NSID inventory requirements)).
> >
> > That's exactly right, "nnsids=0" case. I/O is not a problem for such a
> > group, for sure.
> > I suppose the main argument we're having here is that when such a
> > group has a "change" ANA state, the host("nvme-core" module) starts a
> > timer for ANATT which upon expiration resets the controller.
> > Now, I do not disagree that having such a group is "ugly" but rather
> > argue that ANATT-related functionality could be only invoked for
> > "nnsids>0" case, since only then there's a relation between "change"
> > state and a namespace via "ANAGRPID".
> >
> Ah. So now I get where you are coming from.
> 
> The problem seems to be around a static ANA group with status 'change'.
> The whole idea behind the 'change' ANA status (and ANA groups in
> general) was that ANA groups could _change_ the ANA status, and such a
> change would affect all namespaces in that group.
>  From that angle having a 'change' state makes sense, as one could use them
> to signal "hey, I'm busy transitioning to another state" to the host.
> 
> But when using _static_ ANA groups the whole concept become shaky.
> While it's perfectly okay to have static ANA groups with 'normal' states (ie
> excluding the 'change' state), things become iffy if you have a static 'change'
> state.
> Thing is, having a 'change' state indicates to the host that a controller will
> need time to transition between states. And this transitioning out of
> necessity will always be between 'normal' ANA states (ie excluding the
> 'change' state).
> The spec doesn't allow the controller to specify "I need time to transition
> _to_ the 'change' state"; the transition to and from 'change'
> is always assumed to be atomic.
> (Because the 'change' state is used to signal precisely that condition.)
> 
> Exec summary: having static ANA groups is okay, having a static ANA group in
> 'change' state questionable and not something I would recommend.

[FK> ]  That is the point of ANATT.  If you get a status code of Asymmetric Access Transition on any I/O,
then all namespaces in the ANA Group that contains that namespace are in CHANGE state.  All of those
namespaces (the ones in the same ANA Group as the namespace that returned a status code of Asymmetric
Access Transition) are allowed to stay in that state for ANATT seconds.  So YES, that ANA Group may
stay in the CHANGE state (but only for ANATT seconds).  If it stays in the CHANGE state for longer than
ANATT seconds, then there is a problem.  The definition of ANATT = the controller will NOT stay in the
ANA Change state longer than this amount of time.

> 
> Cheers,
> 
> Hannes
> --
> Dr. Hannes Reinecke                        Kernel Storage Architect
> hare@suse.de                                      +49 911 74053 688
> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
  2022-02-14 16:16                 ` Knight, Frederick
@ 2022-02-22  9:04                   ` Alex Talker
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Talker @ 2022-02-22  9:04 UTC (permalink / raw)
  To: Knight, Frederick, Hannes Reinecke, linux-nvme, Sagi Grimberg

Apologies for the long delay in reply, it has been a rough week :)

Okay, I see your point. I think this discussion was quite helpful for me,
so I submitted patches to parametrize the maximum number of ANA groups 
(and also namespaces),
signed off under something that isn't an alias, of course (as if one 
couldn't tell from my rambling that I do fit the pseudonym, hehe).

I checked both changes on mainline and they worked fine; I could allocate 
1024 namespaces and assign each one a separate ANA group without much of 
an issue,
but I noticed that disconnecting from such a large configuration was 
perhaps somewhat buggy, so this might be something to investigate in the 
future.

In any case, thanks for your time and expertise!
I hope that you all have a nice day and, of course, accept my patches,
so that they can spread out and remove the need for manual code patching 
to fix this issue :)

Best regards,
Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-02-22  9:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-06 13:59 NVMe over Fabrics host: behavior on presence of ANA Group in "change" state Alex Talker
2022-02-07  9:46 ` Hannes Reinecke
2022-02-07 15:04   ` Alex Talker
2022-02-07 11:07 ` Sagi Grimberg
2022-02-07 15:04   ` Alex Talker
2022-02-07 22:16     ` Sagi Grimberg
2022-02-10 13:46       ` Alex Talker
2022-02-10 22:24         ` Sagi Grimberg
2022-02-11  2:08           ` Knight, Frederick
2022-02-11  9:21             ` Alex Talker
2022-02-11 16:58               ` Knight, Frederick
2022-02-11 20:53                 ` Alex Talker
2022-02-14 13:16               ` Hannes Reinecke
2022-02-14 16:16                 ` Knight, Frederick
2022-02-22  9:04                   ` Alex Talker
