Subject: NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
From: Alex Talker @ 2022-02-06 13:59 UTC
  To: linux-nvme

Recently I noticed a peculiar error after connecting from a host
(CentOS 8 Stream at the time, more on that below)
via TCP (which likely doesn't matter) to an NVMe target subsystem
exported via the nvmet module:

 > ...
 > nvme nvme1: ANATT timeout, resetting controller.
 > nvme nvme1: creating 8 I/O queues.
 > nvme nvme1: mapped 8/0/0 default/read/poll queues.
 > ...
 > nvme nvme1: ANATT timeout, resetting controller.
 > ...

(and it continues like that over and over; on some configurations it
even gets worse with each reconnect iteration)

I discovered that this behavior is caused by code in
drivers/nvme/host/multipath.c: the function nvme_update_ana_state()
increments the variable nr_change_groups whenever any ANA Group is in
the "change" state, regardless of whether any namespace belongs to
that group or not.
After figuring out that ANATT stands for ANA Transition Time and
reading some more of the NVMe 2.0 standard, I understood that the
problem is caused by how I utilize ANA Groups.
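
For reference, the relevant pieces of drivers/nvme/host/multipath.c
(as of v5.15 or thereabouts, trimmed by me; elisions marked in
comments) look like this: any group in "change" re-arms the ANATT
timer, and the timer's expiry produces exactly the reset loop quoted
above:

static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
		struct nvme_ana_group_desc *desc, void *data)
{
	u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0;
	unsigned *nr_change_groups = data;
	struct nvme_ns *ns;

	/* counted even when the group has zero namespaces (nnsids == 0) */
	if (desc->state == NVME_ANA_CHANGE)
		(*nr_change_groups)++;

	if (!nr_nsids)
		return 0;
	/* ... per-namespace state updates elided ... */
}

static int nvme_read_ana_log(struct nvme_ctrl *ctrl)
{
	u32 nr_change_groups = 0;

	/* ... fetches the ANA log and calls nvme_update_ana_state()
	 * for every group descriptor ... */

	/* any group in "change" keeps the ANATT timer armed */
	if (nr_change_groups)
		mod_timer(&ctrl->anatt_timer, ctrl->anatt * HZ * 2 + jiffies);
	else
		del_timer_sync(&ctrl->anatt_timer);
	/* ... */
}

static void nvme_anatt_timeout(struct timer_list *t)
{
	struct nvme_ctrl *ctrl = from_timer(ctrl, t, anatt_timer);

	dev_info(ctrl->device, "ANATT timeout, resetting controller.\n");
	nvme_reset_ctrl(ctrl);
}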

As far as I remember, the permitted number of ANA Groups in the nvmet
module is 128, while the maximum number of namespaces is 1024 (8 times
more).
Thus, mapping one namespace to one ANA Group works only up to a point.
It is nice to have logically-related namespaces belong to the same
ANA Group, and the final scheme of how namespaces map to ANA Groups is
often vendor-specific
(or rather lies in the decision domain of the end user of the target).
However, rather than changing the state of an ANA Group on a specific
port (which affects every namespace in it, and is useful e.g. for
maintenance), I find it particularly useful to utilize ANA Groups to
change the state of an individual namespace, since it is more likely
that a single block device enters an unusable state or becomes part of
some transition.
Thus, the simplest scheme for me is to create a few ANA Groups on each
port, one per possible ANA state, and move a namespace between groups
rather than change the state of the group it currently belongs to; see
the illustration below.
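
To illustrate (the group numbers and state assignments are just my
convention, nothing mandates them), each port ends up with a layout
roughly like:

	ana_groups/1 -> "optimized"      (namespaces serving I/O normally)
	ana_groups/2 -> "inaccessible"   (namespaces taken down, e.g. for maintenance)
	ana_groups/3 -> "change"         (namespaces currently in transition)

and a namespace switches state by rewriting its ana_grpid attribute,
while the groups' states themselves stay fixed.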
And here's the catch.

If one creates a subsystem (no namespaces needed) on a port, connects
to it, and then sets the state of ANA Group #1 to "change" (echo
change > ports/<id>/ana_groups/1/ana_state in nvmet's configfs), the
issue described in the beginning reproduces reliably on practically
all major distros and even on upstream code.
Sometimes it can be mitigated by disabling native multipath (setting
/sys/module/nvme_core/parameters/multipath to N), but sometimes even
that does not help, which is why this issue is quite annoying for my
setup.
I just checked it on 5.15.16 from Manjaro (basically Arch Linux) and
on ELRepo's kernel-ml and kernel-lt (basically vanilla builds of the
mainline and LTS kernels, respectively, for CentOS).

The standard says that:

 > An ANA Group may contain zero or more namespaces

which makes perfect sense, since one has to create a group prior to 
assigning it to a namespace, and then:

 > While ANA Change state is reported by a controller for the namespace,
 > the host should: ... (the part regarding ANATT)

So on one hand, I think my setup might be questionable (I could
allocate an ANAGRPID for "change" only during actual transitions,
though that might over-complicate usage of the module);
on the other hand, I think this is a misinterpretation of the standard
that might need some additional clarification.

That's why I decided to compose this message first prior to proposing 
any patches.

Also, while digging through the code, I noticed that ANATT is
currently reported by the target as an arbitrary constant (10
seconds), and since the actual transition time often differs depending
on the block devices in use underneath the namespaces,
it might be viable to allow the end user to change this value via
configfs.
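
For context, the constant lives in nvmet_execute_identify_ctrl() in
drivers/nvme/target/admin-cmd.c:

	id->anatt = 10; /* random value */

A per-subsystem configfs attribute could look roughly like the sketch
below. To be clear, this is hypothetical: it assumes a new anatt field
in struct nvmet_subsys and omits plumbing the value into the identify
data; I haven't tested it.

/* Hypothetical sketch: assumes a new "anatt" field in struct nvmet_subsys. */
static ssize_t nvmet_subsys_attr_anatt_show(struct config_item *item,
		char *page)
{
	return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->anatt);
}

static ssize_t nvmet_subsys_attr_anatt_store(struct config_item *item,
		const char *page, size_t cnt)
{
	struct nvmet_subsys *subsys = to_subsys(item);
	u8 anatt;

	/* ANATT is a one-byte field in the Identify Controller data */
	if (kstrtou8(page, 0, &anatt) || !anatt)
		return -EINVAL;

	down_write(&nvmet_config_sem);
	subsys->anatt = anatt;
	up_write(&nvmet_config_sem);
	return cnt;
}
CONFIGFS_ATTR(nvmet_subsys_, attr_anatt);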

Considering everything I wrote, I'd like to hear opinions on the
following questions:
1. Is my utilization of ANA Groups a viable approach?
2. Which ANA Group assignment schemes are used in production, in your
experience?
3. Should changing the ANATT value be allowed via configfs (in
particular, on a per-subsystem level, I think)?

Thanks for reading till the very end! I hope I didn't ramble too much;
I just wanted to lay out all of the details.

Best regards,
Alex


