Re: [PATCH v2] nvmet: force reconnect when number of queue changes

From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>, Daniel Wagner <dwagner@suse.de>,
	"Knight, Frederick" <Frederick.Knight@netapp.com>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Subject: Re: [PATCH v2] nvmet: force reconnect when number of queue changes
Date: Wed, 28 Sep 2022 17:21:09 +0200
Message-ID: <ff83b414-e268-e5e2-3e2f-c6d33cd0cc5d@suse.de>
In-Reply-To: <b121c092-cecc-b891-be3c-b32c2a3e611d@grimberg.me>

On 9/28/22 16:23, Sagi Grimberg wrote:
> 
>>> Would this be a case for a new AEN - controller configuration changed?
>>> I'm also wondering exactly what changed in the controller?  It can't
>>> be the "Number of Queues" feature (that can't change - The controller
>>> shall not change the value allocated between resets.).  Is it the MQES
>>> field in the CAP property that changes (queue size)?
>>>
>>> We already have change reporting for: Namespace attribute, Predictable
>>> Latency, LBA status, EG aggregate, Zone descriptor, Discovery Log,
>>> Reservations. We've questioned whether we need a Controller Attribute
>>> Changed.
>>>
>>> Would this be a case for an exception?  Does the DNR bit apply only to
>>> commands sent on queues that already exist (i.e., NOT the connect
>>> command since that command is actually creating the queue)?  FWIW - I
>>> don't like exceptions.
>>>
>>> Can you elaborate on exactly what is changing?
>>
>> The background story is that we have a setup with two targets (*) and
>> the host is connected to both of them (HA setup). Both servers run the
>> same software version. The host is told via Number of Queues (Feature
>> Identifier 07h) how many queues are supported (N queues).
>>
>> Now, a software upgrade is started which takes first one server offline,
>> updates it and brings it online again. Then the same process with the
>> second server.
>>
>> In the meantime the host tries to reconnect. Eventually, the reconnect
>> is successful but the Number of Queues (Feature Identifier 07h) has
>> changed to a smaller value, e.g. N-2 queues.
>>
>> My test case here is trying to replicate this scenario but with just
>> one target. Hence the problem of how to notify the host that it should
>> reconnect. As you mentioned, this is not supposed to change as long as
>> a connection is established.
>>
>> My understanding is that the current nvme target implementation in Linux
>> doesn't really support this HA setup scenario hence my attempt to get it
>> flying with one target. The DNR bit comes into play because I was toying
>> with removing the subsystem from the port, changing the number of queues
>> and re-adding the subsystem to the port.
>>
>> This resulted in any request posted from the host seeing the DNR
>> bit. The second attempt here was to delete the controller to force a
>> reconnect. I agree it's also not really the right thing to do.
>>
>> As far as I can tell, what is missing from a testing point of view is the
>> ability to fail requests without the DNR bit set or the ability to tell
>> the host to reconnect. Obviously, an AEN would be nice for this but I
>> don't know if this is reason enough to extend the spec.
> 
> Looking into the code, it's the connect that fails on invalid parameter
> with a DNR, because the host is attempting to connect to a subsystem
> that does not exist on the port (because it was taken offline for
> maintenance reasons).
> 
> So I guess it is valid to allow queue change without removing it from
> the port, but that does not change the fundamental question on DNR.
> If the host sees a DNR error on connect, my interpretation is that the
> host should not retry the connect command itself, but it shouldn't imply
> anything on tearing down the controller and giving up on it completely,
> forever.

Well. And now we're in the nitty gritty details.
Problem is that 'connect' is tied with the whole 'create association' 
stuff from the transport.
I can't really imagine what would be the results of a successful 
association creation, a 'connect' failing with DNR bit set, and the 
association _not_ being torn down.
In the end, the association is only valid for this specific 'connect' 
command, and as that failed we have to tear down the association.
Am I correct so far?
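
To make that reading concrete, here is a rough sketch of the decision as
I understand it. This is not code from the driver: the enum and the
function name are made up for illustration, and only the two status
values are meant to match include/linux/nvme.h (status field with the
phase bit already shifted out). It assumes that any failed 'connect'
tears down the association, and that DNR only says whether that same
command may be retried.

```c
#include <stdint.h>

/* Values as defined in include/linux/nvme.h (phase bit shifted out). */
#define NVME_SC_CONNECT_INVALID_PARAM	0x182
#define NVME_SC_DNR			0x4000

/* Hypothetical outcome type, for illustration only. */
enum connect_outcome {
	ASSOC_KEEP,			/* connect succeeded, association stays */
	ASSOC_TEARDOWN_MAY_RETRY,	/* connect failed, retry is allowed */
	ASSOC_TEARDOWN_NO_CMD_RETRY,	/* connect failed with DNR set */
};

/*
 * Hypothetical helper: map the status of a 'connect' command to what
 * happens to the association that was created for it.
 */
enum connect_outcome connect_outcome(uint16_t status)
{
	if (status == 0)
		return ASSOC_KEEP;

	/* The association only exists for this connect, so it goes away. */
	if (status & NVME_SC_DNR)
		return ASSOC_TEARDOWN_NO_CMD_RETRY;

	return ASSOC_TEARDOWN_MAY_RETRY;
}
```

Whether ASSOC_TEARDOWN_NO_CMD_RETRY should also stop the host from ever
creating a new association is exactly the open question; the sketch
deliberately does not answer it.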

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman