From mboxrd@z Thu Jan 1 00:00:00 1970
From: kbusch@kernel.org (Keith Busch)
Date: Wed, 31 Jul 2019 14:58:37 -0600
Subject: [PATCH rfc 1/2] nvme: don't remove namespace if revalidate failed because of controller reset
In-Reply-To:
References: <8bd6d219-f4fd-de58-a341-257c6274eddd@grimberg.me>
 <2825eb74-1df5-5dd2-3e90-c696bc7fa3d1@grimberg.me>
 <20190730173048.GC13948@localhost.localdomain>
 <61445d6f-f4ca-f8d4-cef2-5bfe40aa1e7f@suse.de>
 <2f7535ab-3d45-b24d-1512-a937e16e620f@grimberg.me>
 <20190731193257.GB15643@localhost.localdomain>
 <0720636c-8706-e927-3c0b-c2687694664f@grimberg.me>
 <20190731201634.GC15643@localhost.localdomain>
Message-ID: <20190731205836.GD15643@localhost.localdomain>

On Wed, Jul 31, 2019 at 01:45:12PM -0700, Sagi Grimberg wrote:
>
> > > > > I think I asked this but was not answered, why are we removing
> > > > > the namespace at all? do others do the same thing (remove the
> > > > > disk if revalidation fails)?
> > > >
> > > > If a namespace no longer exists,
> > >
> > > Why is it no longer exists? it failed revalidate..
> >
> > One way it fails to validate is if it doesn't exist, i.e., the
> > controller returned an error when attempting to identify it.
> >
> > The other way it may fail to revalidate is if its identify has changed
> > since we last discovered it, so removal is better than data corruption.
>
> Well, perhaps we can mark failures resulting from reset with a transport
> error.
>
> For example, nvme_cancel_request is setting:NVME_SC_ABORT_REQ, perhaps
> we can modify nvme_error_status to set that into BLK_STS_TRANSPORT and
> check for that as the return code for revalidate_disk?
>
> Thoughts?

Would it be sufficient to let these admin commands requeue? Instead of
flushing the scan work, we can let it block for IO on a reset, and the
IO will resume when the reset completes.
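
For illustration only, here is a minimal user-space sketch of the quoted
suggestion: translate an aborted (cancelled) command into a transport-class
status and keep the namespace in that case. It is not kernel code. Every
name prefixed with model_ or MODEL_ is a hypothetical stand-in, the enum
values are arbitrary, and the real nvme_error_status() and revalidate_disk()
paths are not reproduced. The requeue alternative raised in the reply is not
modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NVMe completion statuses (values arbitrary). */
enum model_nvme_status {
	MODEL_SC_SUCCESS,
	MODEL_SC_INVALID_NS,	/* controller reports the namespace is gone */
	MODEL_SC_ABORT_REQ,	/* command cancelled, e.g. by a controller reset */
};

/* Hypothetical stand-ins for block-layer statuses. */
enum model_blk_status {
	MODEL_BLK_STS_OK,
	MODEL_BLK_STS_IOERR,
	MODEL_BLK_STS_TRANSPORT,
};

enum model_ns_action { MODEL_NS_KEEP, MODEL_NS_REMOVE };

/* Assumed mapping: an aborted request becomes a transport error. */
static enum model_blk_status model_error_status(enum model_nvme_status sc)
{
	switch (sc) {
	case MODEL_SC_SUCCESS:
		return MODEL_BLK_STS_OK;
	case MODEL_SC_ABORT_REQ:
		return MODEL_BLK_STS_TRANSPORT;
	default:
		return MODEL_BLK_STS_IOERR;
	}
}

/*
 * Decide what to do with a namespace after a revalidate attempt.
 * A transport error says nothing about the namespace itself, so keep it.
 * Any other failure, or a changed identity, leads to removal.
 */
static enum model_ns_action
model_after_revalidate(enum model_nvme_status sc, bool identity_changed)
{
	enum model_blk_status st = model_error_status(sc);

	if (st == MODEL_BLK_STS_TRANSPORT)
		return MODEL_NS_KEEP;
	if (st != MODEL_BLK_STS_OK || identity_changed)
		return MODEL_NS_REMOVE;
	return MODEL_NS_KEEP;
}

/* A few checks showing the behaviour the sketch is meant to have. */
void model_selfcheck(void)
{
	assert(model_after_revalidate(MODEL_SC_ABORT_REQ, false) == MODEL_NS_KEEP);
	assert(model_after_revalidate(MODEL_SC_INVALID_NS, false) == MODEL_NS_REMOVE);
	assert(model_after_revalidate(MODEL_SC_SUCCESS, true) == MODEL_NS_REMOVE);
	assert(model_after_revalidate(MODEL_SC_SUCCESS, false) == MODEL_NS_KEEP);
}
```

The sketch checks the transport case first so that a reset-induced failure
can never fall through to the removal branch. Whether a changed identity
could be detected at all in that case is left open.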