Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

From: Vishal Verma <vishal.l.verma@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Date: Tue, 17 Jan 2017 15:14:21 -0700
Message-ID: <20170117221421.GC4880@omniknight.lm.intel.com>
In-Reply-To: <20170117143703.GP2517@quack2.suse.cz>

On 01/17, Jan Kara wrote:
> On Mon 16-01-17 02:27:52, Slava Dubeyko wrote:
> > 
> > -----Original Message-----
> > From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
> > Sent: Friday, January 13, 2017 4:49 PM
> > To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
> > Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
> > Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > 
> > <skipped>
> > 
> > > We don't have direct physical access to the device's address space, in the sense
> > > the device is still free to perform remapping of chunks of NVM underneath us.
> > > The problem is that when a block or address range (as small as a cache line) goes bad,
> > > the device maintains a poison bit for every affected cache line. Behind the scenes,
> > > it may have already remapped the range, but the cache line poison has to be kept so that
> > > there is a notification to the user/owner of the data that something has been lost.
> > > Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> > > cache line results in memory errors and SIGBUSes.
> > > Compared to traditional storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > > The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> > > to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> > > the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> > >
> > > This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> > > on every IO to a smaller range, which only the filesystem can do.
> > 
> > I still slightly puzzled and I cannot understand why the situation looks
> > like a dead end.  As far as I can see, first of all, a NVM device is able
> > to use hardware-based LDPC, Reed-Solomon error correction or any other
> > fancy code. It could provide some error correction basis. Also it can
> > provide the way of estimation of BER value. So, if a NVM memory's address
> > range degrades gradually (during weeks or months) then, practically, it's
> > possible to remap and to migrate the affected address ranges in the
> > background. Otherwise, if a NVM memory so unreliable that address range
> > is able to degrade during seconds or minutes then who will use such NVM
> > memory?
> 
> Well, the situation with NVM is more like with DRAM AFAIU. It is quite
> reliable but given the size the probability *some* cell has degraded is
> quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
> when you try to read such cell. As Vishal wrote, the hardware does some
> background scrubbing and relocates stuff early if needed but nothing is 100%.
> 
> The reason why we play games with badblocks is to avoid those MCEs (i.e.,
> even trying to read the data we know that are bad). Even if it would be
> rare event, MCE may mean the machine just immediately reboots (although I
> find such platforms hardly usable with NVM then) and that is no good. And
> even on hardware platforms that allow for more graceful recovery from MCE
> it is asynchronous in its nature and our error handling
> around IO is all synchronous so it is difficult to join these two models
> together.
> 
> But I think it is a good question to ask whether we cannot improve on MCE
> handling instead of trying to avoid them and pushing around responsibility
> for handling bad blocks. Actually I thought someone was working on that.
> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
> well identified anyway so that we can consult the badblocks list) so that
> if MCE happens during these accesses, we note it somewhere and at the end
> of the magic block we will just pick up the errors and report them back?

Yes that is an interesting topic, how/if we can improve MCE handling
from a storage point of view. Traditionally it has been designed for the
memory use case, and what we have so far is adaptation of it for the
pmem/storage uses.

> 
> > OK. Let's imagine that NVM memory device hasn't any internal error
> > correction hardware-based scheme. Next level of defense could be any
> > erasure coding scheme on device driver level. So, any piece of data can
> > be protected by parities. And device driver will be responsible for
> > management of erasure coding scheme. It will increase latency of read
> > operation for the case of necessity to recover the affected memory page.
> > But, finally, all recovering activity will be behind the scene and file
> > system will be unaware about such recovering activity.
> 
> Note that your options are limited by the byte addressability and the
> direct CPU access to the memory. But even with these limitations it is not
> that error rate would be unusually high, it is just not zero.
>  
> > If you are going not to provide any erasure coding or error correction
> > scheme then it's really bad case. The fsck tool is not regular case tool
> > but the last resort. If you are going to rely on the fsck tool then
> > simply forget about using your hardware. Some file systems haven't the
> > fsck tool at all. Some guys really believe that file system has to work
> > without support of the fsck tool.  Even if a mature file system has
> > reliable fsck tool then the probability of file system recovering is very
> > low in the case of serious metadata corruptions. So, it means that you
> > are trying to suggest the technique when we will lose the whole file
> > system volumes on regular basis without any hope to recover data. Even if
> > file system has snapshots then, again, we haven't hope because we can
> > suffer from read error and for operation with snapshot.
> 
> I hope I have cleared out that this is not about higher error rate of
> persistent memory above. As a side note, XFS guys are working on automatic
> background scrubbing and online filesystem checking. Not specifically for
> persistent memory but simply because with growing size of the filesystem
> the likelihood of some problem somewhere is growing.

Your note on the online repair does raise another tangentially related
topic. Currently, if there are badblocks, writes via the bio submission
path will clear the error (if the hardware is able to remap the bad
locations). However, if the filesystem is mounted with DAX, even
non-mmap operations - read() and write() will go through the dax paths
(dax_do_io()). We haven't found a good/agreeable way to perform
error-clearing in this case. So currently, if a dax mounted filesystem
has badblocks, the only way to clear those badblocks is to mount it
without DAX, and overwrite/zero the bad locations. This is a pretty
terrible user experience, and I'm hoping this can be solved in a better
way.
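
To make the current behaviour concrete, here is a rough user-space model
of it. This is only a sketch: BadBlocks and ToyPmemDevice are made-up
names for illustration, not the kernel's badblocks API, and the model
assumes the hardware always manages to remap on write. Reads that touch
a known-bad range are short-circuited with EIO, and a write through the
driver path clears the range it covers.

```python
import errno


class BadBlocks:
    """Toy list of known-bad sector ranges (hypothetical, not kernel code)."""

    def __init__(self):
        # Sorted, non-overlapping (start, end) pairs, end exclusive.
        self.ranges = []

    def add(self, start, count):
        """Record sectors [start, start + count) as bad, merging overlaps."""
        new_start, new_end = start, start + count
        merged = []
        for s, e in self.ranges:
            if e < new_start or s > new_end:
                merged.append((s, e))
            else:
                new_start, new_end = min(s, new_start), max(e, new_end)
        merged.append((new_start, new_end))
        self.ranges = sorted(merged)

    def overlaps(self, start, count):
        """True if [start, start + count) touches any known-bad sector."""
        end = start + count
        return any(s < end and start < e for s, e in self.ranges)

    def clear(self, start, count):
        """Drop [start, start + count) from the list, splitting ranges."""
        end = start + count
        kept = []
        for s, e in self.ranges:
            if e <= start or s >= end:
                kept.append((s, e))
                continue
            if s < start:
                kept.append((s, start))
            if e > end:
                kept.append((end, e))
        self.ranges = kept


class ToyPmemDevice:
    """Models only the driver (bio submission) path."""

    def __init__(self, nsectors):
        self.data = [0] * nsectors
        self.bb = BadBlocks()

    def read(self, start, count):
        # Short-circuit reads of known-bad sectors instead of touching
        # the media (which would poison-fault on real hardware).
        if self.bb.overlaps(start, count):
            raise OSError(errno.EIO, "read touches a known bad block")
        return self.data[start:start + count]

    def write(self, start, values):
        # Assumption: the write succeeds and the device remaps, so the
        # error for the written sectors can be cleared.
        self.data[start:start + len(values)] = values
        self.bb.clear(start, len(values))
```

The DAX path has no equivalent of write() above, which is why the only
way out today is to remount without DAX and overwrite the bad sectors.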

If the filesystem is 'badblocks-aware', perhaps it can redirect dax_io
to happen via the driver (bio submission) path for files/ranges with
known errors. This removes the ability to do free-form, unaligned IO
in the DAX path, but we gain a way to actually repair (online) a dax
filesystem, which currently doesn't exist.
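
A minimal sketch of what that redirection decision might look like,
again as a user-space model rather than real kernel code. route_io, the
"dax"/"bio" labels and the 512-byte SECTOR constant are all assumptions
made for illustration. IO that does not touch a known-bad range stays
on the DAX path unchanged; IO that does is widened to sector boundaries
and sent down the driver path, which is where the free-form, unaligned
access is lost.

```python
SECTOR = 512  # assumed sector size for the driver (bio) path


def route_io(offset, length, bad_ranges):
    """Pick a path for the byte range [offset, offset + length).

    bad_ranges is a list of (start, end) byte ranges, end exclusive,
    that the filesystem knows to contain errors (hypothetical input).
    Returns (path, offset, length) with path being "dax" or "bio".
    """
    end = offset + length
    hits_bad = any(s < end and offset < e for s, e in bad_ranges)
    if not hits_bad:
        # No known errors: keep the byte-granular DAX access as is.
        return ("dax", offset, length)
    # Known errors: go via the driver path, rounded out to whole sectors.
    aligned_start = offset - offset % SECTOR
    aligned_end = -(-end // SECTOR) * SECTOR  # round up to next sector
    return ("bio", aligned_start, aligned_end - aligned_start)
```

For example, with a bad range at bytes 1024..1535, a 50-byte access at
offset 100 stays on DAX, while a 100-byte access at offset 1000 becomes
a two-sector bio starting at byte 512.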

>  
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR