From: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
To: "vishal.l.verma@intel.com" <vishal.l.verma@intel.com>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Date: Sat, 14 Jan 2017 00:00:45 +0000	[thread overview]
Message-ID: <SN2PR04MB2191756EABCB0E9DAA3B5328887B0@SN2PR04MB2191.namprd04.prod.outlook.com> (raw)
In-Reply-To: <b9rbflutjt10mb4ofherta8j.1484345610771@email.android.com>


---- Original Message ----
Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Sent: Jan 13, 2017 1:40 PM
From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org

> The current implementation of badblocks, where we consult the badblocks
> list for every IO in the block driver works, and is a last option
> failsafe, but from a user perspective, it isn't the easiest interface to
> work with.
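
For reference, the per-IO check described above looks roughly like the
sketch below. This is only a simplified, hypothetical illustration
(made-up names and types, not the actual block/badblocks.c interface),
but it shows why every request pays for a device-wide lookup:

/*
 * Hypothetical sketch of the per-IO badblocks check -- not the real
 * kernel interface, just the shape of it.
 */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct bad_range {              /* one known-bad extent, in 512-byte sectors */
	sector_t start;
	sector_t len;
};

struct badblocks {              /* whole-device table, kept sorted by start */
	struct bad_range *ranges;
	size_t count;
};

/* Return true if [sector, sector + nr) touches any known bad range. */
static bool badblocks_overlaps(const struct badblocks *bb,
			       sector_t sector, sector_t nr)
{
	for (size_t i = 0; i < bb->count; i++) {
		const struct bad_range *r = &bb->ranges[i];

		if (r->start >= sector + nr)
			break;                  /* sorted: nothing later can hit */
		if (r->start + r->len > sector)
			return true;            /* ranges intersect */
	}
	return false;
}

/* Every IO pays this cost today, no matter which file it belongs to. */
static int submit_io(const struct badblocks *bb, sector_t sector, sector_t nr)
{
	if (badblocks_overlaps(bb, sector, nr))
		return -EIO;                    /* fail the request */
	/* ... otherwise hand the request to the device ... */
	return 0;
}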

As I remember, the FAT and HFS+ specifications describe a bad block
(physical sector) table. I believe that table was intended for floppy media,
and it eventually became a completely obsolete artefact because most storage
devices are reliable enough. Why do you need to expose bad blocks at the
file system level? Do you expect the next generation of NVM memory to be so
unreliable that the file system has to manage bad blocks? What about erasure
coding schemes? Does the file system really need to suffer from the bad
block issue?

Usually we work with LBAs, and it is the storage device's responsibility to
map a bad physical block/page/sector onto a valid one. Do you mean that we
have direct access to physical NVM memory addresses? It looks like we could
hit a "bad block" issue even when accessing data in a page cache memory page
(if NVM memory is used for the page cache, of course). So what exactly do
you mean by the "bad block" issue?

> 
> A while back, Dave Chinner had suggested a move towards smarter
> handling, and I posted initial RFC patches [1], but since then the topic
> hasn't really moved forward.
> 
> I'd like to propose and have a discussion about the following new
> functionality:
> 
> 1. Filesystems develop a native representation of badblocks. For
> example, in xfs, this would (presumably) be linked to the reverse
> mapping btree. The filesystem representation has the potential to be 
> more efficient than the block driver doing the check, as the fs can
> check the IO happening on a file against just that file's range. 

What do you mean by "file system can check the IO happening on a file"?
Do you mean a read or a write operation? And what about metadata?
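
My best guess at what is meant is something like the following sketch: map
the file offset to its physical extent and consult the bad block table only
for that extent, so that the error can be reported as (file, offset) rather
than (block-device, sector). All names here are hypothetical, not the
interface from the RFC:

/* Hypothetical, heavily simplified illustration of a per-file check. */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

struct extent {                 /* one file extent: logical offset -> sectors */
	uint64_t file_off;      /* byte offset within the file */
	uint64_t len;           /* extent length in bytes */
	sector_t phys_start;    /* starting physical sector (512-byte units) */
};

/* Stand-in for the device-wide bad block lookup. */
static bool device_range_is_bad(sector_t start, sector_t nr)
{
	static const sector_t bad_start = 2048, bad_len = 8;  /* example range */
	return start < bad_start + bad_len && bad_start < start + nr;
}

/*
 * Check a single read/write against the file's own mapping only.  Returns 0
 * for a clean range, or -EIO after reporting (file, offset) to the user.
 */
static int fs_check_file_io(const char *path, const struct extent *ext,
			    uint64_t off, uint64_t len)
{
	uint64_t rel = off - ext->file_off;        /* offset inside the extent */
	sector_t first = ext->phys_start + rel / 512;
	sector_t nr = (len + 511) / 512;

	if (device_range_is_bad(first, nr)) {
		fprintf(stderr, "%s: bad block at file offset %llu\n",
			path, (unsigned long long)off);
		return -EIO;
	}
	return 0;
}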

If we are talking about discovering a bad block during a read operation,
then rarely is a modern file system able to survive it, whether the bad
block holds metadata or user data. Let's imagine we have a really mature
file system driver: what does it mean to encounter a bad block? Failing to
read a logical block of some metadata (a bad block) means that we are unable
to extract some part of a metadata structure. From the file system driver's
point of view, the file system is corrupted; we need to stop file system
operations and, finally, check and recover the volume by means of an fsck
tool. If we find a bad block in some user file then, again, it is an issue.
Some file systems simply return an "unrecovered read error". Another one
could, theoretically, survive thanks to snapshots, for example. But either
way it ends up in a read-only mount state, and the user has to resolve the
trouble by hand.
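
To make that concrete, a read-side error handler has to pick between
outcomes like these (hypothetical, heavily simplified sketch):

/* Hypothetical sketch of the read-side dilemma described above. */
#include <errno.h>
#include <stdbool.h>

enum block_kind { FS_METADATA, FS_USER_DATA };

struct fs_state {
	bool read_only;         /* forced read-only mount after the error */
	bool needs_fsck;        /* metadata loss: volume must be checked  */
};

static int handle_read_bad_block(struct fs_state *fs, enum block_kind kind,
				 bool have_snapshot_copy)
{
	if (kind == FS_METADATA) {
		/* Part of a metadata structure is simply gone: treat the
		 * volume as corrupted, stop modifications, require fsck. */
		fs->read_only = true;
		fs->needs_fsck = true;
		return -EUCLEAN;
	}

	/* User data: either report an unrecovered read error ... */
	if (!have_snapshot_copy)
		return -EIO;

	/* ... or serve an older copy from a snapshot, but still drop to a
	 * read-only state so the user resolves the damage by hand. */
	fs->read_only = true;
	return 0;
}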

If we are talking about discovering a bad block during a write operation
then, again, we are in trouble. We usually use an asynchronous write/flush
model: first the consistent state of all metadata structures is prepared in
memory, and the flushes of metadata and user data can happen at different
times. So what should be done if we discover a bad block for any piece of
metadata or user data? Simply tracking bad blocks is not enough at all.
Let's consider user data first. If we cannot write some file's block
successfully, we have two options: (1) forget about this piece of data, or
(2) try to change the LBA associated with this piece of data. Re-allocating
the LBA of a discovered bad block (in the user data case) sounds like real
pain, because you need to rebuild the metadata that tracks the location of
that part of the file, and that sounds practically impossible for an LFS
file system, for example. If we have trouble flushing any part of the
metadata, it sounds like a complete disaster for any file system.
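
Option (2) would look roughly like the sketch below (hypothetical names,
stubbed helpers); the comment at the end is the part that makes it so
painful in practice:

/* Hypothetical sketch of remapping a failed user-data write to a new LBA. */
#include <errno.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct extent {
	uint64_t file_off;      /* byte offset within the file */
	uint64_t len;           /* extent length in bytes      */
	sector_t phys_start;    /* current on-disk location    */
};

/* Assumed helpers, stubbed out for the sketch. */
static int  write_sectors(sector_t where, const void *buf, uint64_t len)
{ (void)where; (void)buf; (void)len; return 0; }
static int  alloc_sectors(uint64_t len, sector_t *out)
{ (void)len; *out = 4096; return 0; }
static void mark_bad(sector_t start, uint64_t len)
{ (void)start; (void)len; }

static int remap_failed_write(struct extent *ext, const void *buf)
{
	sector_t new_start;
	int err;

	err = alloc_sectors(ext->len, &new_start);    /* find a healthy LBA */
	if (err)
		return err;                           /* nowhere to go: give up */

	err = write_sectors(new_start, buf, ext->len);
	if (err)
		return err;

	mark_bad(ext->phys_start, ext->len);          /* remember the bad range */
	ext->phys_start = new_start;                  /* repoint the mapping */

	/*
	 * The expensive part: every piece of metadata that recorded the old
	 * LBA (extent tree, checksums, journal or log entries) must now be
	 * rebuilt and flushed consistently -- which is exactly what makes
	 * this look impractical for an LFS-style file system.
	 */
	return 0;
}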

Are you really sure that the file system should handle the bad block issue?

> In contrast, today, the block driver checks against the whole block device
> range for every IO. On encountering badblocks, the filesystem can
> generate a better notification/error message that points the user to 
> (file, offset) as opposed to the block driver, which can only provide
> (block-device, sector).
>
> 2. The block layer adds a notifier to badblock addition/removal
> operations, which the filesystem subscribes to, and uses to maintain its
> badblocks accounting. (This part is implemented as a proof of concept in
> the RFC mentioned above [1]).
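
As far as I understand, the proposed subscription would look something like
the sketch below. The callback type and registration here are hypothetical,
not the interface from the RFC:

/* Hypothetical sketch of a badblocks add/remove notification chain. */
#include <stdint.h>

typedef uint64_t sector_t;

enum bb_event { BB_ADDED, BB_REMOVED };

/* Callback invoked by the block layer on every badblocks update. */
typedef void (*bb_notify_fn)(enum bb_event ev, sector_t start, sector_t len,
			     void *fs_private);

struct bb_subscriber {
	bb_notify_fn fn;
	void *fs_private;                    /* e.g. the superblock */
	struct bb_subscriber *next;
};

static struct bb_subscriber *subscribers;

/* File system side: register interest at mount time. */
static void bb_subscribe(struct bb_subscriber *s)
{
	s->next = subscribers;
	subscribers = s;
}

/* Block layer side: walk the chain whenever the badblocks list changes. */
static void bb_notify_all(enum bb_event ev, sector_t start, sector_t len)
{
	struct bb_subscriber *s;

	for (s = subscribers; s; s = s->next)
		s->fn(ev, start, len, s->fs_private);
}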

I am not sure that a bad block notification during or after an IO operation
is valuable to the file system. It might help if the file system simply knew
about bad blocks before logical block allocation takes place. But what
subsystem would discover bad blocks before any IO operations, and how would
the file system receive that information or some bad block table? I am not
convinced that the suggested badblocks approach is really feasible. I am
also not sure that the file system should see bad blocks at all. Why can't
the hardware manage this issue for us?

Thanks,
Vyacheslav Dubeyko.
 
