From: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
To: "vishal.l.verma@intel.com" <vishal.l.verma@intel.com>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Date: Sat, 14 Jan 2017 00:00:45 +0000	[thread overview]
Message-ID: <SN2PR04MB2191756EABCB0E9DAA3B5328887B0@SN2PR04MB2191.namprd04.prod.outlook.com> (raw)
In-Reply-To: <b9rbflutjt10mb4ofherta8j.1484345610771@email.android.com>


---- Original Message ----
Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Sent: Jan 13, 2017 1:40 PM
From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org

> The current implementation of badblocks, where we consult the badblocks
> list for every IO in the block driver works, and is a last option
> failsafe, but from a user perspective, it isn't the easiest interface to
> work with.
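
For reference, the per-IO check described above looks roughly like the
sketch below. This is only a simplified, hypothetical illustration
(made-up names and types, not the actual block/badblocks.c interface),
but it shows why every request pays for a device-wide lookup:

/*
 * Hypothetical sketch of the per-IO badblocks check -- not the real
 * kernel interface, just the shape of it.
 */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct bad_range {              /* one known-bad extent, in 512-byte sectors */
	sector_t start;
	sector_t len;
};

struct badblocks {              /* whole-device table, kept sorted by start */
	struct bad_range *ranges;
	size_t count;
};

/* Return true if [sector, sector + nr) touches any known bad range. */
static bool badblocks_overlaps(const struct badblocks *bb,
			       sector_t sector, sector_t nr)
{
	for (size_t i = 0; i < bb->count; i++) {
		const struct bad_range *r = &bb->ranges[i];

		if (r->start >= sector + nr)
			break;                  /* sorted: nothing later can hit */
		if (r->start + r->len > sector)
			return true;            /* ranges intersect */
	}
	return false;
}

/* Every IO pays this cost today, no matter which file it belongs to. */
static int submit_io(const struct badblocks *bb, sector_t sector, sector_t nr)
{
	if (badblocks_overlaps(bb, sector, nr))
		return -EIO;                    /* fail the request */
	/* ... otherwise hand the request to the device ... */
	return 0;
}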

As I remember, the FAT and HFS+ specifications describe a bad block
(physical sector) table. I believe that table was intended for floppy media,
and it eventually became a completely obsolete artefact because most storage
devices are reliable enough. Why do you need to expose bad blocks at the
file system level? Do you expect the next generation of NVM memory to be so
unreliable that the file system has to manage bad blocks? What about erasure
coding schemes? Does the file system really need to suffer from the bad
block issue?

Usually we work with LBAs, and it is the storage device's responsibility to
map a bad physical block/page/sector onto a valid one. Do you mean that we
have direct access to physical NVM memory addresses? It looks like we could
hit a "bad block" issue even when accessing data in a page cache memory page
(if NVM memory is used for the page cache, of course). So what exactly do
you mean by the "bad block" issue?

> 
> A while back, Dave Chinner had suggested a move towards smarter
> handling, and I posted initial RFC patches [1], but since then the topic
> hasn't really moved forward.
> 
> I'd like to propose and have a discussion about the following new
> functionality:
> 
> 1. Filesystems develop a native representation of badblocks. For
> example, in xfs, this would (presumably) be linked to the reverse
> mapping btree. The filesystem representation has the potential to be 
> more efficient than the block driver doing the check, as the fs can
> check the IO happening on a file against just that file's range. 

What do you mean by "file system can check the IO happening on a file"?
Do you mean a read or a write operation? And what about metadata?
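
My best guess at what is meant is something like the following sketch: map
the file offset to its physical extent and consult the bad block table only
for that extent, so that the error can be reported as (file, offset) rather
than (block-device, sector). All names here are hypothetical, not the
interface from the RFC:

/* Hypothetical, heavily simplified illustration of a per-file check. */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

struct extent {                 /* one file extent: logical offset -> sectors */
	uint64_t file_off;      /* byte offset within the file */
	uint64_t len;           /* extent length in bytes */
	sector_t phys_start;    /* starting physical sector (512-byte units) */
};

/* Stand-in for the device-wide bad block lookup. */
static bool device_range_is_bad(sector_t start, sector_t nr)
{
	static const sector_t bad_start = 2048, bad_len = 8;  /* example range */
	return start < bad_start + bad_len && bad_start < start + nr;
}

/*
 * Check a single read/write against the file's own mapping only.  Returns 0
 * for a clean range, or -EIO after reporting (file, offset) to the user.
 */
static int fs_check_file_io(const char *path, const struct extent *ext,
			    uint64_t off, uint64_t len)
{
	uint64_t rel = off - ext->file_off;        /* offset inside the extent */
	sector_t first = ext->phys_start + rel / 512;
	sector_t nr = (len + 511) / 512;

	if (device_range_is_bad(first, nr)) {
		fprintf(stderr, "%s: bad block at file offset %llu\n",
			path, (unsigned long long)off);
		return -EIO;
	}
	return 0;
}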

If we are talking about discovering a bad block during a read operation,
then rarely is a modern file system able to survive it, whether the bad
block holds metadata or user data. Let's imagine we have a really mature
file system driver: what does it mean to encounter a bad block? Failing to
read a logical block of some metadata (a bad block) means that we are unable
to extract some part of a metadata structure. From the file system driver's
point of view, the file system is corrupted; we need to stop file system
operations and, finally, check and recover the volume by means of an fsck
tool. If we find a bad block in some user file then, again, it is an issue.
Some file systems simply return an "unrecovered read error". Another one
could, theoretically, survive thanks to snapshots, for example. But either
way it ends up in a read-only mount state, and the user has to resolve the
trouble by hand.
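
To make that concrete, a read-side error handler has to pick between
outcomes like these (hypothetical, heavily simplified sketch):

/* Hypothetical sketch of the read-side dilemma described above. */
#include <errno.h>
#include <stdbool.h>

enum block_kind { FS_METADATA, FS_USER_DATA };

struct fs_state {
	bool read_only;         /* forced read-only mount after the error */
	bool needs_fsck;        /* metadata loss: volume must be checked  */
};

static int handle_read_bad_block(struct fs_state *fs, enum block_kind kind,
				 bool have_snapshot_copy)
{
	if (kind == FS_METADATA) {
		/* Part of a metadata structure is simply gone: treat the
		 * volume as corrupted, stop modifications, require fsck. */
		fs->read_only = true;
		fs->needs_fsck = true;
		return -EUCLEAN;
	}

	/* User data: either report an unrecovered read error ... */
	if (!have_snapshot_copy)
		return -EIO;

	/* ... or serve an older copy from a snapshot, but still drop to a
	 * read-only state so the user resolves the damage by hand. */
	fs->read_only = true;
	return 0;
}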

If we are talking about discovering a bad block during a write operation
then, again, we are in trouble. We usually use an asynchronous write/flush
model: first the consistent state of all metadata structures is prepared in
memory, and the flushes of metadata and user data can happen at different
times. So what should be done if we discover a bad block for any piece of
metadata or user data? Simply tracking bad blocks is not enough at all.
Let's consider user data first. If we cannot write some file's block
successfully, we have two options: (1) forget about this piece of data, or
(2) try to change the LBA associated with this piece of data. Re-allocating
the LBA of a discovered bad block (in the user data case) sounds like real
pain, because you need to rebuild the metadata that tracks the location of
that part of the file, and that sounds practically impossible for an LFS
file system, for example. If we have trouble flushing any part of the
metadata, it sounds like a complete disaster for any file system.
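
Option (2) would look roughly like the sketch below (hypothetical names,
stubbed helpers); the comment at the end is the part that makes it so
painful in practice:

/* Hypothetical sketch of remapping a failed user-data write to a new LBA. */
#include <errno.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct extent {
	uint64_t file_off;      /* byte offset within the file */
	uint64_t len;           /* extent length in bytes      */
	sector_t phys_start;    /* current on-disk location    */
};

/* Assumed helpers, stubbed out for the sketch. */
static int  write_sectors(sector_t where, const void *buf, uint64_t len)
{ (void)where; (void)buf; (void)len; return 0; }
static int  alloc_sectors(uint64_t len, sector_t *out)
{ (void)len; *out = 4096; return 0; }
static void mark_bad(sector_t start, uint64_t len)
{ (void)start; (void)len; }

static int remap_failed_write(struct extent *ext, const void *buf)
{
	sector_t new_start;
	int err;

	err = alloc_sectors(ext->len, &new_start);    /* find a healthy LBA */
	if (err)
		return err;                           /* nowhere to go: give up */

	err = write_sectors(new_start, buf, ext->len);
	if (err)
		return err;

	mark_bad(ext->phys_start, ext->len);          /* remember the bad range */
	ext->phys_start = new_start;                  /* repoint the mapping */

	/*
	 * The expensive part: every piece of metadata that recorded the old
	 * LBA (extent tree, checksums, journal or log entries) must now be
	 * rebuilt and flushed consistently -- which is exactly what makes
	 * this look impractical for an LFS-style file system.
	 */
	return 0;
}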

Are you really sure that the file system should handle the bad block issue?

> In contrast, today, the block driver checks against the whole block device
> range for every IO. On encountering badblocks, the filesystem can
> generate a better notification/error message that points the user to 
> (file, offset) as opposed to the block driver, which can only provide
> (block-device, sector).
>
> 2. The block layer adds a notifier to badblock addition/removal
> operations, which the filesystem subscribes to, and uses to maintain its
> badblocks accounting. (This part is implemented as a proof of concept in
> the RFC mentioned above [1]).
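
As far as I understand, the proposed subscription would look something like
the sketch below. The callback type and registration here are hypothetical,
not the interface from the RFC:

/* Hypothetical sketch of a badblocks add/remove notification chain. */
#include <stdint.h>

typedef uint64_t sector_t;

enum bb_event { BB_ADDED, BB_REMOVED };

/* Callback invoked by the block layer on every badblocks update. */
typedef void (*bb_notify_fn)(enum bb_event ev, sector_t start, sector_t len,
			     void *fs_private);

struct bb_subscriber {
	bb_notify_fn fn;
	void *fs_private;                    /* e.g. the superblock */
	struct bb_subscriber *next;
};

static struct bb_subscriber *subscribers;

/* File system side: register interest at mount time. */
static void bb_subscribe(struct bb_subscriber *s)
{
	s->next = subscribers;
	subscribers = s;
}

/* Block layer side: walk the chain whenever the badblocks list changes. */
static void bb_notify_all(enum bb_event ev, sector_t start, sector_t len)
{
	struct bb_subscriber *s;

	for (s = subscribers; s; s = s->next)
		s->fn(ev, start, len, s->fs_private);
}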

I am not sure that a bad block notification during or after an IO operation
is valuable to the file system. It might help if the file system simply knew
about bad blocks before logical block allocation takes place. But what
subsystem would discover bad blocks before any IO operations, and how would
the file system receive that information or some bad block table? I am not
convinced that the suggested badblocks approach is really feasible. I am
also not sure that the file system should see bad blocks at all. Why can't
the hardware manage this issue for us?

Thanks,
Vyacheslav Dubeyko.
 
