All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
       [not found] ` <b9rbflutjt10mb4ofherta8j.1484345610771@email.android.com>
  2017-01-14  0:00     ` Slava Dubeyko
@ 2017-01-14  0:00     ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-14  0:00 UTC (permalink / raw)
  To: vishal.l.verma
  Cc: linux-block, Linux FS Devel, lsf-pc, Viacheslav Dubeyko, linux-nvdimm


---- Original Message ----
Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Sent: Jan 13, 2017 1:40 PM
From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org

> The current implementation of badblocks, where we consult the badblocks
> list for every IO in the block driver works, and is a last option
> failsafe, but from a user perspective, it isn't the easiest interface to
> work with.

As I remember, FAT and HFS+ specifications contain description of bad blocks
(physical sectors) table. I believe that this table was used for the case of
floppy media. But, finally, this table becomes to be the completely obsolete
artefact because mostly storage devices are reliably enough. Why do you need
in exposing the bad blocks on the file system level?  Do you expect that next
generation of NVM memory will be so unreliable that file system needs to manage
bad blocks? What's about erasure coding schemes? Do file system really need to suffer
from the bad block issue? 

Usually, we are using LBAs and it is the responsibility of storage device to map
a bad physical block/page/sector into valid one. Do you mean that we have
access to physical NVM memory address directly? But it looks like that we can
have a "bad block" issue even we will access data into page cache's memory
page (if we will use NVM memory for page cache, of course). So, what do you
imply by "bad block" issue? 

> 
> A while back, Dave Chinner had suggested a move towards smarter
> handling, and I posted initial RFC patches [1], but since then the topic
> hasn't really moved forward.
> 
> I'd like to propose and have a discussion about the following new
> functionality:
> 
> 1. Filesystems develop a native representation of badblocks. For
> example, in xfs, this would (presumably) be linked to the reverse
> mapping btree. The filesystem representation has the potential to be 
> more efficient than the block driver doing the check, as the fs can
> check the IO happening on a file against just that file's range. 

What do you mean by "file system can check the IO happening on a file"?
Do you mean read or write operation? What's about metadata?

If we are talking about the discovering a bad block on read operation then
rare modern file system is able to survive as for the case of metadata as
for the case of user data. Let's imagine that we have really mature file
system driver then what does it mean to encounter a bad block? The failure
to read a logical block of some metadata (bad block) means that we are
unable to extract some part of a metadata structure. From file system
driver point of view, it looks like that our file system is corrupted, we need
to stop the file system operations and, finally, to check and recover file
system volume by means of fsck tool. If we find a bad block for some
user file then, again, it looks like an issue. Some file systems simply
return "unrecovered read error". Another one, theoretically, is able
to survive because of snapshots, for example. But, anyway, it will look
like as Read-Only mount state and the user will need to resolve such
trouble by hands.

If we are talking about discovering a bad block during write operation then,
again, we are in trouble. Usually, we are using asynchronous model
of write/flush operation. We are preparing the consistent state of all our
metadata structures in the memory, at first. The flush operations for metadata
and user data can be done in different times. And what should be done if we
discover bad block for any piece of metadata or user data? Simple tracking of
bad blocks is not enough at all. Let's consider user data, at first. If we cannot
write some file's block successfully then we have two ways: (1) forget about
this piece of data; (2) try to change the associated LBA for this piece of data.
The operation of re-allocation LBA number for discovered bad block
(user data case) sounds as real pain. Because you need to rebuild the metadata
that track the location of this part of file. And it sounds as practically
impossible operation, for the case of LFS file system, for example.
If we have trouble with flushing any part of metadata then it sounds as
complete disaster for any file system.

Are you really sure that file system should process bad block issue?

>In contrast, today, the block driver checks against the whole block device
> range for every IO. On encountering badblocks, the filesystem can
> generate a better notification/error message that points the user to 
> (file, offset) as opposed to the block driver, which can only provide
> (block-device, sector).
>
> 2. The block layer adds a notifier to badblock addition/removal
> operations, which the filesystem subscribes to, and uses to maintain its
> badblocks accounting. (This part is implemented as a proof of concept in
> the RFC mentioned above [1]).

I am not sure that any bad block notification during/after IO operation
is valuable for file system. Maybe, it could help if file system simply will
know about bad block beforehand the operation of logical block allocation.
But what subsystem will discover bad blocks before any IO operations?
How file system will receive information or some bad block table?
I am not convinced that suggested badblocks approach is really feasible.
Also I am not sure that file system should see the bad blocks at all.
Why hardware cannot manage this issue for us?

Thanks,
Vyacheslav Dubeyko.
 
Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-14  0:00     ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-14  0:00 UTC (permalink / raw)
  To: vishal.l.verma
  Cc: lsf-pc, linux-nvdimm, linux-block, Linux FS Devel, Viacheslav Dubeyko

DQotLS0tIE9yaWdpbmFsIE1lc3NhZ2UgLS0tLQ0KU3ViamVjdDogW0xTRi9NTSBUT1BJQ10gQmFk
YmxvY2tzIGNoZWNraW5nL3JlcHJlc2VudGF0aW9uIGluIGZpbGVzeXN0ZW1zDQpTZW50OiBKYW4g
MTMsIDIwMTcgMTo0MCBQTQ0KRnJvbTogIlZlcm1hLCBWaXNoYWwgTCIgPHZpc2hhbC5sLnZlcm1h
QGludGVsLmNvbT4NClRvOiBsc2YtcGNAbGlzdHMubGludXgtZm91bmRhdGlvbi5vcmcNCkNjOiBs
aW51eC1udmRpbW1AbGlzdHMuMDEub3JnLCBsaW51eC1ibG9ja0B2Z2VyLmtlcm5lbC5vcmcsIGxp
bnV4LWZzZGV2ZWxAdmdlci5rZXJuZWwub3JnDQoNCj4gVGhlIGN1cnJlbnQgaW1wbGVtZW50YXRp
b24gb2YgYmFkYmxvY2tzLCB3aGVyZSB3ZSBjb25zdWx0IHRoZSBiYWRibG9ja3MNCj4gbGlzdCBm
b3IgZXZlcnkgSU8gaW4gdGhlIGJsb2NrIGRyaXZlciB3b3JrcywgYW5kIGlzIGEgbGFzdCBvcHRp
b24NCj4gZmFpbHNhZmUsIGJ1dCBmcm9tIGEgdXNlciBwZXJzcGVjdGl2ZSwgaXQgaXNuJ3QgdGhl
IGVhc2llc3QgaW50ZXJmYWNlIHRvDQo+IHdvcmsgd2l0aC4NCg0KQXMgSSByZW1lbWJlciwgRkFU
IGFuZCBIRlMrIHNwZWNpZmljYXRpb25zIGNvbnRhaW4gZGVzY3JpcHRpb24gb2YgYmFkIGJsb2Nr
cw0KKHBoeXNpY2FsIHNlY3RvcnMpIHRhYmxlLiBJIGJlbGlldmUgdGhhdCB0aGlzIHRhYmxlIHdh
cyB1c2VkIGZvciB0aGUgY2FzZSBvZg0KZmxvcHB5IG1lZGlhLiBCdXQsIGZpbmFsbHksIHRoaXMg
dGFibGUgYmVjb21lcyB0byBiZSB0aGUgY29tcGxldGVseSBvYnNvbGV0ZQ0KYXJ0ZWZhY3QgYmVj
YXVzZSBtb3N0bHkgc3RvcmFnZSBkZXZpY2VzIGFyZSByZWxpYWJseSBlbm91Z2guIFdoeSBkbyB5
b3UgbmVlZA0KaW4gZXhwb3NpbmcgdGhlIGJhZCBibG9ja3Mgb24gdGhlIGZpbGUgc3lzdGVtIGxl
dmVsPyAgRG8geW91IGV4cGVjdCB0aGF0IG5leHQNCmdlbmVyYXRpb24gb2YgTlZNIG1lbW9yeSB3
aWxsIGJlIHNvIHVucmVsaWFibGUgdGhhdCBmaWxlIHN5c3RlbSBuZWVkcyB0byBtYW5hZ2UNCmJh
ZCBibG9ja3M/IFdoYXQncyBhYm91dCBlcmFzdXJlIGNvZGluZyBzY2hlbWVzPyBEbyBmaWxlIHN5
c3RlbSByZWFsbHkgbmVlZCB0byBzdWZmZXINCmZyb20gdGhlIGJhZCBibG9jayBpc3N1ZT8gDQoN
ClVzdWFsbHksIHdlIGFyZSB1c2luZyBMQkFzIGFuZCBpdCBpcyB0aGUgcmVzcG9uc2liaWxpdHkg
b2Ygc3RvcmFnZSBkZXZpY2UgdG8gbWFwDQphIGJhZCBwaHlzaWNhbCBibG9jay9wYWdlL3NlY3Rv
ciBpbnRvIHZhbGlkIG9uZS4gRG8geW91IG1lYW4gdGhhdCB3ZSBoYXZlDQphY2Nlc3MgdG8gcGh5
c2ljYWwgTlZNIG1lbW9yeSBhZGRyZXNzIGRpcmVjdGx5PyBCdXQgaXQgbG9va3MgbGlrZSB0aGF0
IHdlIGNhbg0KaGF2ZSBhICJiYWQgYmxvY2siIGlzc3VlIGV2ZW4gd2Ugd2lsbCBhY2Nlc3MgZGF0
YSBpbnRvIHBhZ2UgY2FjaGUncyBtZW1vcnkNCnBhZ2UgKGlmIHdlIHdpbGwgdXNlIE5WTSBtZW1v
cnkgZm9yIHBhZ2UgY2FjaGUsIG9mIGNvdXJzZSkuIFNvLCB3aGF0IGRvIHlvdQ0KaW1wbHkgYnkg
ImJhZCBibG9jayIgaXNzdWU/IA0KDQo+IA0KPiBBIHdoaWxlIGJhY2ssIERhdmUgQ2hpbm5lciBo
YWQgc3VnZ2VzdGVkIGEgbW92ZSB0b3dhcmRzIHNtYXJ0ZXINCj4gaGFuZGxpbmcsIGFuZCBJIHBv
c3RlZCBpbml0aWFsIFJGQyBwYXRjaGVzIFsxXSwgYnV0IHNpbmNlIHRoZW4gdGhlIHRvcGljDQo+
IGhhc24ndCByZWFsbHkgbW92ZWQgZm9yd2FyZC4NCj4gDQo+IEknZCBsaWtlIHRvIHByb3Bvc2Ug
YW5kIGhhdmUgYSBkaXNjdXNzaW9uIGFib3V0IHRoZSBmb2xsb3dpbmcgbmV3DQo+IGZ1bmN0aW9u
YWxpdHk6DQo+IA0KPiAxLiBGaWxlc3lzdGVtcyBkZXZlbG9wIGEgbmF0aXZlIHJlcHJlc2VudGF0
aW9uIG9mIGJhZGJsb2Nrcy4gRm9yDQo+IGV4YW1wbGUsIGluIHhmcywgdGhpcyB3b3VsZCAocHJl
c3VtYWJseSkgYmUgbGlua2VkIHRvIHRoZSByZXZlcnNlDQo+IG1hcHBpbmcgYnRyZWUuIFRoZSBm
aWxlc3lzdGVtIHJlcHJlc2VudGF0aW9uIGhhcyB0aGUgcG90ZW50aWFsIHRvIGJlwqANCj4gbW9y
ZSBlZmZpY2llbnQgdGhhbiB0aGUgYmxvY2sgZHJpdmVyIGRvaW5nIHRoZSBjaGVjaywgYXMgdGhl
IGZzIGNhbg0KPiBjaGVjayB0aGUgSU8gaGFwcGVuaW5nIG9uIGEgZmlsZSBhZ2FpbnN0IGp1c3Qg
dGhhdCBmaWxlJ3MgcmFuZ2UuIA0KDQpXaGF0IGRvIHlvdSBtZWFuIGJ5ICJmaWxlIHN5c3RlbSBj
YW4gY2hlY2sgdGhlIElPIGhhcHBlbmluZyBvbiBhIGZpbGUiPw0KRG8geW91IG1lYW4gcmVhZCBv
ciB3cml0ZSBvcGVyYXRpb24/IFdoYXQncyBhYm91dCBtZXRhZGF0YT8NCg0KSWYgd2UgYXJlIHRh
bGtpbmcgYWJvdXQgdGhlIGRpc2NvdmVyaW5nIGEgYmFkIGJsb2NrIG9uIHJlYWQgb3BlcmF0aW9u
IHRoZW4NCnJhcmUgbW9kZXJuIGZpbGUgc3lzdGVtIGlzIGFibGUgdG8gc3Vydml2ZSBhcyBmb3Ig
dGhlIGNhc2Ugb2YgbWV0YWRhdGEgYXMNCmZvciB0aGUgY2FzZSBvZiB1c2VyIGRhdGEuIExldCdz
IGltYWdpbmUgdGhhdCB3ZSBoYXZlIHJlYWxseSBtYXR1cmUgZmlsZQ0Kc3lzdGVtIGRyaXZlciB0
aGVuIHdoYXQgZG9lcyBpdCBtZWFuIHRvIGVuY291bnRlciBhIGJhZCBibG9jaz8gVGhlIGZhaWx1
cmUNCnRvIHJlYWQgYSBsb2dpY2FsIGJsb2NrIG9mIHNvbWUgbWV0YWRhdGEgKGJhZCBibG9jaykg
bWVhbnMgdGhhdCB3ZSBhcmUNCnVuYWJsZSB0byBleHRyYWN0IHNvbWUgcGFydCBvZiBhIG1ldGFk
YXRhIHN0cnVjdHVyZS4gRnJvbSBmaWxlIHN5c3RlbQ0KZHJpdmVyIHBvaW50IG9mIHZpZXcsIGl0
IGxvb2tzIGxpa2UgdGhhdCBvdXIgZmlsZSBzeXN0ZW0gaXMgY29ycnVwdGVkLCB3ZSBuZWVkDQp0
byBzdG9wIHRoZSBmaWxlIHN5c3RlbSBvcGVyYXRpb25zIGFuZCwgZmluYWxseSwgdG8gY2hlY2sg
YW5kIHJlY292ZXIgZmlsZQ0Kc3lzdGVtIHZvbHVtZSBieSBtZWFucyBvZiBmc2NrIHRvb2wuIElm
IHdlIGZpbmQgYSBiYWQgYmxvY2sgZm9yIHNvbWUNCnVzZXIgZmlsZSB0aGVuLCBhZ2FpbiwgaXQg
bG9va3MgbGlrZSBhbiBpc3N1ZS4gU29tZSBmaWxlIHN5c3RlbXMgc2ltcGx5DQpyZXR1cm4gInVu
cmVjb3ZlcmVkIHJlYWQgZXJyb3IiLiBBbm90aGVyIG9uZSwgdGhlb3JldGljYWxseSwgaXMgYWJs
ZQ0KdG8gc3Vydml2ZSBiZWNhdXNlIG9mIHNuYXBzaG90cywgZm9yIGV4YW1wbGUuIEJ1dCwgYW55
d2F5LCBpdCB3aWxsIGxvb2sNCmxpa2UgYXMgUmVhZC1Pbmx5IG1vdW50IHN0YXRlIGFuZCB0aGUg
dXNlciB3aWxsIG5lZWQgdG8gcmVzb2x2ZSBzdWNoDQp0cm91YmxlIGJ5IGhhbmRzLg0KDQpJZiB3
ZSBhcmUgdGFsa2luZyBhYm91dCBkaXNjb3ZlcmluZyBhIGJhZCBibG9jayBkdXJpbmcgd3JpdGUg
b3BlcmF0aW9uIHRoZW4sDQphZ2Fpbiwgd2UgYXJlIGluIHRyb3VibGUuIFVzdWFsbHksIHdlIGFy
ZSB1c2luZyBhc3luY2hyb25vdXMgbW9kZWwNCm9mIHdyaXRlL2ZsdXNoIG9wZXJhdGlvbi4gV2Ug
YXJlIHByZXBhcmluZyB0aGUgY29uc2lzdGVudCBzdGF0ZSBvZiBhbGwgb3VyDQptZXRhZGF0YSBz
dHJ1Y3R1cmVzIGluIHRoZSBtZW1vcnksIGF0IGZpcnN0LiBUaGUgZmx1c2ggb3BlcmF0aW9ucyBm
b3IgbWV0YWRhdGENCmFuZCB1c2VyIGRhdGEgY2FuIGJlIGRvbmUgaW4gZGlmZmVyZW50IHRpbWVz
LiBBbmQgd2hhdCBzaG91bGQgYmUgZG9uZSBpZiB3ZQ0KZGlzY292ZXIgYmFkIGJsb2NrIGZvciBh
bnkgcGllY2Ugb2YgbWV0YWRhdGEgb3IgdXNlciBkYXRhPyBTaW1wbGUgdHJhY2tpbmcgb2YNCmJh
ZCBibG9ja3MgaXMgbm90IGVub3VnaCBhdCBhbGwuIExldCdzIGNvbnNpZGVyIHVzZXIgZGF0YSwg
YXQgZmlyc3QuIElmIHdlIGNhbm5vdA0Kd3JpdGUgc29tZSBmaWxlJ3MgYmxvY2sgc3VjY2Vzc2Z1
bGx5IHRoZW4gd2UgaGF2ZSB0d28gd2F5czogKDEpIGZvcmdldCBhYm91dA0KdGhpcyBwaWVjZSBv
ZiBkYXRhOyAoMikgdHJ5IHRvIGNoYW5nZSB0aGUgYXNzb2NpYXRlZCBMQkEgZm9yIHRoaXMgcGll
Y2Ugb2YgZGF0YS4NClRoZSBvcGVyYXRpb24gb2YgcmUtYWxsb2NhdGlvbiBMQkEgbnVtYmVyIGZv
ciBkaXNjb3ZlcmVkIGJhZCBibG9jaw0KKHVzZXIgZGF0YSBjYXNlKSBzb3VuZHMgYXMgcmVhbCBw
YWluLiBCZWNhdXNlIHlvdSBuZWVkIHRvIHJlYnVpbGQgdGhlIG1ldGFkYXRhDQp0aGF0IHRyYWNr
IHRoZSBsb2NhdGlvbiBvZiB0aGlzIHBhcnQgb2YgZmlsZS4gQW5kIGl0IHNvdW5kcyBhcyBwcmFj
dGljYWxseQ0KaW1wb3NzaWJsZSBvcGVyYXRpb24sIGZvciB0aGUgY2FzZSBvZiBMRlMgZmlsZSBz
eXN0ZW0sIGZvciBleGFtcGxlLg0KSWYgd2UgaGF2ZSB0cm91YmxlIHdpdGggZmx1c2hpbmcgYW55
IHBhcnQgb2YgbWV0YWRhdGEgdGhlbiBpdCBzb3VuZHMgYXMNCmNvbXBsZXRlIGRpc2FzdGVyIGZv
ciBhbnkgZmlsZSBzeXN0ZW0uDQoNCkFyZSB5b3UgcmVhbGx5IHN1cmUgdGhhdCBmaWxlIHN5c3Rl
bSBzaG91bGQgcHJvY2VzcyBiYWQgYmxvY2sgaXNzdWU/DQoNCj5JbiBjb250cmFzdCwgdG9kYXks
IHRoZSBibG9jayBkcml2ZXIgY2hlY2tzIGFnYWluc3QgdGhlIHdob2xlIGJsb2NrIGRldmljZQ0K
PiByYW5nZSBmb3IgZXZlcnkgSU8uIE9uIGVuY291bnRlcmluZyBiYWRibG9ja3MsIHRoZSBmaWxl
c3lzdGVtIGNhbg0KPiBnZW5lcmF0ZSBhIGJldHRlciBub3RpZmljYXRpb24vZXJyb3IgbWVzc2Fn
ZSB0aGF0IHBvaW50cyB0aGUgdXNlciB0b8KgDQo+IChmaWxlLCBvZmZzZXQpIGFzIG9wcG9zZWQg
dG8gdGhlIGJsb2NrIGRyaXZlciwgd2hpY2ggY2FuIG9ubHkgcHJvdmlkZQ0KPiAoYmxvY2stZGV2
aWNlLCBzZWN0b3IpLg0KPg0KPiAyLiBUaGUgYmxvY2sgbGF5ZXIgYWRkcyBhIG5vdGlmaWVyIHRv
IGJhZGJsb2NrIGFkZGl0aW9uL3JlbW92YWwNCj4gb3BlcmF0aW9ucywgd2hpY2ggdGhlIGZpbGVz
eXN0ZW0gc3Vic2NyaWJlcyB0bywgYW5kIHVzZXMgdG8gbWFpbnRhaW4gaXRzDQo+IGJhZGJsb2Nr
cyBhY2NvdW50aW5nLiAoVGhpcyBwYXJ0IGlzIGltcGxlbWVudGVkIGFzIGEgcHJvb2Ygb2YgY29u
Y2VwdCBpbg0KPiB0aGUgUkZDIG1lbnRpb25lZCBhYm92ZSBbMV0pLg0KDQpJIGFtIG5vdCBzdXJl
IHRoYXQgYW55IGJhZCBibG9jayBub3RpZmljYXRpb24gZHVyaW5nL2FmdGVyIElPIG9wZXJhdGlv
bg0KaXMgdmFsdWFibGUgZm9yIGZpbGUgc3lzdGVtLiBNYXliZSwgaXQgY291bGQgaGVscCBpZiBm
aWxlIHN5c3RlbSBzaW1wbHkgd2lsbA0Ka25vdyBhYm91dCBiYWQgYmxvY2sgYmVmb3JlaGFuZCB0
aGUgb3BlcmF0aW9uIG9mIGxvZ2ljYWwgYmxvY2sgYWxsb2NhdGlvbi4NCkJ1dCB3aGF0IHN1YnN5
c3RlbSB3aWxsIGRpc2NvdmVyIGJhZCBibG9ja3MgYmVmb3JlIGFueSBJTyBvcGVyYXRpb25zPw0K
SG93IGZpbGUgc3lzdGVtIHdpbGwgcmVjZWl2ZSBpbmZvcm1hdGlvbiBvciBzb21lIGJhZCBibG9j
ayB0YWJsZT8NCkkgYW0gbm90IGNvbnZpbmNlZCB0aGF0IHN1Z2dlc3RlZCBiYWRibG9ja3MgYXBw
cm9hY2ggaXMgcmVhbGx5IGZlYXNpYmxlLg0KQWxzbyBJIGFtIG5vdCBzdXJlIHRoYXQgZmlsZSBz
eXN0ZW0gc2hvdWxkIHNlZSB0aGUgYmFkIGJsb2NrcyBhdCBhbGwuDQpXaHkgaGFyZHdhcmUgY2Fu
bm90IG1hbmFnZSB0aGlzIGlzc3VlIGZvciB1cz8NCg0KVGhhbmtzLA0KVnlhY2hlc2xhdiBEdWJl
eWtvLg0KIA0K

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-14  0:00     ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-14  0:00 UTC (permalink / raw)
  To: vishal.l.verma
  Cc: lsf-pc, linux-nvdimm, linux-block, Linux FS Devel, Viacheslav Dubeyko


---- Original Message ----
Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Sent: Jan 13, 2017 1:40 PM
From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org

> The current implementation of badblocks, where we consult the badblocks
> list for every IO in the block driver works, and is a last option
> failsafe, but from a user perspective, it isn't the easiest interface to
> work with.

As I remember, FAT and HFS+ specifications contain description of bad blocks
(physical sectors) table. I believe that this table was used for the case of
floppy media. But, finally, this table becomes to be the completely obsolete
artefact because mostly storage devices are reliably enough. Why do you need
in exposing the bad blocks on the file system level?  Do you expect that next
generation of NVM memory will be so unreliable that file system needs to manage
bad blocks? What's about erasure coding schemes? Do file system really need to suffer
from the bad block issue? 

Usually, we are using LBAs and it is the responsibility of storage device to map
a bad physical block/page/sector into valid one. Do you mean that we have
access to physical NVM memory address directly? But it looks like that we can
have a "bad block" issue even we will access data into page cache's memory
page (if we will use NVM memory for page cache, of course). So, what do you
imply by "bad block" issue? 

> 
> A while back, Dave Chinner had suggested a move towards smarter
> handling, and I posted initial RFC patches [1], but since then the topic
> hasn't really moved forward.
> 
> I'd like to propose and have a discussion about the following new
> functionality:
> 
> 1. Filesystems develop a native representation of badblocks. For
> example, in xfs, this would (presumably) be linked to the reverse
> mapping btree. The filesystem representation has the potential to be 
> more efficient than the block driver doing the check, as the fs can
> check the IO happening on a file against just that file's range. 

What do you mean by "file system can check the IO happening on a file"?
Do you mean read or write operation? What's about metadata?

If we are talking about the discovering a bad block on read operation then
rare modern file system is able to survive as for the case of metadata as
for the case of user data. Let's imagine that we have really mature file
system driver then what does it mean to encounter a bad block? The failure
to read a logical block of some metadata (bad block) means that we are
unable to extract some part of a metadata structure. From file system
driver point of view, it looks like that our file system is corrupted, we need
to stop the file system operations and, finally, to check and recover file
system volume by means of fsck tool. If we find a bad block for some
user file then, again, it looks like an issue. Some file systems simply
return "unrecovered read error". Another one, theoretically, is able
to survive because of snapshots, for example. But, anyway, it will look
like as Read-Only mount state and the user will need to resolve such
trouble by hands.

If we are talking about discovering a bad block during write operation then,
again, we are in trouble. Usually, we are using asynchronous model
of write/flush operation. We are preparing the consistent state of all our
metadata structures in the memory, at first. The flush operations for metadata
and user data can be done in different times. And what should be done if we
discover bad block for any piece of metadata or user data? Simple tracking of
bad blocks is not enough at all. Let's consider user data, at first. If we cannot
write some file's block successfully then we have two ways: (1) forget about
this piece of data; (2) try to change the associated LBA for this piece of data.
The operation of re-allocation LBA number for discovered bad block
(user data case) sounds as real pain. Because you need to rebuild the metadata
that track the location of this part of file. And it sounds as practically
impossible operation, for the case of LFS file system, for example.
If we have trouble with flushing any part of metadata then it sounds as
complete disaster for any file system.

Are you really sure that file system should process bad block issue?

>In contrast, today, the block driver checks against the whole block device
> range for every IO. On encountering badblocks, the filesystem can
> generate a better notification/error message that points the user to 
> (file, offset) as opposed to the block driver, which can only provide
> (block-device, sector).
>
> 2. The block layer adds a notifier to badblock addition/removal
> operations, which the filesystem subscribes to, and uses to maintain its
> badblocks accounting. (This part is implemented as a proof of concept in
> the RFC mentioned above [1]).

I am not sure that any bad block notification during/after IO operation
is valuable for file system. Maybe, it could help if file system simply will
know about bad block beforehand the operation of logical block allocation.
But what subsystem will discover bad blocks before any IO operations?
How file system will receive information or some bad block table?
I am not convinced that suggested badblocks approach is really feasible.
Also I am not sure that file system should see the bad blocks at all.
Why hardware cannot manage this issue for us?

Thanks,
Vyacheslav Dubeyko.
 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-14  0:00     ` Slava Dubeyko
@ 2017-01-14  0:49       ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-14  0:49 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: linux-block, Linux FS Devel, lsf-pc, Viacheslav Dubeyko, linux-nvdimm

On 01/14, Slava Dubeyko wrote:
> 
> ---- Original Message ----
> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> Sent: Jan 13, 2017 1:40 PM
> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> To: lsf-pc@lists.linux-foundation.org
> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> 
> > The current implementation of badblocks, where we consult the badblocks
> > list for every IO in the block driver works, and is a last option
> > failsafe, but from a user perspective, it isn't the easiest interface to
> > work with.
> 
> As I remember, FAT and HFS+ specifications contain description of bad blocks
> (physical sectors) table. I believe that this table was used for the case of
> floppy media. But, finally, this table becomes to be the completely obsolete
> artefact because mostly storage devices are reliably enough. Why do you need
> in exposing the bad blocks on the file system level?  Do you expect that next
> generation of NVM memory will be so unreliable that file system needs to manage
> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> from the bad block issue? 
> 
> Usually, we are using LBAs and it is the responsibility of storage device to map
> a bad physical block/page/sector into valid one. Do you mean that we have
> access to physical NVM memory address directly? But it looks like that we can
> have a "bad block" issue even we will access data into page cache's memory
> page (if we will use NVM memory for page cache, of course). So, what do you
> imply by "bad block" issue? 

We don't have direct physical access to the device's address space, in
the sense the device is still free to perform remapping of chunks of NVM
underneath us. The problem is that when a block or address range (as
small as a cache line) goes bad, the device maintains a poison bit for
every affected cache line. Behind the scenes, it may have already
remapped the range, but the cache line poison has to be kept so that
there is a notification to the user/owner of the data that something has
been lost. Since NVM is byte addressable memory sitting on the memory
bus, such a poisoned cache line results in memory errors and SIGBUSes.
Compared to tradational storage where an app will get nice and friendly
(relatively speaking..) -EIOs. The whole badblocks implementation was
done so that the driver can intercept IO (i.e. reads) to _known_ bad
locations, and short-circuit them with an EIO. If the driver doesn't
catch these, the reads will turn into a memory bus access, and the
poison will cause a SIGBUS.

This effort is to try and make this badblock checking smarter - and try
and reduce the penalty on every IO to a smaller range, which only the
filesystem can do.

> 
> > 
> > A while back, Dave Chinner had suggested a move towards smarter
> > handling, and I posted initial RFC patches [1], but since then the topic
> > hasn't really moved forward.
> > 
> > I'd like to propose and have a discussion about the following new
> > functionality:
> > 
> > 1. Filesystems develop a native representation of badblocks. For
> > example, in xfs, this would (presumably) be linked to the reverse
> > mapping btree. The filesystem representation has the potential to be 
> > more efficient than the block driver doing the check, as the fs can
> > check the IO happening on a file against just that file's range. 
> 
> What do you mean by "file system can check the IO happening on a file"?
> Do you mean read or write operation? What's about metadata?

For the purpose described above, i.e. returning early EIOs when
possible, this will be limited to reads and metadata reads. If we're
about to do a metadata read, and realize the block(s) about to be read
are on the badblocks list, then we do the same thing as when we discover
other kinds of metadata corruption.

> 
> If we are talking about the discovering a bad block on read operation then
> rare modern file system is able to survive as for the case of metadata as
> for the case of user data. Let's imagine that we have really mature file
> system driver then what does it mean to encounter a bad block? The failure
> to read a logical block of some metadata (bad block) means that we are
> unable to extract some part of a metadata structure. From file system
> driver point of view, it looks like that our file system is corrupted, we need
> to stop the file system operations and, finally, to check and recover file
> system volume by means of fsck tool. If we find a bad block for some
> user file then, again, it looks like an issue. Some file systems simply
> return "unrecovered read error". Another one, theoretically, is able
> to survive because of snapshots, for example. But, anyway, it will look
> like as Read-Only mount state and the user will need to resolve such
> trouble by hands.

As far as I can tell, all of these things remain the same. The goal here
isn't to survive more NVM badblocks than we would've before, and lost
data or lost metadata will continue to have the same consequences as
before, and will need the same recovery actions/intervention as before.
The goal is to make the failure model similar to what users expect
today, and as much as possible make recovery actions too similarly
intuitive.

> 
> If we are talking about discovering a bad block during write operation then,
> again, we are in trouble. Usually, we are using asynchronous model
> of write/flush operation. We are preparing the consistent state of all our
> metadata structures in the memory, at first. The flush operations for metadata
> and user data can be done in different times. And what should be done if we
> discover bad block for any piece of metadata or user data? Simple tracking of
> bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> write some file's block successfully then we have two ways: (1) forget about
> this piece of data; (2) try to change the associated LBA for this piece of data.
> The operation of re-allocation LBA number for discovered bad block
> (user data case) sounds as real pain. Because you need to rebuild the metadata
> that track the location of this part of file. And it sounds as practically
> impossible operation, for the case of LFS file system, for example.
> If we have trouble with flushing any part of metadata then it sounds as
> complete disaster for any file system.

Writes can get more complicated in certain cases. If it is a regular
page cache writeback, or any aligned write that goes through the block
driver, that is completely fine. The block driver will check that the
block was previously marked as bad, do a "clear poison" operation
(defined in the ACPI spec), which tells the firmware that the poison bit
is not OK to be cleared, and writes the new data. This also removes the
block from the badblocks list, and in this scheme, triggers a
notification to the filesystem that it too can remove the block from its
accounting. mmap writes and DAX can get more complicated, and at times
they will just trigger a SIGBUS, and there's no way around that.

> 
> Are you really sure that file system should process bad block issue?
> 
> >In contrast, today, the block driver checks against the whole block device
> > range for every IO. On encountering badblocks, the filesystem can
> > generate a better notification/error message that points the user to 
> > (file, offset) as opposed to the block driver, which can only provide
> > (block-device, sector).
> >
> > 2. The block layer adds a notifier to badblock addition/removal
> > operations, which the filesystem subscribes to, and uses to maintain its
> > badblocks accounting. (This part is implemented as a proof of concept in
> > the RFC mentioned above [1]).
> 
> I am not sure that any bad block notification during/after IO operation
> is valuable for file system. Maybe, it could help if file system simply will
> know about bad block beforehand the operation of logical block allocation.
> But what subsystem will discover bad blocks before any IO operations?
> How file system will receive information or some bad block table?

The driver populates its badblocks lists whenever an Address Range Scrub
is started (also via ACPI methods). This is always done at
initialization time, so that it can build an in-memory representation of
the badblocks. Additionally, this can also be triggered manually. And
finally badblocks can also get populated for new latent errors when a
machine check exception occurs. All of these can trigger notification to
the file system without actual user reads happening.

> I am not convinced that suggested badblocks approach is really feasible.
> Also I am not sure that file system should see the bad blocks at all.
> Why hardware cannot manage this issue for us?

Hardware does manage the actual badblocks issue for us in the sense that
when it discovers a badblock it will do the remapping. But since this is
on the memory bus, and has different error signatures than applications
are used to, we want to make the error handling similar to the existing
storage model.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-14  0:49       ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-14  0:49 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: lsf-pc, linux-nvdimm, linux-block, Linux FS Devel, Viacheslav Dubeyko

On 01/14, Slava Dubeyko wrote:
> 
> ---- Original Message ----
> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> Sent: Jan 13, 2017 1:40 PM
> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> To: lsf-pc@lists.linux-foundation.org
> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> 
> > The current implementation of badblocks, where we consult the badblocks
> > list for every IO in the block driver works, and is a last option
> > failsafe, but from a user perspective, it isn't the easiest interface to
> > work with.
> 
> As I remember, FAT and HFS+ specifications contain description of bad blocks
> (physical sectors) table. I believe that this table was used for the case of
> floppy media. But, finally, this table becomes to be the completely obsolete
> artefact because mostly storage devices are reliably enough. Why do you need
> in exposing the bad blocks on the file system level?  Do you expect that next
> generation of NVM memory will be so unreliable that file system needs to manage
> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> from the bad block issue? 
> 
> Usually, we are using LBAs and it is the responsibility of storage device to map
> a bad physical block/page/sector into valid one. Do you mean that we have
> access to physical NVM memory address directly? But it looks like that we can
> have a "bad block" issue even we will access data into page cache's memory
> page (if we will use NVM memory for page cache, of course). So, what do you
> imply by "bad block" issue? 

We don't have direct physical access to the device's address space, in
the sense the device is still free to perform remapping of chunks of NVM
underneath us. The problem is that when a block or address range (as
small as a cache line) goes bad, the device maintains a poison bit for
every affected cache line. Behind the scenes, it may have already
remapped the range, but the cache line poison has to be kept so that
there is a notification to the user/owner of the data that something has
been lost. Since NVM is byte addressable memory sitting on the memory
bus, such a poisoned cache line results in memory errors and SIGBUSes.
Compared to tradational storage where an app will get nice and friendly
(relatively speaking..) -EIOs. The whole badblocks implementation was
done so that the driver can intercept IO (i.e. reads) to _known_ bad
locations, and short-circuit them with an EIO. If the driver doesn't
catch these, the reads will turn into a memory bus access, and the
poison will cause a SIGBUS.

This effort is to try and make this badblock checking smarter - and try
and reduce the penalty on every IO to a smaller range, which only the
filesystem can do.

> 
> > 
> > A while back, Dave Chinner had suggested a move towards smarter
> > handling, and I posted initial RFC patches [1], but since then the topic
> > hasn't really moved forward.
> > 
> > I'd like to propose and have a discussion about the following new
> > functionality:
> > 
> > 1. Filesystems develop a native representation of badblocks. For
> > example, in xfs, this would (presumably) be linked to the reverse
> > mapping btree. The filesystem representation has the potential to be�
> > more efficient than the block driver doing the check, as the fs can
> > check the IO happening on a file against just that file's range. 
> 
> What do you mean by "file system can check the IO happening on a file"?
> Do you mean read or write operation? What's about metadata?

For the purpose described above, i.e. returning early EIOs when
possible, this will be limited to reads and metadata reads. If we're
about to do a metadata read, and realize the block(s) about to be read
are on the badblocks list, then we do the same thing as when we discover
other kinds of metadata corruption.

> 
> If we are talking about the discovering a bad block on read operation then
> rare modern file system is able to survive as for the case of metadata as
> for the case of user data. Let's imagine that we have really mature file
> system driver then what does it mean to encounter a bad block? The failure
> to read a logical block of some metadata (bad block) means that we are
> unable to extract some part of a metadata structure. From file system
> driver point of view, it looks like that our file system is corrupted, we need
> to stop the file system operations and, finally, to check and recover file
> system volume by means of fsck tool. If we find a bad block for some
> user file then, again, it looks like an issue. Some file systems simply
> return "unrecovered read error". Another one, theoretically, is able
> to survive because of snapshots, for example. But, anyway, it will look
> like as Read-Only mount state and the user will need to resolve such
> trouble by hands.

As far as I can tell, all of these things remain the same. The goal here
isn't to survive more NVM badblocks than we would've before, and lost
data or lost metadata will continue to have the same consequences as
before, and will need the same recovery actions/intervention as before.
The goal is to make the failure model similar to what users expect
today, and as much as possible make recovery actions too similarly
intuitive.

> 
> If we are talking about discovering a bad block during write operation then,
> again, we are in trouble. Usually, we are using asynchronous model
> of write/flush operation. We are preparing the consistent state of all our
> metadata structures in the memory, at first. The flush operations for metadata
> and user data can be done in different times. And what should be done if we
> discover bad block for any piece of metadata or user data? Simple tracking of
> bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> write some file's block successfully then we have two ways: (1) forget about
> this piece of data; (2) try to change the associated LBA for this piece of data.
> The operation of re-allocation LBA number for discovered bad block
> (user data case) sounds as real pain. Because you need to rebuild the metadata
> that track the location of this part of file. And it sounds as practically
> impossible operation, for the case of LFS file system, for example.
> If we have trouble with flushing any part of metadata then it sounds as
> complete disaster for any file system.

Writes can get more complicated in certain cases. If it is a regular
page cache writeback, or any aligned write that goes through the block
driver, that is completely fine. The block driver will check that the
block was previously marked as bad, do a "clear poison" operation
(defined in the ACPI spec), which tells the firmware that the poison bit
is not OK to be cleared, and writes the new data. This also removes the
block from the badblocks list, and in this scheme, triggers a
notification to the filesystem that it too can remove the block from its
accounting. mmap writes and DAX can get more complicated, and at times
they will just trigger a SIGBUS, and there's no way around that.

> 
> Are you really sure that file system should process bad block issue?
> 
> >In contrast, today, the block driver checks against the whole block device
> > range for every IO. On encountering badblocks, the filesystem can
> > generate a better notification/error message that points the user to�
> > (file, offset) as opposed to the block driver, which can only provide
> > (block-device, sector).
> >
> > 2. The block layer adds a notifier to badblock addition/removal
> > operations, which the filesystem subscribes to, and uses to maintain its
> > badblocks accounting. (This part is implemented as a proof of concept in
> > the RFC mentioned above [1]).
> 
> I am not sure that any bad block notification during/after IO operation
> is valuable for file system. Maybe, it could help if file system simply will
> know about bad block beforehand the operation of logical block allocation.
> But what subsystem will discover bad blocks before any IO operations?
> How file system will receive information or some bad block table?

The driver populates its badblocks lists whenever an Address Range Scrub
is started (also via ACPI methods). This is always done at
initialization time, so that it can build an in-memory representation of
the badblocks. Additionally, this can also be triggered manually. And
finally badblocks can also get populated for new latent errors when a
machine check exception occurs. All of these can trigger notification to
the file system without actual user reads happening.

> I am not convinced that suggested badblocks approach is really feasible.
> Also I am not sure that file system should see the bad blocks at all.
> Why hardware cannot manage this issue for us?

Hardware does manage the actual badblocks issue for us in the sense that
when it discovers a badblock it will do the remapping. But since this is
on the memory bus, and has different error signatures than applications
are used to, we want to make the error handling similar to the existing
storage model.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-14  0:49       ` Vishal Verma
  (?)
@ 2017-01-16  2:27         ` Slava Dubeyko
  -1 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-16  2:27 UTC (permalink / raw)
  To: Vishal Verma
  Cc: linux-block, Linux FS Devel, lsf-pc, Viacheslav Dubeyko, linux-nvdimm


-----Original Message-----
From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
Sent: Friday, January 13, 2017 4:49 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

<skipped>

> We don't have direct physical access to the device's address space, in the sense
> the device is still free to perform remapping of chunks of NVM underneath us.
> The problem is that when a block or address range (as small as a cache line) goes bad,
> the device maintains a poison bit for every affected cache line. Behind the scenes,
> it may have already remapped the range, but the cache line poison has to be kept so that
> there is a notification to the user/owner of the data that something has been lost.
> Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> cache line results in memory errors and SIGBUSes.
> Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
>
> This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> on every IO to a smaller range, which only the filesystem can do.

I still slightly puzzled and I cannot understand why the situation looks like a dead end.
As far as I can see, first of all, a NVM device is able to use hardware-based LDPC,
Reed-Solomon error correction or any other fancy code. It could provide some error
correction basis. Also it can provide the way of estimation of BER value. So, if a NVM memory's
address range degrades gradually (during weeks or months) then, practically, it's possible
to remap and to migrate the affected address ranges in the background. Otherwise,
if a NVM memory so unreliable that address range is able to degrade during seconds or minutes
then who will use such NVM memory?

OK. Let's imagine that NVM memory device hasn't any internal error correction hardware-based
scheme. Next level of defense could be any erasure coding scheme on device driver level. So, any
piece of data can be protected by parities. And device driver will be responsible for management
of erasure coding scheme. It will increase latency of read operation for the case of necessity
to recover the affected memory page. But, finally, all recovering activity will be behind the scene
and file system will be unaware about such recovering activity.

If you are going not to provide any erasure coding or error correction scheme then it's really
bad case. The fsck tool is not regular case tool but the last resort. If you are going to rely on
the fsck tool then simply forget about using your hardware. Some file systems haven't the fsck
tool at all. Some guys really believe that file system has to work without support of the fsck tool.
Even if a mature file system has reliable fsck tool then the probability of file system recovering
is very low in the case of serious metadata corruptions. So, it means that you are trying to suggest
the technique when we will lose the whole file system volumes on regular basis without any hope
to recover data. Even if file system has snapshots then, again, we haven't hope because we can
suffer from read error and for operation with snapshot.

But if we will have support of any erasure coding scheme and NVM device discovers poisoned
cache line for some memory page then, I suppose, that such situation could looks like as page fault
and memory subsystem will need to re-read the page with background recovery of memory page's
content.

It sounds for me that we simply have some poorly designed hardware. And it is impossible
to push such issue on file system level. I believe that such issue can be managed by block
device or DAX subsystem in the presence of any erasure coding scheme. Otherwise, no
file system is able to survive in such wild environment. Because, I assume that any file
system volume will be in unrecoverable state in 50% (or significantly more) cases of bad block
discovering. Because any affection of metadata block can be resulted in severely inconsistent
state of file system's metadata structures. And it's very non-trivial task to recover the consistent
state of file system's metadata structures in the case of losing some part of it.

> > > 
> > > A while back, Dave Chinner had suggested a move towards smarter 
> > > handling, and I posted initial RFC patches [1], but since then the 
> > > topic hasn't really moved forward.
> > > 
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > > 
> > > 1. Filesystems develop a native representation of badblocks. For 
> > > example, in xfs, this would (presumably) be linked to the reverse 
> > > mapping btree. The filesystem representation has the potential to be 
> > > more efficient than the block driver doing the check, as the fs can 
> > > check the IO happening on a file against just that file's range.
> > 
> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean read or write operation? What's about metadata?
>
> For the purpose described above, i.e. returning early EIOs when possible,
> this will be limited to reads and metadata reads. If we're about to do a metadata
> read, and realize the block(s) about to be read are on the badblocks list, then
> we do the same thing as when we discover other kinds of metadata corruption.

Frankly speaking, I cannot follow how badblock list is able to help the file system
driver to survive. Every time when file system driver encounters the bad block
presence then it stops the activity with: (1) unrecovered read error; (2) remount
in RO mode; (3) simple crash. It means that it needs to unmount a file system volume
(if driver hasn't crashed) and to run fsck tool. So, file system driver cannot gain
from tracking bad blocks in the special list because, mostly, it will stop the regular
operation in the case of access the bad block. Even if the file system driver extracts
the badblock list from some low-level driver then what can be done by file system
driver? Let's imagine that file system driver knows that LBA#N is bad then the best
behavior will be simply panic or remount in RO state, nothing more.

<skipped>

> As far as I can tell, all of these things remain the same. The goal here isn't to survive
> more NVM badblocks than we would've before, and lost data or
> lost metadata will continue to have the same consequences as before, and
> will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly intuitive.

OK. Nowadays, user expects that hardware is reliably enough. It's the same
situation like for NAND flash. NAND flash can have bad erase blocks. But FTL
hides this reality from a file system. Otherwise, file system should be
NAND flash oriented and to be able to manage bad erase blocks presence.
Your suggestion will increase probability of unrecoverable state of file
system volume dramatically. So, it's hard to see the point for such approach.

> Writes can get more complicated in certain cases. If it is a regular page cache
> writeback, or any aligned write that goes through the block driver, that is completely
> fine. The block driver will check that the block was previously marked as bad,
> do a "clear poison" operation (defined in the ACPI spec), which tells the firmware that
> the poison bit is not OK to be cleared, and writes the new data. This also removes
> the block from the badblocks list, and in this scheme, triggers a notification to
> the filesystem that it too can remove the block from its accounting.
> mmap writes and DAX can get more complicated, and at times they will just
>trigger a SIGBUS, and there's no way around that.

If page cache writeback finishes with writing data in valid location then
no troubles here at all. But I assume that critical point will on the read path.
Because, we still will have the same troubles as I mentioned above.

<skipped>

> Hardware does manage the actual badblocks issue for us
> in the sense that when it discovers a badblock it will do the remapping.
> But since this is on the memory bus, and has different error signatures
> than applications are used to, we want to make the error handling
> similar to the existing storage model.

So, if hardware is able to do the remapping of bad portions of memory page
then it is possible to see the valid logical page always. The key point here
that hardware controller should manage migration of data from aged/pre-bad
NVM memory ranges into valid ones. Or it needs to use some fancy
error-correction techniques or erasure coding schemes.

Thanks,
Vyacheslav Dubeyko.

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-16  2:27         ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-16  2:27 UTC (permalink / raw)
  To: Vishal Verma
  Cc: lsf-pc, linux-nvdimm, linux-block, Linux FS Devel, Viacheslav Dubeyko


-----Original Message-----
From: Vishal Verma [mailto:vishal.l.verma@intel.com]=20
Sent: Friday, January 13, 2017 4:49 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-blo=
ck@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viaches=
lav Dubeyko <slava@dubeyko.com>
Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystem=
s

<skipped>

> We don't have direct physical access to the device's address space, in th=
e sense
> the device is still free to perform remapping of chunks of NVM underneath=
 us.
> The problem is that when a block or address range (as small as a cache li=
ne) goes bad,
> the device maintains a poison bit for every affected cache line. Behind t=
he scenes,
> it may have already remapped the range, but the cache line poison has to =
be kept so that
> there is a notification to the user/owner of the data that something has =
been lost.
> Since NVM is byte addressable memory sitting on the memory bus, such a po=
isoned
> cache line results in memory errors and SIGBUSes.
> Compared to tradational storage where an app will get nice and friendly (=
relatively speaking..) -EIOs.
> The whole badblocks implementation was done so that the driver can interc=
ept IO (i.e. reads)
> to _known_ bad locations, and short-circuit them with an EIO. If the driv=
er doesn't catch these,
> the reads will turn into a memory bus access, and the poison will cause a=
 SIGBUS.
>
> This effort is to try and make this badblock checking smarter - and try a=
nd reduce the penalty
> on every IO to a smaller range, which only the filesystem can do.

I still slightly puzzled and I cannot understand why the situation looks li=
ke a dead end.
As far as I can see, first of all, a NVM device is able to use hardware-bas=
ed LDPC,
Reed-Solomon error correction or any other fancy code. It could provide som=
e error
correction basis. Also it can provide the way of estimation of BER value. S=
o, if a NVM memory's
address range degrades gradually (during weeks or months) then, practically=
, it's possible
to remap and to migrate the affected address ranges in the background. Othe=
rwise,
if a NVM memory so unreliable that address range is able to degrade during =
seconds or minutes
then who will use such NVM memory?

OK. Let's imagine that NVM memory device hasn't any internal error correcti=
on hardware-based
scheme. Next level of defense could be any erasure coding scheme on device =
driver level. So, any
piece of data can be protected by parities. And device driver will be respo=
nsible for management
of erasure coding scheme. It will increase latency of read operation for th=
e case of necessity
to recover the affected memory page. But, finally, all recovering activity =
will be behind the scene
and file system will be unaware about such recovering activity.

If you are going not to provide any erasure coding or error correction sche=
me then it's really
bad case. The fsck tool is not regular case tool but the last resort. If yo=
u are going to rely on
the fsck tool then simply forget about using your hardware. Some file syste=
ms haven't the fsck
tool at all. Some guys really believe that file system has to work without =
support of the fsck tool.
Even if a mature file system has reliable fsck tool then the probability of=
 file system recovering
is very low in the case of serious metadata corruptions. So, it means that =
you are trying to suggest
the technique when we will lose the whole file system volumes on regular ba=
sis without any hope
to recover data. Even if file system has snapshots then, again, we haven't =
hope because we can
suffer from read error and for operation with snapshot.

But if we will have support of any erasure coding scheme and NVM device dis=
covers poisoned
cache line for some memory page then, I suppose, that such situation could =
looks like as page fault
and memory subsystem will need to re-read the page with background recovery=
 of memory page's
content.

It sounds for me that we simply have some poorly designed hardware. And it =
is impossible
to push such issue on file system level. I believe that such issue can be m=
anaged by block
device or DAX subsystem in the presence of any erasure coding scheme. Other=
wise, no
file system is able to survive in such wild environment. Because, I assume =
that any file
system volume will be in unrecoverable state in 50% (or significantly more)=
 cases of bad block
discovering. Because any affection of metadata block can be resulted in sev=
erely inconsistent
state of file system's metadata structures. And it's very non-trivial task =
to recover the consistent
state of file system's metadata structures in the case of losing some part =
of it.

> > >=20
> > > A while back, Dave Chinner had suggested a move towards smarter=20
> > > handling, and I posted initial RFC patches [1], but since then the=20
> > > topic hasn't really moved forward.
> > >=20
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > >=20
> > > 1. Filesystems develop a native representation of badblocks. For=20
> > > example, in xfs, this would (presumably) be linked to the reverse=20
> > > mapping btree. The filesystem representation has the potential to be=
=20
> > > more efficient than the block driver doing the check, as the fs can=20
> > > check the IO happening on a file against just that file's range.
> >=20
> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean read or write operation? What's about metadata?
>
> For the purpose described above, i.e. returning early EIOs when possible,
> this will be limited to reads and metadata reads. If we're about to do a =
metadata
> read, and realize the block(s) about to be read are on the badblocks list=
, then
> we do the same thing as when we discover other kinds of metadata corrupti=
on.

Frankly speaking, I cannot follow how badblock list is able to help the fil=
e system
driver to survive. Every time when file system driver encounters the bad bl=
ock
presence then it stops the activity with: (1) unrecovered read error; (2) r=
emount
in RO mode; (3) simple crash. It means that it needs to unmount a file syst=
em volume
(if driver hasn't crashed) and to run fsck tool. So, file system driver can=
not gain
from tracking bad blocks in the special list because, mostly, it will stop =
the regular
operation in the case of access the bad block. Even if the file system driv=
er extracts
the badblock list from some low-level driver then what can be done by file =
system
driver? Let's imagine that file system driver knows that LBA#N is bad then =
the best
behavior will be simply panic or remount in RO state, nothing more.

<skipped>

> As far as I can tell, all of these things remain the same. The goal here =
isn't to survive
> more NVM badblocks than we would've before, and lost data or
> lost metadata will continue to have the same consequences as before, and
> will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly intuit=
ive.

OK. Nowadays, user expects that hardware is reliably enough. It's the same
situation like for NAND flash. NAND flash can have bad erase blocks. But FT=
L
hides this reality from a file system. Otherwise, file system should be
NAND flash oriented and to be able to manage bad erase blocks presence.
Your suggestion will increase probability of unrecoverable state of file
system volume dramatically. So, it's hard to see the point for such approac=
h.

> Writes can get more complicated in certain cases. If it is a regular page=
 cache
> writeback, or any aligned write that goes through the block driver, that =
is completely
> fine. The block driver will check that the block was previously marked as=
 bad,
> do a "clear poison" operation (defined in the ACPI spec), which tells the=
 firmware that
> the poison bit is not OK to be cleared, and writes the new data. This als=
o removes
> the block from the badblocks list, and in this scheme, triggers a notific=
ation to
> the filesystem that it too can remove the block from its accounting.
> mmap writes and DAX can get more complicated, and at times they will just
>trigger a SIGBUS, and there's no way around that.

If page cache writeback finishes with writing data in valid location then
no troubles here at all. But I assume that critical point will on the read =
path.
Because, we still will have the same troubles as I mentioned above.

<skipped>

> Hardware does manage the actual badblocks issue for us
> in the sense that when it discovers a badblock it will do the remapping.
> But since this is on the memory bus, and has different error signatures
> than applications are used to, we want to make the error handling
> similar to the existing storage model.

So, if hardware is able to do the remapping of bad portions of memory page
then it is possible to see the valid logical page always. The key point her=
e
that hardware controller should manage migration of data from aged/pre-bad
NVM memory ranges into valid ones. Or it needs to use some fancy
error-correction techniques or erasure coding schemes.

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-16  2:27         ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-16  2:27 UTC (permalink / raw)
  To: Vishal Verma
  Cc: lsf-pc, linux-nvdimm, linux-block, Linux FS Devel, Viacheslav Dubeyko


-----Original Message-----
From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
Sent: Friday, January 13, 2017 4:49 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

<skipped>

> We don't have direct physical access to the device's address space, in the sense
> the device is still free to perform remapping of chunks of NVM underneath us.
> The problem is that when a block or address range (as small as a cache line) goes bad,
> the device maintains a poison bit for every affected cache line. Behind the scenes,
> it may have already remapped the range, but the cache line poison has to be kept so that
> there is a notification to the user/owner of the data that something has been lost.
> Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> cache line results in memory errors and SIGBUSes.
> Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
>
> This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> on every IO to a smaller range, which only the filesystem can do.

I still slightly puzzled and I cannot understand why the situation looks like a dead end.
As far as I can see, first of all, a NVM device is able to use hardware-based LDPC,
Reed-Solomon error correction or any other fancy code. It could provide some error
correction basis. Also it can provide the way of estimation of BER value. So, if a NVM memory's
address range degrades gradually (during weeks or months) then, practically, it's possible
to remap and to migrate the affected address ranges in the background. Otherwise,
if a NVM memory so unreliable that address range is able to degrade during seconds or minutes
then who will use such NVM memory?

OK. Let's imagine that NVM memory device hasn't any internal error correction hardware-based
scheme. Next level of defense could be any erasure coding scheme on device driver level. So, any
piece of data can be protected by parities. And device driver will be responsible for management
of erasure coding scheme. It will increase latency of read operation for the case of necessity
to recover the affected memory page. But, finally, all recovering activity will be behind the scene
and file system will be unaware about such recovering activity.

If you are going not to provide any erasure coding or error correction scheme then it's really
bad case. The fsck tool is not regular case tool but the last resort. If you are going to rely on
the fsck tool then simply forget about using your hardware. Some file systems haven't the fsck
tool at all. Some guys really believe that file system has to work without support of the fsck tool.
Even if a mature file system has reliable fsck tool then the probability of file system recovering
is very low in the case of serious metadata corruptions. So, it means that you are trying to suggest
the technique when we will lose the whole file system volumes on regular basis without any hope
to recover data. Even if file system has snapshots then, again, we haven't hope because we can
suffer from read error and for operation with snapshot.

But if we will have support of any erasure coding scheme and NVM device discovers poisoned
cache line for some memory page then, I suppose, that such situation could looks like as page fault
and memory subsystem will need to re-read the page with background recovery of memory page's
content.

It sounds for me that we simply have some poorly designed hardware. And it is impossible
to push such issue on file system level. I believe that such issue can be managed by block
device or DAX subsystem in the presence of any erasure coding scheme. Otherwise, no
file system is able to survive in such wild environment. Because, I assume that any file
system volume will be in unrecoverable state in 50% (or significantly more) cases of bad block
discovering. Because any affection of metadata block can be resulted in severely inconsistent
state of file system's metadata structures. And it's very non-trivial task to recover the consistent
state of file system's metadata structures in the case of losing some part of it.

> > > 
> > > A while back, Dave Chinner had suggested a move towards smarter 
> > > handling, and I posted initial RFC patches [1], but since then the 
> > > topic hasn't really moved forward.
> > > 
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > > 
> > > 1. Filesystems develop a native representation of badblocks. For 
> > > example, in xfs, this would (presumably) be linked to the reverse 
> > > mapping btree. The filesystem representation has the potential to be 
> > > more efficient than the block driver doing the check, as the fs can 
> > > check the IO happening on a file against just that file's range.
> > 
> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean read or write operation? What's about metadata?
>
> For the purpose described above, i.e. returning early EIOs when possible,
> this will be limited to reads and metadata reads. If we're about to do a metadata
> read, and realize the block(s) about to be read are on the badblocks list, then
> we do the same thing as when we discover other kinds of metadata corruption.

Frankly speaking, I cannot follow how badblock list is able to help the file system
driver to survive. Every time when file system driver encounters the bad block
presence then it stops the activity with: (1) unrecovered read error; (2) remount
in RO mode; (3) simple crash. It means that it needs to unmount a file system volume
(if driver hasn't crashed) and to run fsck tool. So, file system driver cannot gain
from tracking bad blocks in the special list because, mostly, it will stop the regular
operation in the case of access the bad block. Even if the file system driver extracts
the badblock list from some low-level driver then what can be done by file system
driver? Let's imagine that file system driver knows that LBA#N is bad then the best
behavior will be simply panic or remount in RO state, nothing more.

<skipped>

> As far as I can tell, all of these things remain the same. The goal here isn't to survive
> more NVM badblocks than we would've before, and lost data or
> lost metadata will continue to have the same consequences as before, and
> will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly intuitive.

OK. Nowadays, user expects that hardware is reliably enough. It's the same
situation like for NAND flash. NAND flash can have bad erase blocks. But FTL
hides this reality from a file system. Otherwise, file system should be
NAND flash oriented and to be able to manage bad erase blocks presence.
Your suggestion will increase probability of unrecoverable state of file
system volume dramatically. So, it's hard to see the point for such approach.

> Writes can get more complicated in certain cases. If it is a regular page cache
> writeback, or any aligned write that goes through the block driver, that is completely
> fine. The block driver will check that the block was previously marked as bad,
> do a "clear poison" operation (defined in the ACPI spec), which tells the firmware that
> the poison bit is not OK to be cleared, and writes the new data. This also removes
> the block from the badblocks list, and in this scheme, triggers a notification to
> the filesystem that it too can remove the block from its accounting.
> mmap writes and DAX can get more complicated, and at times they will just
>trigger a SIGBUS, and there's no way around that.

If page cache writeback finishes with writing data in valid location then
no troubles here at all. But I assume that critical point will on the read path.
Because, we still will have the same troubles as I mentioned above.

<skipped>

> Hardware does manage the actual badblocks issue for us
> in the sense that when it discovers a badblock it will do the remapping.
> But since this is on the memory bus, and has different error signatures
> than applications are used to, we want to make the error handling
> similar to the existing storage model.

So, if hardware is able to do the remapping of bad portions of memory page
then it is possible to see the valid logical page always. The key point here
that hardware controller should manage migration of data from aged/pre-bad
NVM memory ranges into valid ones. Or it needs to use some fancy
error-correction techniques or erasure coding schemes.

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-14  0:49       ` Vishal Verma
@ 2017-01-17  6:33         ` Darrick J. Wong
  -1 siblings, 0 replies; 89+ messages in thread
From: Darrick J. Wong @ 2017-01-17  6:33 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> On 01/14, Slava Dubeyko wrote:
> > 
> > ---- Original Message ----
> > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > Sent: Jan 13, 2017 1:40 PM
> > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > To: lsf-pc@lists.linux-foundation.org
> > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> > 
> > > The current implementation of badblocks, where we consult the badblocks
> > > list for every IO in the block driver works, and is a last option
> > > failsafe, but from a user perspective, it isn't the easiest interface to
> > > work with.
> > 
> > As I remember, FAT and HFS+ specifications contain description of bad blocks
> > (physical sectors) table. I believe that this table was used for the case of
> > floppy media. But, finally, this table becomes to be the completely obsolete
> > artefact because mostly storage devices are reliably enough. Why do you need

ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
doesn't support(??) extents or 64-bit filesystems, and might just be a
vestigial organ at this point.  XFS doesn't have anything to track bad
blocks currently....

> > in exposing the bad blocks on the file system level?  Do you expect that next
> > generation of NVM memory will be so unreliable that file system needs to manage
> > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> > from the bad block issue? 
> > 
> > Usually, we are using LBAs and it is the responsibility of storage device to map
> > a bad physical block/page/sector into valid one. Do you mean that we have
> > access to physical NVM memory address directly? But it looks like that we can
> > have a "bad block" issue even we will access data into page cache's memory
> > page (if we will use NVM memory for page cache, of course). So, what do you
> > imply by "bad block" issue? 
> 
> We don't have direct physical access to the device's address space, in
> the sense the device is still free to perform remapping of chunks of NVM
> underneath us. The problem is that when a block or address range (as
> small as a cache line) goes bad, the device maintains a poison bit for
> every affected cache line. Behind the scenes, it may have already
> remapped the range, but the cache line poison has to be kept so that
> there is a notification to the user/owner of the data that something has
> been lost. Since NVM is byte addressable memory sitting on the memory
> bus, such a poisoned cache line results in memory errors and SIGBUSes.
> Compared to tradational storage where an app will get nice and friendly
> (relatively speaking..) -EIOs. The whole badblocks implementation was
> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> locations, and short-circuit them with an EIO. If the driver doesn't
> catch these, the reads will turn into a memory bus access, and the
> poison will cause a SIGBUS.

"driver" ... you mean XFS?  Or do you mean the thing that makes pmem
look kind of like a traditional block device? :)

> This effort is to try and make this badblock checking smarter - and try
> and reduce the penalty on every IO to a smaller range, which only the
> filesystem can do.

Though... now that XFS merged the reverse mapping support, I've been
wondering if there'll be a resubmission of the device errors callback?
It still would be useful to be able to inform the user that part of
their fs has gone bad, or, better yet, if the buffer is still in memory
someplace else, just write it back out.

Or I suppose if we had some kind of raid1 set up between memories we
could read one of the other copies and rewrite it into the failing
region immediately.

> > > A while back, Dave Chinner had suggested a move towards smarter
> > > handling, and I posted initial RFC patches [1], but since then the topic
> > > hasn't really moved forward.
> > > 
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > > 
> > > 1. Filesystems develop a native representation of badblocks. For
> > > example, in xfs, this would (presumably) be linked to the reverse
> > > mapping btree. The filesystem representation has the potential to be 
> > > more efficient than the block driver doing the check, as the fs can
> > > check the IO happening on a file against just that file's range. 

OTOH that means we'd have to check /every/ file IO request against the
rmapbt, which will make things reaaaaaally slow.  I suspect it might be
preferable just to let the underlying pmem driver throw an error at us.

(Or possibly just cache the bad extents in memory.)

> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean read or write operation? What's about metadata?
> 
> For the purpose described above, i.e. returning early EIOs when
> possible, this will be limited to reads and metadata reads. If we're
> about to do a metadata read, and realize the block(s) about to be read
> are on the badblocks list, then we do the same thing as when we discover
> other kinds of metadata corruption.

...fail and shut down? :)

Actually, for metadata either we look at the xfs_bufs to see if it's in
memory (XFS doesn't directly access metadata) and write it back out; or
we could fire up the online repair tool to rebuild the metadata.

> > If we are talking about the discovering a bad block on read operation then
> > rare modern file system is able to survive as for the case of metadata as
> > for the case of user data. Let's imagine that we have really mature file
> > system driver then what does it mean to encounter a bad block? The failure
> > to read a logical block of some metadata (bad block) means that we are
> > unable to extract some part of a metadata structure. From file system
> > driver point of view, it looks like that our file system is corrupted, we need
> > to stop the file system operations and, finally, to check and recover file
> > system volume by means of fsck tool. If we find a bad block for some
> > user file then, again, it looks like an issue. Some file systems simply
> > return "unrecovered read error". Another one, theoretically, is able
> > to survive because of snapshots, for example. But, anyway, it will look
> > like as Read-Only mount state and the user will need to resolve such
> > trouble by hands.
> 
> As far as I can tell, all of these things remain the same. The goal here
> isn't to survive more NVM badblocks than we would've before, and lost
> data or lost metadata will continue to have the same consequences as
> before, and will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly
> intuitive.
> 
> > 
> > If we are talking about discovering a bad block during write operation then,
> > again, we are in trouble. Usually, we are using asynchronous model
> > of write/flush operation. We are preparing the consistent state of all our
> > metadata structures in the memory, at first. The flush operations for metadata
> > and user data can be done in different times. And what should be done if we
> > discover bad block for any piece of metadata or user data? Simple tracking of
> > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> > write some file's block successfully then we have two ways: (1) forget about
> > this piece of data; (2) try to change the associated LBA for this piece of data.
> > The operation of re-allocation LBA number for discovered bad block
> > (user data case) sounds as real pain. Because you need to rebuild the metadata
> > that track the location of this part of file. And it sounds as practically
> > impossible operation, for the case of LFS file system, for example.
> > If we have trouble with flushing any part of metadata then it sounds as
> > complete disaster for any file system.
> 
> Writes can get more complicated in certain cases. If it is a regular
> page cache writeback, or any aligned write that goes through the block
> driver, that is completely fine. The block driver will check that the
> block was previously marked as bad, do a "clear poison" operation
> (defined in the ACPI spec), which tells the firmware that the poison bit
> is not OK to be cleared, and writes the new data. This also removes the
> block from the badblocks list, and in this scheme, triggers a
> notification to the filesystem that it too can remove the block from its
> accounting. mmap writes and DAX can get more complicated, and at times
> they will just trigger a SIGBUS, and there's no way around that.
> 
> > 
> > Are you really sure that file system should process bad block issue?
> > 
> > >In contrast, today, the block driver checks against the whole block device
> > > range for every IO. On encountering badblocks, the filesystem can
> > > generate a better notification/error message that points the user to 
> > > (file, offset) as opposed to the block driver, which can only provide
> > > (block-device, sector).

<shrug> We can do the translation with the backref info...

> > > 2. The block layer adds a notifier to badblock addition/removal
> > > operations, which the filesystem subscribes to, and uses to maintain its
> > > badblocks accounting. (This part is implemented as a proof of concept in
> > > the RFC mentioned above [1]).
> > 
> > I am not sure that any bad block notification during/after IO operation
> > is valuable for file system. Maybe, it could help if file system simply will
> > know about bad block beforehand the operation of logical block allocation.
> > But what subsystem will discover bad blocks before any IO operations?
> > How file system will receive information or some bad block table?
> 
> The driver populates its badblocks lists whenever an Address Range Scrub
> is started (also via ACPI methods). This is always done at
> initialization time, so that it can build an in-memory representation of
> the badblocks. Additionally, this can also be triggered manually. And
> finally badblocks can also get populated for new latent errors when a
> machine check exception occurs. All of these can trigger notification to
> the file system without actual user reads happening.
> 
> > I am not convinced that suggested badblocks approach is really feasible.
> > Also I am not sure that file system should see the bad blocks at all.
> > Why hardware cannot manage this issue for us?
> 
> Hardware does manage the actual badblocks issue for us in the sense that
> when it discovers a badblock it will do the remapping. But since this is
> on the memory bus, and has different error signatures than applications
> are used to, we want to make the error handling similar to the existing
> storage model.

Yes please and thank you, to the "error handling similar to the existing
storage model".  Even better if this just gets added to a layer
underneath the fs so that IO to bad regions returns EIO. 8-)

(Sleeeeep...)

--D

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17  6:33         ` Darrick J. Wong
  0 siblings, 0 replies; 89+ messages in thread
From: Darrick J. Wong @ 2017-01-17  6:33 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, lsf-pc, linux-nvdimm@lists.01.org, linux-block,
	Linux FS Devel, Viacheslav Dubeyko

On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> On 01/14, Slava Dubeyko wrote:
> > 
> > ---- Original Message ----
> > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > Sent: Jan 13, 2017 1:40 PM
> > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > To: lsf-pc@lists.linux-foundation.org
> > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> > 
> > > The current implementation of badblocks, where we consult the badblocks
> > > list for every IO in the block driver works, and is a last option
> > > failsafe, but from a user perspective, it isn't the easiest interface to
> > > work with.
> > 
> > As I remember, FAT and HFS+ specifications contain description of bad blocks
> > (physical sectors) table. I believe that this table was used for the case of
> > floppy media. But, finally, this table becomes to be the completely obsolete
> > artefact because mostly storage devices are reliably enough. Why do you need

ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
doesn't support(??) extents or 64-bit filesystems, and might just be a
vestigial organ at this point.  XFS doesn't have anything to track bad
blocks currently....

> > in exposing the bad blocks on the file system level?  Do you expect that next
> > generation of NVM memory will be so unreliable that file system needs to manage
> > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> > from the bad block issue? 
> > 
> > Usually, we are using LBAs and it is the responsibility of storage device to map
> > a bad physical block/page/sector into valid one. Do you mean that we have
> > access to physical NVM memory address directly? But it looks like that we can
> > have a "bad block" issue even we will access data into page cache's memory
> > page (if we will use NVM memory for page cache, of course). So, what do you
> > imply by "bad block" issue? 
> 
> We don't have direct physical access to the device's address space, in
> the sense the device is still free to perform remapping of chunks of NVM
> underneath us. The problem is that when a block or address range (as
> small as a cache line) goes bad, the device maintains a poison bit for
> every affected cache line. Behind the scenes, it may have already
> remapped the range, but the cache line poison has to be kept so that
> there is a notification to the user/owner of the data that something has
> been lost. Since NVM is byte addressable memory sitting on the memory
> bus, such a poisoned cache line results in memory errors and SIGBUSes.
> Compared to tradational storage where an app will get nice and friendly
> (relatively speaking..) -EIOs. The whole badblocks implementation was
> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> locations, and short-circuit them with an EIO. If the driver doesn't
> catch these, the reads will turn into a memory bus access, and the
> poison will cause a SIGBUS.

"driver" ... you mean XFS?  Or do you mean the thing that makes pmem
look kind of like a traditional block device? :)

> This effort is to try and make this badblock checking smarter - and try
> and reduce the penalty on every IO to a smaller range, which only the
> filesystem can do.

Though... now that XFS merged the reverse mapping support, I've been
wondering if there'll be a resubmission of the device errors callback?
It still would be useful to be able to inform the user that part of
their fs has gone bad, or, better yet, if the buffer is still in memory
someplace else, just write it back out.

Or I suppose if we had some kind of raid1 set up between memories we
could read one of the other copies and rewrite it into the failing
region immediately.

> > > A while back, Dave Chinner had suggested a move towards smarter
> > > handling, and I posted initial RFC patches [1], but since then the topic
> > > hasn't really moved forward.
> > > 
> > > I'd like to propose and have a discussion about the following new
> > > functionality:
> > > 
> > > 1. Filesystems develop a native representation of badblocks. For
> > > example, in xfs, this would (presumably) be linked to the reverse
> > > mapping btree. The filesystem representation has the potential to be�
> > > more efficient than the block driver doing the check, as the fs can
> > > check the IO happening on a file against just that file's range. 

OTOH that means we'd have to check /every/ file IO request against the
rmapbt, which will make things reaaaaaally slow.  I suspect it might be
preferable just to let the underlying pmem driver throw an error at us.

(Or possibly just cache the bad extents in memory.)

> > What do you mean by "file system can check the IO happening on a file"?
> > Do you mean read or write operation? What's about metadata?
> 
> For the purpose described above, i.e. returning early EIOs when
> possible, this will be limited to reads and metadata reads. If we're
> about to do a metadata read, and realize the block(s) about to be read
> are on the badblocks list, then we do the same thing as when we discover
> other kinds of metadata corruption.

...fail and shut down? :)

Actually, for metadata either we look at the xfs_bufs to see if it's in
memory (XFS doesn't directly access metadata) and write it back out; or
we could fire up the online repair tool to rebuild the metadata.

> > If we are talking about the discovering a bad block on read operation then
> > rare modern file system is able to survive as for the case of metadata as
> > for the case of user data. Let's imagine that we have really mature file
> > system driver then what does it mean to encounter a bad block? The failure
> > to read a logical block of some metadata (bad block) means that we are
> > unable to extract some part of a metadata structure. From file system
> > driver point of view, it looks like that our file system is corrupted, we need
> > to stop the file system operations and, finally, to check and recover file
> > system volume by means of fsck tool. If we find a bad block for some
> > user file then, again, it looks like an issue. Some file systems simply
> > return "unrecovered read error". Another one, theoretically, is able
> > to survive because of snapshots, for example. But, anyway, it will look
> > like as Read-Only mount state and the user will need to resolve such
> > trouble by hands.
> 
> As far as I can tell, all of these things remain the same. The goal here
> isn't to survive more NVM badblocks than we would've before, and lost
> data or lost metadata will continue to have the same consequences as
> before, and will need the same recovery actions/intervention as before.
> The goal is to make the failure model similar to what users expect
> today, and as much as possible make recovery actions too similarly
> intuitive.
> 
> > 
> > If we are talking about discovering a bad block during write operation then,
> > again, we are in trouble. Usually, we are using asynchronous model
> > of write/flush operation. We are preparing the consistent state of all our
> > metadata structures in the memory, at first. The flush operations for metadata
> > and user data can be done in different times. And what should be done if we
> > discover bad block for any piece of metadata or user data? Simple tracking of
> > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> > write some file's block successfully then we have two ways: (1) forget about
> > this piece of data; (2) try to change the associated LBA for this piece of data.
> > The operation of re-allocation LBA number for discovered bad block
> > (user data case) sounds as real pain. Because you need to rebuild the metadata
> > that track the location of this part of file. And it sounds as practically
> > impossible operation, for the case of LFS file system, for example.
> > If we have trouble with flushing any part of metadata then it sounds as
> > complete disaster for any file system.
> 
> Writes can get more complicated in certain cases. If it is a regular
> page cache writeback, or any aligned write that goes through the block
> driver, that is completely fine. The block driver will check that the
> block was previously marked as bad, do a "clear poison" operation
> (defined in the ACPI spec), which tells the firmware that the poison bit
> is not OK to be cleared, and writes the new data. This also removes the
> block from the badblocks list, and in this scheme, triggers a
> notification to the filesystem that it too can remove the block from its
> accounting. mmap writes and DAX can get more complicated, and at times
> they will just trigger a SIGBUS, and there's no way around that.
> 
> > 
> > Are you really sure that file system should process bad block issue?
> > 
> > >In contrast, today, the block driver checks against the whole block device
> > > range for every IO. On encountering badblocks, the filesystem can
> > > generate a better notification/error message that points the user to�
> > > (file, offset) as opposed to the block driver, which can only provide
> > > (block-device, sector).

<shrug> We can do the translation with the backref info...

> > > 2. The block layer adds a notifier to badblock addition/removal
> > > operations, which the filesystem subscribes to, and uses to maintain its
> > > badblocks accounting. (This part is implemented as a proof of concept in
> > > the RFC mentioned above [1]).
> > 
> > I am not sure that any bad block notification during/after IO operation
> > is valuable for file system. Maybe, it could help if file system simply will
> > know about bad block beforehand the operation of logical block allocation.
> > But what subsystem will discover bad blocks before any IO operations?
> > How file system will receive information or some bad block table?
> 
> The driver populates its badblocks lists whenever an Address Range Scrub
> is started (also via ACPI methods). This is always done at
> initialization time, so that it can build an in-memory representation of
> the badblocks. Additionally, this can also be triggered manually. And
> finally badblocks can also get populated for new latent errors when a
> machine check exception occurs. All of these can trigger notification to
> the file system without actual user reads happening.
> 
> > I am not convinced that suggested badblocks approach is really feasible.
> > Also I am not sure that file system should see the bad blocks at all.
> > Why hardware cannot manage this issue for us?
> 
> Hardware does manage the actual badblocks issue for us in the sense that
> when it discovers a badblock it will do the remapping. But since this is
> on the memory bus, and has different error signatures than applications
> are used to, we want to make the error handling similar to the existing
> storage model.

Yes please and thank you, to the "error handling similar to the existing
storage model".  Even better if this just gets added to a layer
underneath the fs so that IO to bad regions returns EIO. 8-)

(Sleeeeep...)

--D

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-16  2:27         ` Slava Dubeyko
@ 2017-01-17 14:37           ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-17 14:37 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: linux-nvdimm, linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Mon 16-01-17 02:27:52, Slava Dubeyko wrote:
> 
> -----Original Message-----
> From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
> Sent: Friday, January 13, 2017 4:49 PM
> To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
> Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
> Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> 
> <skipped>
> 
> > We don't have direct physical access to the device's address space, in the sense
> > the device is still free to perform remapping of chunks of NVM underneath us.
> > The problem is that when a block or address range (as small as a cache line) goes bad,
> > the device maintains a poison bit for every affected cache line. Behind the scenes,
> > it may have already remapped the range, but the cache line poison has to be kept so that
> > there is a notification to the user/owner of the data that something has been lost.
> > Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> > cache line results in memory errors and SIGBUSes.
> > Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> > to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> > the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> >
> > This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> > on every IO to a smaller range, which only the filesystem can do.
> 
> I still slightly puzzled and I cannot understand why the situation looks
> like a dead end.  As far as I can see, first of all, a NVM device is able
> to use hardware-based LDPC, Reed-Solomon error correction or any other
> fancy code. It could provide some error correction basis. Also it can
> provide the way of estimation of BER value. So, if a NVM memory's address
> range degrades gradually (during weeks or months) then, practically, it's
> possible to remap and to migrate the affected address ranges in the
> background. Otherwise, if a NVM memory so unreliable that address range
> is able to degrade during seconds or minutes then who will use such NVM
> memory?

Well, the situation with NVM is more like with DRAM AFAIU. It is quite
reliable but given the size the probability *some* cell has degraded is
quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
when you try to read such cell. As Vishal wrote, the hardware does some
background scrubbing and relocates stuff early if needed but nothing is 100%.

The reason why we play games with badblocks is to avoid those MCEs (i.e.,
even trying to read the data we know that are bad). Even if it would be
rare event, MCE may mean the machine just immediately reboots (although I
find such platforms hardly usable with NVM then) and that is no good. And
even on hardware platforms that allow for more graceful recovery from MCE
it is asynchronous in its nature and our error handling
around IO is all synchronous so it is difficult to join these two models
together.

But I think it is a good question to ask whether we cannot improve on MCE
handling instead of trying to avoid them and pushing around responsibility
for handling bad blocks. Actually I thought someone was working on that.
Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
well identified anyway so that we can consult the badblocks list) so that
it MCE happens during these accesses, we note it somewhere and at the end
of the magic block we will just pick up the errors and report them back?

> OK. Let's imagine that NVM memory device hasn't any internal error
> correction hardware-based scheme. Next level of defense could be any
> erasure coding scheme on device driver level. So, any piece of data can
> be protected by parities. And device driver will be responsible for
> management of erasure coding scheme. It will increase latency of read
> operation for the case of necessity to recover the affected memory page.
> But, finally, all recovering activity will be behind the scene and file
> system will be unaware about such recovering activity.

Note that your options are limited by the byte addressability and the
direct CPU access to the memory. But even with these limitations it is not
that error rate would but unusually high, it is just not zero.
 
> If you are going not to provide any erasure coding or error correction
> scheme then it's really bad case. The fsck tool is not regular case tool
> but the last resort. If you are going to rely on the fsck tool then
> simply forget about using your hardware. Some file systems haven't the
> fsck tool at all. Some guys really believe that file system has to work
> without support of the fsck tool.  Even if a mature file system has
> reliable fsck tool then the probability of file system recovering is very
> low in the case of serious metadata corruptions. So, it means that you
> are trying to suggest the technique when we will lose the whole file
> system volumes on regular basis without any hope to recover data. Even if
> file system has snapshots then, again, we haven't hope because we can
> suffer from read error and for operation with snapshot.

I hope I have cleared out that this is not about higher error rate of
persistent memory above. As a side note, XFS guys are working on automatic
background scrubbing and online filesystem checking. Not specifically for
persistent memory but simply because with growing size of the filesystem
the likelihood of some problem somewhere is growing. 
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 14:37           ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-17 14:37 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Vishal Verma, linux-block, Linux FS Devel, lsf-pc,
	Viacheslav Dubeyko, linux-nvdimm

On Mon 16-01-17 02:27:52, Slava Dubeyko wrote:
> 
> -----Original Message-----
> From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
> Sent: Friday, January 13, 2017 4:49 PM
> To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
> Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
> Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> 
> <skipped>
> 
> > We don't have direct physical access to the device's address space, in the sense
> > the device is still free to perform remapping of chunks of NVM underneath us.
> > The problem is that when a block or address range (as small as a cache line) goes bad,
> > the device maintains a poison bit for every affected cache line. Behind the scenes,
> > it may have already remapped the range, but the cache line poison has to be kept so that
> > there is a notification to the user/owner of the data that something has been lost.
> > Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> > cache line results in memory errors and SIGBUSes.
> > Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> > to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> > the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> >
> > This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> > on every IO to a smaller range, which only the filesystem can do.
> 
> I still slightly puzzled and I cannot understand why the situation looks
> like a dead end.  As far as I can see, first of all, a NVM device is able
> to use hardware-based LDPC, Reed-Solomon error correction or any other
> fancy code. It could provide some error correction basis. Also it can
> provide the way of estimation of BER value. So, if a NVM memory's address
> range degrades gradually (during weeks or months) then, practically, it's
> possible to remap and to migrate the affected address ranges in the
> background. Otherwise, if a NVM memory so unreliable that address range
> is able to degrade during seconds or minutes then who will use such NVM
> memory?

Well, the situation with NVM is more like with DRAM AFAIU. It is quite
reliable but given the size the probability *some* cell has degraded is
quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
when you try to read such cell. As Vishal wrote, the hardware does some
background scrubbing and relocates stuff early if needed but nothing is 100%.

The reason why we play games with badblocks is to avoid those MCEs (i.e.,
even trying to read the data we know that are bad). Even if it would be
rare event, MCE may mean the machine just immediately reboots (although I
find such platforms hardly usable with NVM then) and that is no good. And
even on hardware platforms that allow for more graceful recovery from MCE
it is asynchronous in its nature and our error handling
around IO is all synchronous so it is difficult to join these two models
together.

But I think it is a good question to ask whether we cannot improve on MCE
handling instead of trying to avoid them and pushing around responsibility
for handling bad blocks. Actually I thought someone was working on that.
Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
well identified anyway so that we can consult the badblocks list) so that
it MCE happens during these accesses, we note it somewhere and at the end
of the magic block we will just pick up the errors and report them back?

> OK. Let's imagine that NVM memory device hasn't any internal error
> correction hardware-based scheme. Next level of defense could be any
> erasure coding scheme on device driver level. So, any piece of data can
> be protected by parities. And device driver will be responsible for
> management of erasure coding scheme. It will increase latency of read
> operation for the case of necessity to recover the affected memory page.
> But, finally, all recovering activity will be behind the scene and file
> system will be unaware about such recovering activity.

Note that your options are limited by the byte addressability and the
direct CPU access to the memory. But even with these limitations it is not
that error rate would but unusually high, it is just not zero.
 
> If you are going not to provide any erasure coding or error correction
> scheme then it's really bad case. The fsck tool is not regular case tool
> but the last resort. If you are going to rely on the fsck tool then
> simply forget about using your hardware. Some file systems haven't the
> fsck tool at all. Some guys really believe that file system has to work
> without support of the fsck tool.  Even if a mature file system has
> reliable fsck tool then the probability of file system recovering is very
> low in the case of serious metadata corruptions. So, it means that you
> are trying to suggest the technique when we will lose the whole file
> system volumes on regular basis without any hope to recover data. Even if
> file system has snapshots then, again, we haven't hope because we can
> suffer from read error and for operation with snapshot.

I hope I have cleared out that this is not about higher error rate of
persistent memory above. As a side note, XFS guys are working on automatic
background scrubbing and online filesystem checking. Not specifically for
persistent memory but simply because with growing size of the filesystem
the likelihood of some problem somewhere is growing. 
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 14:37           ` Jan Kara
@ 2017-01-17 15:08             ` Christoph Hellwig
  -1 siblings, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2017-01-17 15:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, linux-nvdimm, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On Tue, Jan 17, 2017 at 03:37:03PM +0100, Jan Kara wrote:
> Well, the situation with NVM is more like with DRAM AFAIU. It is quite
> reliable but given the size the probability *some* cell has degraded is
> quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
> when you try to read such cell. As Vishal wrote, the hardware does some
> background scrubbing and relocates stuff early if needed but nothing is 100%.

Based on publically available papers and little information leaks
there is no persistent NVM that comes even close to the error rate
for DRAM - they all appear to be magnitudes worse.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 15:08             ` Christoph Hellwig
  0 siblings, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2017-01-17 15:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, Vishal Verma, linux-block, Linux FS Devel, lsf-pc,
	Viacheslav Dubeyko, linux-nvdimm

On Tue, Jan 17, 2017 at 03:37:03PM +0100, Jan Kara wrote:
> Well, the situation with NVM is more like with DRAM AFAIU. It is quite
> reliable but given the size the probability *some* cell has degraded is
> quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
> when you try to read such cell. As Vishal wrote, the hardware does some
> background scrubbing and relocates stuff early if needed but nothing is 100%.

Based on publically available papers and little information leaks
there is no persistent NVM that comes even close to the error rate
for DRAM - they all appear to be magnitudes worse.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17  6:33         ` Darrick J. Wong
@ 2017-01-17 21:35           ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 21:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Slava Dubeyko, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On 01/16, Darrick J. Wong wrote:
> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > On 01/14, Slava Dubeyko wrote:
> > > 
> > > ---- Original Message ----
> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > > Sent: Jan 13, 2017 1:40 PM
> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > > To: lsf-pc@lists.linux-foundation.org
> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> > > 
> > > > The current implementation of badblocks, where we consult the badblocks
> > > > list for every IO in the block driver works, and is a last option
> > > > failsafe, but from a user perspective, it isn't the easiest interface to
> > > > work with.
> > > 
> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
> > > (physical sectors) table. I believe that this table was used for the case of
> > > floppy media. But, finally, this table becomes to be the completely obsolete
> > > artefact because mostly storage devices are reliably enough. Why do you need
> 
> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
> doesn't support(??) extents or 64-bit filesystems, and might just be a
> vestigial organ at this point.  XFS doesn't have anything to track bad
> blocks currently....
> 
> > > in exposing the bad blocks on the file system level?  Do you expect that next
> > > generation of NVM memory will be so unreliable that file system needs to manage
> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> > > from the bad block issue? 
> > > 
> > > Usually, we are using LBAs and it is the responsibility of storage device to map
> > > a bad physical block/page/sector into valid one. Do you mean that we have
> > > access to physical NVM memory address directly? But it looks like that we can
> > > have a "bad block" issue even we will access data into page cache's memory
> > > page (if we will use NVM memory for page cache, of course). So, what do you
> > > imply by "bad block" issue? 
> > 
> > We don't have direct physical access to the device's address space, in
> > the sense the device is still free to perform remapping of chunks of NVM
> > underneath us. The problem is that when a block or address range (as
> > small as a cache line) goes bad, the device maintains a poison bit for
> > every affected cache line. Behind the scenes, it may have already
> > remapped the range, but the cache line poison has to be kept so that
> > there is a notification to the user/owner of the data that something has
> > been lost. Since NVM is byte addressable memory sitting on the memory
> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
> > Compared to tradational storage where an app will get nice and friendly
> > (relatively speaking..) -EIOs. The whole badblocks implementation was
> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
> > locations, and short-circuit them with an EIO. If the driver doesn't
> > catch these, the reads will turn into a memory bus access, and the
> > poison will cause a SIGBUS.
> 
> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> look kind of like a traditional block device? :)

Yes, the thing that makes pmem look like a block device :) --
drivers/nvdimm/pmem.c

> 
> > This effort is to try and make this badblock checking smarter - and try
> > and reduce the penalty on every IO to a smaller range, which only the
> > filesystem can do.
> 
> Though... now that XFS merged the reverse mapping support, I've been
> wondering if there'll be a resubmission of the device errors callback?
> It still would be useful to be able to inform the user that part of
> their fs has gone bad, or, better yet, if the buffer is still in memory
> someplace else, just write it back out.
> 
> Or I suppose if we had some kind of raid1 set up between memories we
> could read one of the other copies and rewrite it into the failing
> region immediately.

Yes, that is kind of what I was hoping to accomplish via this
discussion. How much would filesystems want to be involved in this sort
of badblocks handling, if at all. I can refresh my patches that provide
the fs notification, but that's the easy bit, and a starting point.

> 
> > > > A while back, Dave Chinner had suggested a move towards smarter
> > > > handling, and I posted initial RFC patches [1], but since then the topic
> > > > hasn't really moved forward.
> > > > 
> > > > I'd like to propose and have a discussion about the following new
> > > > functionality:
> > > > 
> > > > 1. Filesystems develop a native representation of badblocks. For
> > > > example, in xfs, this would (presumably) be linked to the reverse
> > > > mapping btree. The filesystem representation has the potential to be 
> > > > more efficient than the block driver doing the check, as the fs can
> > > > check the IO happening on a file against just that file's range. 
> 
> OTOH that means we'd have to check /every/ file IO request against the
> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
> preferable just to let the underlying pmem driver throw an error at us.
> 
> (Or possibly just cache the bad extents in memory.)

Interesting - this would be a good discussion to have. My motivation for
this was the reasoning that the pmem driver has to check every single IO
against badblocks, and maybe the fs can do a better job. But if you
think the fs will actually be slower, we should try to somehow benchmark
that!

> 
> > > What do you mean by "file system can check the IO happening on a file"?
> > > Do you mean read or write operation? What's about metadata?
> > 
> > For the purpose described above, i.e. returning early EIOs when
> > possible, this will be limited to reads and metadata reads. If we're
> > about to do a metadata read, and realize the block(s) about to be read
> > are on the badblocks list, then we do the same thing as when we discover
> > other kinds of metadata corruption.
> 
> ...fail and shut down? :)
> 
> Actually, for metadata either we look at the xfs_bufs to see if it's in
> memory (XFS doesn't directly access metadata) and write it back out; or
> we could fire up the online repair tool to rebuild the metadata.

Agreed, I was just stressing that this scenario does not change from
status quo, and really recovering from corruption isn't the problem
we're trying to solve here :)

> 
> > > If we are talking about the discovering a bad block on read operation then
> > > rare modern file system is able to survive as for the case of metadata as
> > > for the case of user data. Let's imagine that we have really mature file
> > > system driver then what does it mean to encounter a bad block? The failure
> > > to read a logical block of some metadata (bad block) means that we are
> > > unable to extract some part of a metadata structure. From file system
> > > driver point of view, it looks like that our file system is corrupted, we need
> > > to stop the file system operations and, finally, to check and recover file
> > > system volume by means of fsck tool. If we find a bad block for some
> > > user file then, again, it looks like an issue. Some file systems simply
> > > return "unrecovered read error". Another one, theoretically, is able
> > > to survive because of snapshots, for example. But, anyway, it will look
> > > like as Read-Only mount state and the user will need to resolve such
> > > trouble by hands.
> > 
> > As far as I can tell, all of these things remain the same. The goal here
> > isn't to survive more NVM badblocks than we would've before, and lost
> > data or lost metadata will continue to have the same consequences as
> > before, and will need the same recovery actions/intervention as before.
> > The goal is to make the failure model similar to what users expect
> > today, and as much as possible make recovery actions too similarly
> > intuitive.
> > 
> > > 
> > > If we are talking about discovering a bad block during write operation then,
> > > again, we are in trouble. Usually, we are using asynchronous model
> > > of write/flush operation. We are preparing the consistent state of all our
> > > metadata structures in the memory, at first. The flush operations for metadata
> > > and user data can be done in different times. And what should be done if we
> > > discover bad block for any piece of metadata or user data? Simple tracking of
> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> > > write some file's block successfully then we have two ways: (1) forget about
> > > this piece of data; (2) try to change the associated LBA for this piece of data.
> > > The operation of re-allocation LBA number for discovered bad block
> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
> > > that track the location of this part of file. And it sounds as practically
> > > impossible operation, for the case of LFS file system, for example.
> > > If we have trouble with flushing any part of metadata then it sounds as
> > > complete disaster for any file system.
> > 
> > Writes can get more complicated in certain cases. If it is a regular
> > page cache writeback, or any aligned write that goes through the block
> > driver, that is completely fine. The block driver will check that the
> > block was previously marked as bad, do a "clear poison" operation
> > (defined in the ACPI spec), which tells the firmware that the poison bit
> > is not OK to be cleared, and writes the new data. This also removes the
> > block from the badblocks list, and in this scheme, triggers a
> > notification to the filesystem that it too can remove the block from its
> > accounting. mmap writes and DAX can get more complicated, and at times
> > they will just trigger a SIGBUS, and there's no way around that.
> > 
> > > 
> > > Are you really sure that file system should process bad block issue?
> > > 
> > > >In contrast, today, the block driver checks against the whole block device
> > > > range for every IO. On encountering badblocks, the filesystem can
> > > > generate a better notification/error message that points the user to 
> > > > (file, offset) as opposed to the block driver, which can only provide
> > > > (block-device, sector).
> 
> <shrug> We can do the translation with the backref info...

Yes we should at least do that. I'm guessing this would happen in XFS
when it gets an EIO from an IO submission? The bio submission path in
the fs is probably not synchronous (correct?), but whenever it gets the
EIO, I'm guessing we just print a loud error message after doing the
backref lookup..

> 
> > > > 2. The block layer adds a notifier to badblock addition/removal
> > > > operations, which the filesystem subscribes to, and uses to maintain its
> > > > badblocks accounting. (This part is implemented as a proof of concept in
> > > > the RFC mentioned above [1]).
> > > 
> > > I am not sure that any bad block notification during/after IO operation
> > > is valuable for file system. Maybe, it could help if file system simply will
> > > know about bad block beforehand the operation of logical block allocation.
> > > But what subsystem will discover bad blocks before any IO operations?
> > > How file system will receive information or some bad block table?
> > 
> > The driver populates its badblocks lists whenever an Address Range Scrub
> > is started (also via ACPI methods). This is always done at
> > initialization time, so that it can build an in-memory representation of
> > the badblocks. Additionally, this can also be triggered manually. And
> > finally badblocks can also get populated for new latent errors when a
> > machine check exception occurs. All of these can trigger notification to
> > the file system without actual user reads happening.
> > 
> > > I am not convinced that suggested badblocks approach is really feasible.
> > > Also I am not sure that file system should see the bad blocks at all.
> > > Why hardware cannot manage this issue for us?
> > 
> > Hardware does manage the actual badblocks issue for us in the sense that
> > when it discovers a badblock it will do the remapping. But since this is
> > on the memory bus, and has different error signatures than applications
> > are used to, we want to make the error handling similar to the existing
> > storage model.
> 
> Yes please and thank you, to the "error handling similar to the existing
> storage model".  Even better if this just gets added to a layer
> underneath the fs so that IO to bad regions returns EIO. 8-)

This (if this just gets added to a layer underneath the fs so that IO to bad
regions returns EIO) already happens :)  See pmem_do_bvec() in
drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
read. I'm wondering if this can be improved..

> 
> (Sleeeeep...)
> 
> --D
> 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 21:35           ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 21:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Slava Dubeyko, lsf-pc, linux-nvdimm@lists.01.org, linux-block,
	Linux FS Devel, Viacheslav Dubeyko

On 01/16, Darrick J. Wong wrote:
> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > On 01/14, Slava Dubeyko wrote:
> > > 
> > > ---- Original Message ----
> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > > Sent: Jan 13, 2017 1:40 PM
> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > > To: lsf-pc@lists.linux-foundation.org
> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> > > 
> > > > The current implementation of badblocks, where we consult the badblocks
> > > > list for every IO in the block driver works, and is a last option
> > > > failsafe, but from a user perspective, it isn't the easiest interface to
> > > > work with.
> > > 
> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
> > > (physical sectors) table. I believe that this table was used for the case of
> > > floppy media. But, finally, this table becomes to be the completely obsolete
> > > artefact because mostly storage devices are reliably enough. Why do you need
> 
> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
> doesn't support(??) extents or 64-bit filesystems, and might just be a
> vestigial organ at this point.  XFS doesn't have anything to track bad
> blocks currently....
> 
> > > in exposing the bad blocks on the file system level?  Do you expect that next
> > > generation of NVM memory will be so unreliable that file system needs to manage
> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> > > from the bad block issue? 
> > > 
> > > Usually, we are using LBAs and it is the responsibility of storage device to map
> > > a bad physical block/page/sector into valid one. Do you mean that we have
> > > access to physical NVM memory address directly? But it looks like that we can
> > > have a "bad block" issue even we will access data into page cache's memory
> > > page (if we will use NVM memory for page cache, of course). So, what do you
> > > imply by "bad block" issue? 
> > 
> > We don't have direct physical access to the device's address space, in
> > the sense the device is still free to perform remapping of chunks of NVM
> > underneath us. The problem is that when a block or address range (as
> > small as a cache line) goes bad, the device maintains a poison bit for
> > every affected cache line. Behind the scenes, it may have already
> > remapped the range, but the cache line poison has to be kept so that
> > there is a notification to the user/owner of the data that something has
> > been lost. Since NVM is byte addressable memory sitting on the memory
> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
> > Compared to tradational storage where an app will get nice and friendly
> > (relatively speaking..) -EIOs. The whole badblocks implementation was
> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
> > locations, and short-circuit them with an EIO. If the driver doesn't
> > catch these, the reads will turn into a memory bus access, and the
> > poison will cause a SIGBUS.
> 
> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> look kind of like a traditional block device? :)

Yes, the thing that makes pmem look like a block device :) --
drivers/nvdimm/pmem.c

> 
> > This effort is to try and make this badblock checking smarter - and try
> > and reduce the penalty on every IO to a smaller range, which only the
> > filesystem can do.
> 
> Though... now that XFS merged the reverse mapping support, I've been
> wondering if there'll be a resubmission of the device errors callback?
> It still would be useful to be able to inform the user that part of
> their fs has gone bad, or, better yet, if the buffer is still in memory
> someplace else, just write it back out.
> 
> Or I suppose if we had some kind of raid1 set up between memories we
> could read one of the other copies and rewrite it into the failing
> region immediately.

Yes, that is kind of what I was hoping to accomplish via this
discussion. How much would filesystems want to be involved in this sort
of badblocks handling, if at all. I can refresh my patches that provide
the fs notification, but that's the easy bit, and a starting point.

> 
> > > > A while back, Dave Chinner had suggested a move towards smarter
> > > > handling, and I posted initial RFC patches [1], but since then the topic
> > > > hasn't really moved forward.
> > > > 
> > > > I'd like to propose and have a discussion about the following new
> > > > functionality:
> > > > 
> > > > 1. Filesystems develop a native representation of badblocks. For
> > > > example, in xfs, this would (presumably) be linked to the reverse
> > > > mapping btree. The filesystem representation has the potential to be�
> > > > more efficient than the block driver doing the check, as the fs can
> > > > check the IO happening on a file against just that file's range. 
> 
> OTOH that means we'd have to check /every/ file IO request against the
> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
> preferable just to let the underlying pmem driver throw an error at us.
> 
> (Or possibly just cache the bad extents in memory.)

Interesting - this would be a good discussion to have. My motivation for
this was the reasoning that the pmem driver has to check every single IO
against badblocks, and maybe the fs can do a better job. But if you
think the fs will actually be slower, we should try to somehow benchmark
that!

> 
> > > What do you mean by "file system can check the IO happening on a file"?
> > > Do you mean read or write operation? What's about metadata?
> > 
> > For the purpose described above, i.e. returning early EIOs when
> > possible, this will be limited to reads and metadata reads. If we're
> > about to do a metadata read, and realize the block(s) about to be read
> > are on the badblocks list, then we do the same thing as when we discover
> > other kinds of metadata corruption.
> 
> ...fail and shut down? :)
> 
> Actually, for metadata either we look at the xfs_bufs to see if it's in
> memory (XFS doesn't directly access metadata) and write it back out; or
> we could fire up the online repair tool to rebuild the metadata.

Agreed, I was just stressing that this scenario does not change from
status quo, and really recovering from corruption isn't the problem
we're trying to solve here :)

> 
> > > If we are talking about the discovering a bad block on read operation then
> > > rare modern file system is able to survive as for the case of metadata as
> > > for the case of user data. Let's imagine that we have really mature file
> > > system driver then what does it mean to encounter a bad block? The failure
> > > to read a logical block of some metadata (bad block) means that we are
> > > unable to extract some part of a metadata structure. From file system
> > > driver point of view, it looks like that our file system is corrupted, we need
> > > to stop the file system operations and, finally, to check and recover file
> > > system volume by means of fsck tool. If we find a bad block for some
> > > user file then, again, it looks like an issue. Some file systems simply
> > > return "unrecovered read error". Another one, theoretically, is able
> > > to survive because of snapshots, for example. But, anyway, it will look
> > > like as Read-Only mount state and the user will need to resolve such
> > > trouble by hands.
> > 
> > As far as I can tell, all of these things remain the same. The goal here
> > isn't to survive more NVM badblocks than we would've before, and lost
> > data or lost metadata will continue to have the same consequences as
> > before, and will need the same recovery actions/intervention as before.
> > The goal is to make the failure model similar to what users expect
> > today, and as much as possible make recovery actions too similarly
> > intuitive.
> > 
> > > 
> > > If we are talking about discovering a bad block during write operation then,
> > > again, we are in trouble. Usually, we are using asynchronous model
> > > of write/flush operation. We are preparing the consistent state of all our
> > > metadata structures in the memory, at first. The flush operations for metadata
> > > and user data can be done in different times. And what should be done if we
> > > discover bad block for any piece of metadata or user data? Simple tracking of
> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> > > write some file's block successfully then we have two ways: (1) forget about
> > > this piece of data; (2) try to change the associated LBA for this piece of data.
> > > The operation of re-allocation LBA number for discovered bad block
> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
> > > that track the location of this part of file. And it sounds as practically
> > > impossible operation, for the case of LFS file system, for example.
> > > If we have trouble with flushing any part of metadata then it sounds as
> > > complete disaster for any file system.
> > 
> > Writes can get more complicated in certain cases. If it is a regular
> > page cache writeback, or any aligned write that goes through the block
> > driver, that is completely fine. The block driver will check that the
> > block was previously marked as bad, do a "clear poison" operation
> > (defined in the ACPI spec), which tells the firmware that the poison bit
> > is not OK to be cleared, and writes the new data. This also removes the
> > block from the badblocks list, and in this scheme, triggers a
> > notification to the filesystem that it too can remove the block from its
> > accounting. mmap writes and DAX can get more complicated, and at times
> > they will just trigger a SIGBUS, and there's no way around that.
> > 
> > > 
> > > Are you really sure that file system should process bad block issue?
> > > 
> > > >In contrast, today, the block driver checks against the whole block device
> > > > range for every IO. On encountering badblocks, the filesystem can
> > > > generate a better notification/error message that points the user to�
> > > > (file, offset) as opposed to the block driver, which can only provide
> > > > (block-device, sector).
> 
> <shrug> We can do the translation with the backref info...

Yes we should at least do that. I'm guessing this would happen in XFS
when it gets an EIO from an IO submission? The bio submission path in
the fs is probably not synchronous (correct?), but whenever it gets the
EIO, I'm guessing we just print a loud error message after doing the
backref lookup..

> 
> > > > 2. The block layer adds a notifier to badblock addition/removal
> > > > operations, which the filesystem subscribes to, and uses to maintain its
> > > > badblocks accounting. (This part is implemented as a proof of concept in
> > > > the RFC mentioned above [1]).
> > > 
> > > I am not sure that any bad block notification during/after IO operation
> > > is valuable for file system. Maybe, it could help if file system simply will
> > > know about bad block beforehand the operation of logical block allocation.
> > > But what subsystem will discover bad blocks before any IO operations?
> > > How file system will receive information or some bad block table?
> > 
> > The driver populates its badblocks lists whenever an Address Range Scrub
> > is started (also via ACPI methods). This is always done at
> > initialization time, so that it can build an in-memory representation of
> > the badblocks. Additionally, this can also be triggered manually. And
> > finally badblocks can also get populated for new latent errors when a
> > machine check exception occurs. All of these can trigger notification to
> > the file system without actual user reads happening.
> > 
> > > I am not convinced that suggested badblocks approach is really feasible.
> > > Also I am not sure that file system should see the bad blocks at all.
> > > Why hardware cannot manage this issue for us?
> > 
> > Hardware does manage the actual badblocks issue for us in the sense that
> > when it discovers a badblock it will do the remapping. But since this is
> > on the memory bus, and has different error signatures than applications
> > are used to, we want to make the error handling similar to the existing
> > storage model.
> 
> Yes please and thank you, to the "error handling similar to the existing
> storage model".  Even better if this just gets added to a layer
> underneath the fs so that IO to bad regions returns EIO. 8-)

This (if this just gets added to a layer underneath the fs so that IO to bad
regions returns EIO) already happens :)  See pmem_do_bvec() in
drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
read. I'm wondering if this can be improved..

> 
> (Sleeeeep...)
> 
> --D
> 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 14:37           ` Jan Kara
@ 2017-01-17 22:14             ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 22:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, linux-nvdimm, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On 01/17, Jan Kara wrote:
> On Mon 16-01-17 02:27:52, Slava Dubeyko wrote:
> > 
> > -----Original Message-----
> > From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
> > Sent: Friday, January 13, 2017 4:49 PM
> > To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
> > Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
> > Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > 
> > <skipped>
> > 
> > > We don't have direct physical access to the device's address space, in the sense
> > > the device is still free to perform remapping of chunks of NVM underneath us.
> > > The problem is that when a block or address range (as small as a cache line) goes bad,
> > > the device maintains a poison bit for every affected cache line. Behind the scenes,
> > > it may have already remapped the range, but the cache line poison has to be kept so that
> > > there is a notification to the user/owner of the data that something has been lost.
> > > Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> > > cache line results in memory errors and SIGBUSes.
> > > Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > > The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> > > to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> > > the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> > >
> > > This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> > > on every IO to a smaller range, which only the filesystem can do.
> > 
> > I still slightly puzzled and I cannot understand why the situation looks
> > like a dead end.  As far as I can see, first of all, a NVM device is able
> > to use hardware-based LDPC, Reed-Solomon error correction or any other
> > fancy code. It could provide some error correction basis. Also it can
> > provide the way of estimation of BER value. So, if a NVM memory's address
> > range degrades gradually (during weeks or months) then, practically, it's
> > possible to remap and to migrate the affected address ranges in the
> > background. Otherwise, if a NVM memory so unreliable that address range
> > is able to degrade during seconds or minutes then who will use such NVM
> > memory?
> 
> Well, the situation with NVM is more like with DRAM AFAIU. It is quite
> reliable but given the size the probability *some* cell has degraded is
> quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
> when you try to read such cell. As Vishal wrote, the hardware does some
> background scrubbing and relocates stuff early if needed but nothing is 100%.
> 
> The reason why we play games with badblocks is to avoid those MCEs (i.e.,
> even trying to read the data we know that are bad). Even if it would be
> rare event, MCE may mean the machine just immediately reboots (although I
> find such platforms hardly usable with NVM then) and that is no good. And
> even on hardware platforms that allow for more graceful recovery from MCE
> it is asynchronous in its nature and our error handling
> around IO is all synchronous so it is difficult to join these two models
> together.
> 
> But I think it is a good question to ask whether we cannot improve on MCE
> handling instead of trying to avoid them and pushing around responsibility
> for handling bad blocks. Actually I thought someone was working on that.
> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
> well identified anyway so that we can consult the badblocks list) so that
> it MCE happens during these accesses, we note it somewhere and at the end
> of the magic block we will just pick up the errors and report them back?

Yes that is an interesting topic, how/if we can improve MCE handling
from a storage point of view. Tradationally it has been designed for the
memory use case, and what we have so far is adaptation of it for the
pmem/storage uses.

> 
> > OK. Let's imagine that NVM memory device hasn't any internal error
> > correction hardware-based scheme. Next level of defense could be any
> > erasure coding scheme on device driver level. So, any piece of data can
> > be protected by parities. And device driver will be responsible for
> > management of erasure coding scheme. It will increase latency of read
> > operation for the case of necessity to recover the affected memory page.
> > But, finally, all recovering activity will be behind the scene and file
> > system will be unaware about such recovering activity.
> 
> Note that your options are limited by the byte addressability and the
> direct CPU access to the memory. But even with these limitations it is not
> that error rate would but unusually high, it is just not zero.
>  
> > If you are going not to provide any erasure coding or error correction
> > scheme then it's really bad case. The fsck tool is not regular case tool
> > but the last resort. If you are going to rely on the fsck tool then
> > simply forget about using your hardware. Some file systems haven't the
> > fsck tool at all. Some guys really believe that file system has to work
> > without support of the fsck tool.  Even if a mature file system has
> > reliable fsck tool then the probability of file system recovering is very
> > low in the case of serious metadata corruptions. So, it means that you
> > are trying to suggest the technique when we will lose the whole file
> > system volumes on regular basis without any hope to recover data. Even if
> > file system has snapshots then, again, we haven't hope because we can
> > suffer from read error and for operation with snapshot.
> 
> I hope I have cleared out that this is not about higher error rate of
> persistent memory above. As a side note, XFS guys are working on automatic
> background scrubbing and online filesystem checking. Not specifically for
> persistent memory but simply because with growing size of the filesystem
> the likelihood of some problem somewhere is growing.

Your note on the online repair does raise another tangentially related
topic. Currently, if there are badblocks, writes via the bio submission
path will clear the error (if the hardware is able to remap the bad
locations). However, if the filesystem is mounted eith DAX, even
non-mmap operations - read() and write() will go through the dax paths
(dax_do_io()). We haven't found a good/agreeable way to perform
error-clearing in this case. So currently, if a dax mounted filesystem
has badblocks, the only way to clear those badblocks is to mount it
without DAX, and overwrite/zero the bad locations. This is a pretty
terrible user experience, and I'm hoping this can be solved in a better
way.

If the filesystem is 'badblocks-aware', perhaps it can redirect dax_io
to happen via the driver (bio submission) path for files/ranges with
known errors. This removes the ability to do free-form, unaligned IO
in the DAX path, but we gain a way to actually repair (online) a dax
filesystem, which currently doesn't exist.

>  
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 22:14             ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 22:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, linux-block, Linux FS Devel, lsf-pc,
	Viacheslav Dubeyko, linux-nvdimm

On 01/17, Jan Kara wrote:
> On Mon 16-01-17 02:27:52, Slava Dubeyko wrote:
> > 
> > -----Original Message-----
> > From: Vishal Verma [mailto:vishal.l.verma@intel.com] 
> > Sent: Friday, January 13, 2017 4:49 PM
> > To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
> > Cc: lsf-pc@lists.linux-foundation.org; linux-nvdimm@lists.01.org; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; Viacheslav Dubeyko <slava@dubeyko.com>
> > Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> > 
> > <skipped>
> > 
> > > We don't have direct physical access to the device's address space, in the sense
> > > the device is still free to perform remapping of chunks of NVM underneath us.
> > > The problem is that when a block or address range (as small as a cache line) goes bad,
> > > the device maintains a poison bit for every affected cache line. Behind the scenes,
> > > it may have already remapped the range, but the cache line poison has to be kept so that
> > > there is a notification to the user/owner of the data that something has been lost.
> > > Since NVM is byte addressable memory sitting on the memory bus, such a poisoned
> > > cache line results in memory errors and SIGBUSes.
> > > Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > > The whole badblocks implementation was done so that the driver can intercept IO (i.e. reads)
> > > to _known_ bad locations, and short-circuit them with an EIO. If the driver doesn't catch these,
> > > the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> > >
> > > This effort is to try and make this badblock checking smarter - and try and reduce the penalty
> > > on every IO to a smaller range, which only the filesystem can do.
> > 
> > I still slightly puzzled and I cannot understand why the situation looks
> > like a dead end.  As far as I can see, first of all, a NVM device is able
> > to use hardware-based LDPC, Reed-Solomon error correction or any other
> > fancy code. It could provide some error correction basis. Also it can
> > provide the way of estimation of BER value. So, if a NVM memory's address
> > range degrades gradually (during weeks or months) then, practically, it's
> > possible to remap and to migrate the affected address ranges in the
> > background. Otherwise, if a NVM memory so unreliable that address range
> > is able to degrade during seconds or minutes then who will use such NVM
> > memory?
> 
> Well, the situation with NVM is more like with DRAM AFAIU. It is quite
> reliable but given the size the probability *some* cell has degraded is
> quite high. And similar to DRAM you'll get MCE (Machine Check Exception)
> when you try to read such cell. As Vishal wrote, the hardware does some
> background scrubbing and relocates stuff early if needed but nothing is 100%.
> 
> The reason why we play games with badblocks is to avoid those MCEs (i.e.,
> even trying to read the data we know that are bad). Even if it would be
> rare event, MCE may mean the machine just immediately reboots (although I
> find such platforms hardly usable with NVM then) and that is no good. And
> even on hardware platforms that allow for more graceful recovery from MCE
> it is asynchronous in its nature and our error handling
> around IO is all synchronous so it is difficult to join these two models
> together.
> 
> But I think it is a good question to ask whether we cannot improve on MCE
> handling instead of trying to avoid them and pushing around responsibility
> for handling bad blocks. Actually I thought someone was working on that.
> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
> well identified anyway so that we can consult the badblocks list) so that
> it MCE happens during these accesses, we note it somewhere and at the end
> of the magic block we will just pick up the errors and report them back?

Yes that is an interesting topic, how/if we can improve MCE handling
from a storage point of view. Tradationally it has been designed for the
memory use case, and what we have so far is adaptation of it for the
pmem/storage uses.

> 
> > OK. Let's imagine that NVM memory device hasn't any internal error
> > correction hardware-based scheme. Next level of defense could be any
> > erasure coding scheme on device driver level. So, any piece of data can
> > be protected by parities. And device driver will be responsible for
> > management of erasure coding scheme. It will increase latency of read
> > operation for the case of necessity to recover the affected memory page.
> > But, finally, all recovering activity will be behind the scene and file
> > system will be unaware about such recovering activity.
> 
> Note that your options are limited by the byte addressability and the
> direct CPU access to the memory. But even with these limitations it is not
> that error rate would but unusually high, it is just not zero.
>  
> > If you are going not to provide any erasure coding or error correction
> > scheme then it's really bad case. The fsck tool is not regular case tool
> > but the last resort. If you are going to rely on the fsck tool then
> > simply forget about using your hardware. Some file systems haven't the
> > fsck tool at all. Some guys really believe that file system has to work
> > without support of the fsck tool.  Even if a mature file system has
> > reliable fsck tool then the probability of file system recovering is very
> > low in the case of serious metadata corruptions. So, it means that you
> > are trying to suggest the technique when we will lose the whole file
> > system volumes on regular basis without any hope to recover data. Even if
> > file system has snapshots then, again, we haven't hope because we can
> > suffer from read error and for operation with snapshot.
> 
> I hope I have cleared out that this is not about higher error rate of
> persistent memory above. As a side note, XFS guys are working on automatic
> background scrubbing and online filesystem checking. Not specifically for
> persistent memory but simply because with growing size of the filesystem
> the likelihood of some problem somewhere is growing.

Your note on the online repair does raise another tangentially related
topic. Currently, if there are badblocks, writes via the bio submission
path will clear the error (if the hardware is able to remap the bad
locations). However, if the filesystem is mounted eith DAX, even
non-mmap operations - read() and write() will go through the dax paths
(dax_do_io()). We haven't found a good/agreeable way to perform
error-clearing in this case. So currently, if a dax mounted filesystem
has badblocks, the only way to clear those badblocks is to mount it
without DAX, and overwrite/zero the bad locations. This is a pretty
terrible user experience, and I'm hoping this can be solved in a better
way.

If the filesystem is 'badblocks-aware', perhaps it can redirect dax_io
to happen via the driver (bio submission) path for files/ranges with
known errors. This removes the ability to do free-form, unaligned IO
in the DAX path, but we gain a way to actually repair (online) a dax
filesystem, which currently doesn't exist.

>  
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 21:35           ` Vishal Verma
@ 2017-01-17 22:15             ` Andiry Xu
  -1 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-17 22:15 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Hi,

On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/16, Darrick J. Wong wrote:
>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>> > On 01/14, Slava Dubeyko wrote:
>> > >
>> > > ---- Original Message ----
>> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>> > > Sent: Jan 13, 2017 1:40 PM
>> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>> > > To: lsf-pc@lists.linux-foundation.org
>> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>> > >
>> > > > The current implementation of badblocks, where we consult the badblocks
>> > > > list for every IO in the block driver works, and is a last option
>> > > > failsafe, but from a user perspective, it isn't the easiest interface to
>> > > > work with.
>> > >
>> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
>> > > (physical sectors) table. I believe that this table was used for the case of
>> > > floppy media. But, finally, this table becomes to be the completely obsolete
>> > > artefact because mostly storage devices are reliably enough. Why do you need
>>
>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>> vestigial organ at this point.  XFS doesn't have anything to track bad
>> blocks currently....
>>
>> > > in exposing the bad blocks on the file system level?  Do you expect that next
>> > > generation of NVM memory will be so unreliable that file system needs to manage
>> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>> > > from the bad block issue?
>> > >
>> > > Usually, we are using LBAs and it is the responsibility of storage device to map
>> > > a bad physical block/page/sector into valid one. Do you mean that we have
>> > > access to physical NVM memory address directly? But it looks like that we can
>> > > have a "bad block" issue even we will access data into page cache's memory
>> > > page (if we will use NVM memory for page cache, of course). So, what do you
>> > > imply by "bad block" issue?
>> >
>> > We don't have direct physical access to the device's address space, in
>> > the sense the device is still free to perform remapping of chunks of NVM
>> > underneath us. The problem is that when a block or address range (as
>> > small as a cache line) goes bad, the device maintains a poison bit for
>> > every affected cache line. Behind the scenes, it may have already
>> > remapped the range, but the cache line poison has to be kept so that
>> > there is a notification to the user/owner of the data that something has
>> > been lost. Since NVM is byte addressable memory sitting on the memory
>> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
>> > Compared to tradational storage where an app will get nice and friendly
>> > (relatively speaking..) -EIOs. The whole badblocks implementation was
>> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
>> > locations, and short-circuit them with an EIO. If the driver doesn't
>> > catch these, the reads will turn into a memory bus access, and the
>> > poison will cause a SIGBUS.
>>
>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>> look kind of like a traditional block device? :)
>
> Yes, the thing that makes pmem look like a block device :) --
> drivers/nvdimm/pmem.c
>
>>
>> > This effort is to try and make this badblock checking smarter - and try
>> > and reduce the penalty on every IO to a smaller range, which only the
>> > filesystem can do.
>>
>> Though... now that XFS merged the reverse mapping support, I've been
>> wondering if there'll be a resubmission of the device errors callback?
>> It still would be useful to be able to inform the user that part of
>> their fs has gone bad, or, better yet, if the buffer is still in memory
>> someplace else, just write it back out.
>>
>> Or I suppose if we had some kind of raid1 set up between memories we
>> could read one of the other copies and rewrite it into the failing
>> region immediately.
>
> Yes, that is kind of what I was hoping to accomplish via this
> discussion. How much would filesystems want to be involved in this sort
> of badblocks handling, if at all. I can refresh my patches that provide
> the fs notification, but that's the easy bit, and a starting point.
>

I have some questions. Why moving badblock handling to file system
level avoid the checking phase? In file system level for each I/O I
still have to check the badblock list, right? Do you mean during mount
it can go through the pmem device and locates all the data structures
mangled by badblocks and handle them accordingly, so that during
normal running the badblocks will never be accessed? Or, if there is
replicataion/snapshot support, use a copy to recover the badblocks?

How about operations bypass the file system, i.e. mmap?


>>
>> > > > A while back, Dave Chinner had suggested a move towards smarter
>> > > > handling, and I posted initial RFC patches [1], but since then the topic
>> > > > hasn't really moved forward.
>> > > >
>> > > > I'd like to propose and have a discussion about the following new
>> > > > functionality:
>> > > >
>> > > > 1. Filesystems develop a native representation of badblocks. For
>> > > > example, in xfs, this would (presumably) be linked to the reverse
>> > > > mapping btree. The filesystem representation has the potential to be
>> > > > more efficient than the block driver doing the check, as the fs can
>> > > > check the IO happening on a file against just that file's range.
>>
>> OTOH that means we'd have to check /every/ file IO request against the
>> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
>> preferable just to let the underlying pmem driver throw an error at us.
>>
>> (Or possibly just cache the bad extents in memory.)
>
> Interesting - this would be a good discussion to have. My motivation for
> this was the reasoning that the pmem driver has to check every single IO
> against badblocks, and maybe the fs can do a better job. But if you
> think the fs will actually be slower, we should try to somehow benchmark
> that!
>
>>
>> > > What do you mean by "file system can check the IO happening on a file"?
>> > > Do you mean read or write operation? What's about metadata?
>> >
>> > For the purpose described above, i.e. returning early EIOs when
>> > possible, this will be limited to reads and metadata reads. If we're
>> > about to do a metadata read, and realize the block(s) about to be read
>> > are on the badblocks list, then we do the same thing as when we discover
>> > other kinds of metadata corruption.
>>
>> ...fail and shut down? :)
>>
>> Actually, for metadata either we look at the xfs_bufs to see if it's in
>> memory (XFS doesn't directly access metadata) and write it back out; or
>> we could fire up the online repair tool to rebuild the metadata.
>
> Agreed, I was just stressing that this scenario does not change from
> status quo, and really recovering from corruption isn't the problem
> we're trying to solve here :)
>
>>
>> > > If we are talking about the discovering a bad block on read operation then
>> > > rare modern file system is able to survive as for the case of metadata as
>> > > for the case of user data. Let's imagine that we have really mature file
>> > > system driver then what does it mean to encounter a bad block? The failure
>> > > to read a logical block of some metadata (bad block) means that we are
>> > > unable to extract some part of a metadata structure. From file system
>> > > driver point of view, it looks like that our file system is corrupted, we need
>> > > to stop the file system operations and, finally, to check and recover file
>> > > system volume by means of fsck tool. If we find a bad block for some
>> > > user file then, again, it looks like an issue. Some file systems simply
>> > > return "unrecovered read error". Another one, theoretically, is able
>> > > to survive because of snapshots, for example. But, anyway, it will look
>> > > like as Read-Only mount state and the user will need to resolve such
>> > > trouble by hands.
>> >
>> > As far as I can tell, all of these things remain the same. The goal here
>> > isn't to survive more NVM badblocks than we would've before, and lost
>> > data or lost metadata will continue to have the same consequences as
>> > before, and will need the same recovery actions/intervention as before.
>> > The goal is to make the failure model similar to what users expect
>> > today, and as much as possible make recovery actions too similarly
>> > intuitive.
>> >
>> > >
>> > > If we are talking about discovering a bad block during write operation then,
>> > > again, we are in trouble. Usually, we are using asynchronous model
>> > > of write/flush operation. We are preparing the consistent state of all our
>> > > metadata structures in the memory, at first. The flush operations for metadata
>> > > and user data can be done in different times. And what should be done if we
>> > > discover bad block for any piece of metadata or user data? Simple tracking of
>> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
>> > > write some file's block successfully then we have two ways: (1) forget about
>> > > this piece of data; (2) try to change the associated LBA for this piece of data.
>> > > The operation of re-allocation LBA number for discovered bad block
>> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
>> > > that track the location of this part of file. And it sounds as practically
>> > > impossible operation, for the case of LFS file system, for example.
>> > > If we have trouble with flushing any part of metadata then it sounds as
>> > > complete disaster for any file system.
>> >
>> > Writes can get more complicated in certain cases. If it is a regular
>> > page cache writeback, or any aligned write that goes through the block
>> > driver, that is completely fine. The block driver will check that the
>> > block was previously marked as bad, do a "clear poison" operation
>> > (defined in the ACPI spec), which tells the firmware that the poison bit
>> > is not OK to be cleared, and writes the new data. This also removes the
>> > block from the badblocks list, and in this scheme, triggers a
>> > notification to the filesystem that it too can remove the block from its
>> > accounting. mmap writes and DAX can get more complicated, and at times
>> > they will just trigger a SIGBUS, and there's no way around that.
>> >
>> > >
>> > > Are you really sure that file system should process bad block issue?
>> > >
>> > > >In contrast, today, the block driver checks against the whole block device
>> > > > range for every IO. On encountering badblocks, the filesystem can
>> > > > generate a better notification/error message that points the user to
>> > > > (file, offset) as opposed to the block driver, which can only provide
>> > > > (block-device, sector).
>>
>> <shrug> We can do the translation with the backref info...
>
> Yes we should at least do that. I'm guessing this would happen in XFS
> when it gets an EIO from an IO submission? The bio submission path in
> the fs is probably not synchronous (correct?), but whenever it gets the
> EIO, I'm guessing we just print a loud error message after doing the
> backref lookup..
>
>>
>> > > > 2. The block layer adds a notifier to badblock addition/removal
>> > > > operations, which the filesystem subscribes to, and uses to maintain its
>> > > > badblocks accounting. (This part is implemented as a proof of concept in
>> > > > the RFC mentioned above [1]).
>> > >
>> > > I am not sure that any bad block notification during/after IO operation
>> > > is valuable for file system. Maybe, it could help if file system simply will
>> > > know about bad block beforehand the operation of logical block allocation.
>> > > But what subsystem will discover bad blocks before any IO operations?
>> > > How file system will receive information or some bad block table?
>> >
>> > The driver populates its badblocks lists whenever an Address Range Scrub
>> > is started (also via ACPI methods). This is always done at
>> > initialization time, so that it can build an in-memory representation of
>> > the badblocks. Additionally, this can also be triggered manually. And
>> > finally badblocks can also get populated for new latent errors when a
>> > machine check exception occurs. All of these can trigger notification to
>> > the file system without actual user reads happening.
>> >
>> > > I am not convinced that suggested badblocks approach is really feasible.
>> > > Also I am not sure that file system should see the bad blocks at all.
>> > > Why hardware cannot manage this issue for us?
>> >
>> > Hardware does manage the actual badblocks issue for us in the sense that
>> > when it discovers a badblock it will do the remapping. But since this is
>> > on the memory bus, and has different error signatures than applications
>> > are used to, we want to make the error handling similar to the existing
>> > storage model.
>>
>> Yes please and thank you, to the "error handling similar to the existing
>> storage model".  Even better if this just gets added to a layer
>> underneath the fs so that IO to bad regions returns EIO. 8-)
>
> This (if this just gets added to a layer underneath the fs so that IO to bad
> regions returns EIO) already happens :)  See pmem_do_bvec() in
> drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
> read. I'm wondering if this can be improved..
>

The pmem_do_bvec() read logic is like this:

pmem_do_bvec()
    if (is_bad_pmem())
        return -EIO;
    else
        memcpy_from_pmem();

Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
that even if a block is not in the badblock list, it still can be bad
and causes MCE? Does the badblock list get changed during file system
running? If that is the case, should the file system get a
notification when it gets changed? If a block is good when I first
read it, can I still trust it to be good for the second access?

Thanks,
Andiry

>>
>> (Sleeeeep...)
>>
>> --D
>>
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 22:15             ` Andiry Xu
  0 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-17 22:15 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

Hi,

On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/16, Darrick J. Wong wrote:
>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>> > On 01/14, Slava Dubeyko wrote:
>> > >
>> > > ---- Original Message ----
>> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>> > > Sent: Jan 13, 2017 1:40 PM
>> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>> > > To: lsf-pc@lists.linux-foundation.org
>> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>> > >
>> > > > The current implementation of badblocks, where we consult the badblocks
>> > > > list for every IO in the block driver works, and is a last option
>> > > > failsafe, but from a user perspective, it isn't the easiest interface to
>> > > > work with.
>> > >
>> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
>> > > (physical sectors) table. I believe that this table was used for the case of
>> > > floppy media. But, finally, this table becomes to be the completely obsolete
>> > > artefact because mostly storage devices are reliably enough. Why do you need
>>
>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>> vestigial organ at this point.  XFS doesn't have anything to track bad
>> blocks currently....
>>
>> > > in exposing the bad blocks on the file system level?  Do you expect that next
>> > > generation of NVM memory will be so unreliable that file system needs to manage
>> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>> > > from the bad block issue?
>> > >
>> > > Usually, we are using LBAs and it is the responsibility of storage device to map
>> > > a bad physical block/page/sector into valid one. Do you mean that we have
>> > > access to physical NVM memory address directly? But it looks like that we can
>> > > have a "bad block" issue even we will access data into page cache's memory
>> > > page (if we will use NVM memory for page cache, of course). So, what do you
>> > > imply by "bad block" issue?
>> >
>> > We don't have direct physical access to the device's address space, in
>> > the sense the device is still free to perform remapping of chunks of NVM
>> > underneath us. The problem is that when a block or address range (as
>> > small as a cache line) goes bad, the device maintains a poison bit for
>> > every affected cache line. Behind the scenes, it may have already
>> > remapped the range, but the cache line poison has to be kept so that
>> > there is a notification to the user/owner of the data that something has
>> > been lost. Since NVM is byte addressable memory sitting on the memory
>> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
>> > Compared to tradational storage where an app will get nice and friendly
>> > (relatively speaking..) -EIOs. The whole badblocks implementation was
>> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
>> > locations, and short-circuit them with an EIO. If the driver doesn't
>> > catch these, the reads will turn into a memory bus access, and the
>> > poison will cause a SIGBUS.
>>
>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>> look kind of like a traditional block device? :)
>
> Yes, the thing that makes pmem look like a block device :) --
> drivers/nvdimm/pmem.c
>
>>
>> > This effort is to try and make this badblock checking smarter - and try
>> > and reduce the penalty on every IO to a smaller range, which only the
>> > filesystem can do.
>>
>> Though... now that XFS merged the reverse mapping support, I've been
>> wondering if there'll be a resubmission of the device errors callback?
>> It still would be useful to be able to inform the user that part of
>> their fs has gone bad, or, better yet, if the buffer is still in memory
>> someplace else, just write it back out.
>>
>> Or I suppose if we had some kind of raid1 set up between memories we
>> could read one of the other copies and rewrite it into the failing
>> region immediately.
>
> Yes, that is kind of what I was hoping to accomplish via this
> discussion. How much would filesystems want to be involved in this sort
> of badblocks handling, if at all. I can refresh my patches that provide
> the fs notification, but that's the easy bit, and a starting point.
>

I have some questions. Why moving badblock handling to file system
level avoid the checking phase? In file system level for each I/O I
still have to check the badblock list, right? Do you mean during mount
it can go through the pmem device and locates all the data structures
mangled by badblocks and handle them accordingly, so that during
normal running the badblocks will never be accessed? Or, if there is
replicataion/snapshot support, use a copy to recover the badblocks?

How about operations bypass the file system, i.e. mmap?


>>
>> > > > A while back, Dave Chinner had suggested a move towards smarter
>> > > > handling, and I posted initial RFC patches [1], but since then the topic
>> > > > hasn't really moved forward.
>> > > >
>> > > > I'd like to propose and have a discussion about the following new
>> > > > functionality:
>> > > >
>> > > > 1. Filesystems develop a native representation of badblocks. For
>> > > > example, in xfs, this would (presumably) be linked to the reverse
>> > > > mapping btree. The filesystem representation has the potential to be
>> > > > more efficient than the block driver doing the check, as the fs can
>> > > > check the IO happening on a file against just that file's range.
>>
>> OTOH that means we'd have to check /every/ file IO request against the
>> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
>> preferable just to let the underlying pmem driver throw an error at us.
>>
>> (Or possibly just cache the bad extents in memory.)
>
> Interesting - this would be a good discussion to have. My motivation for
> this was the reasoning that the pmem driver has to check every single IO
> against badblocks, and maybe the fs can do a better job. But if you
> think the fs will actually be slower, we should try to somehow benchmark
> that!
>
>>
>> > > What do you mean by "file system can check the IO happening on a file"?
>> > > Do you mean read or write operation? What's about metadata?
>> >
>> > For the purpose described above, i.e. returning early EIOs when
>> > possible, this will be limited to reads and metadata reads. If we're
>> > about to do a metadata read, and realize the block(s) about to be read
>> > are on the badblocks list, then we do the same thing as when we discover
>> > other kinds of metadata corruption.
>>
>> ...fail and shut down? :)
>>
>> Actually, for metadata either we look at the xfs_bufs to see if it's in
>> memory (XFS doesn't directly access metadata) and write it back out; or
>> we could fire up the online repair tool to rebuild the metadata.
>
> Agreed, I was just stressing that this scenario does not change from
> status quo, and really recovering from corruption isn't the problem
> we're trying to solve here :)
>
>>
>> > > If we are talking about the discovering a bad block on read operation then
>> > > rare modern file system is able to survive as for the case of metadata as
>> > > for the case of user data. Let's imagine that we have really mature file
>> > > system driver then what does it mean to encounter a bad block? The failure
>> > > to read a logical block of some metadata (bad block) means that we are
>> > > unable to extract some part of a metadata structure. From file system
>> > > driver point of view, it looks like that our file system is corrupted, we need
>> > > to stop the file system operations and, finally, to check and recover file
>> > > system volume by means of fsck tool. If we find a bad block for some
>> > > user file then, again, it looks like an issue. Some file systems simply
>> > > return "unrecovered read error". Another one, theoretically, is able
>> > > to survive because of snapshots, for example. But, anyway, it will look
>> > > like as Read-Only mount state and the user will need to resolve such
>> > > trouble by hands.
>> >
>> > As far as I can tell, all of these things remain the same. The goal here
>> > isn't to survive more NVM badblocks than we would've before, and lost
>> > data or lost metadata will continue to have the same consequences as
>> > before, and will need the same recovery actions/intervention as before.
>> > The goal is to make the failure model similar to what users expect
>> > today, and as much as possible make recovery actions too similarly
>> > intuitive.
>> >
>> > >
>> > > If we are talking about discovering a bad block during write operation then,
>> > > again, we are in trouble. Usually, we are using asynchronous model
>> > > of write/flush operation. We are preparing the consistent state of all our
>> > > metadata structures in the memory, at first. The flush operations for metadata
>> > > and user data can be done in different times. And what should be done if we
>> > > discover bad block for any piece of metadata or user data? Simple tracking of
>> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
>> > > write some file's block successfully then we have two ways: (1) forget about
>> > > this piece of data; (2) try to change the associated LBA for this piece of data.
>> > > The operation of re-allocation LBA number for discovered bad block
>> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
>> > > that track the location of this part of file. And it sounds as practically
>> > > impossible operation, for the case of LFS file system, for example.
>> > > If we have trouble with flushing any part of metadata then it sounds as
>> > > complete disaster for any file system.
>> >
>> > Writes can get more complicated in certain cases. If it is a regular
>> > page cache writeback, or any aligned write that goes through the block
>> > driver, that is completely fine. The block driver will check that the
>> > block was previously marked as bad, do a "clear poison" operation
>> > (defined in the ACPI spec), which tells the firmware that the poison bit
>> > is not OK to be cleared, and writes the new data. This also removes the
>> > block from the badblocks list, and in this scheme, triggers a
>> > notification to the filesystem that it too can remove the block from its
>> > accounting. mmap writes and DAX can get more complicated, and at times
>> > they will just trigger a SIGBUS, and there's no way around that.
>> >
>> > >
>> > > Are you really sure that file system should process bad block issue?
>> > >
>> > > >In contrast, today, the block driver checks against the whole block device
>> > > > range for every IO. On encountering badblocks, the filesystem can
>> > > > generate a better notification/error message that points the user to
>> > > > (file, offset) as opposed to the block driver, which can only provide
>> > > > (block-device, sector).
>>
>> <shrug> We can do the translation with the backref info...
>
> Yes we should at least do that. I'm guessing this would happen in XFS
> when it gets an EIO from an IO submission? The bio submission path in
> the fs is probably not synchronous (correct?), but whenever it gets the
> EIO, I'm guessing we just print a loud error message after doing the
> backref lookup..
>
>>
>> > > > 2. The block layer adds a notifier to badblock addition/removal
>> > > > operations, which the filesystem subscribes to, and uses to maintain its
>> > > > badblocks accounting. (This part is implemented as a proof of concept in
>> > > > the RFC mentioned above [1]).
>> > >
>> > > I am not sure that any bad block notification during/after IO operation
>> > > is valuable for file system. Maybe, it could help if file system simply will
>> > > know about bad block beforehand the operation of logical block allocation.
>> > > But what subsystem will discover bad blocks before any IO operations?
>> > > How file system will receive information or some bad block table?
>> >
>> > The driver populates its badblocks lists whenever an Address Range Scrub
>> > is started (also via ACPI methods). This is always done at
>> > initialization time, so that it can build an in-memory representation of
>> > the badblocks. Additionally, this can also be triggered manually. And
>> > finally badblocks can also get populated for new latent errors when a
>> > machine check exception occurs. All of these can trigger notification to
>> > the file system without actual user reads happening.
>> >
>> > > I am not convinced that suggested badblocks approach is really feasible.
>> > > Also I am not sure that file system should see the bad blocks at all.
>> > > Why hardware cannot manage this issue for us?
>> >
>> > Hardware does manage the actual badblocks issue for us in the sense that
>> > when it discovers a badblock it will do the remapping. But since this is
>> > on the memory bus, and has different error signatures than applications
>> > are used to, we want to make the error handling similar to the existing
>> > storage model.
>>
>> Yes please and thank you, to the "error handling similar to the existing
>> storage model".  Even better if this just gets added to a layer
>> underneath the fs so that IO to bad regions returns EIO. 8-)
>
> This (if this just gets added to a layer underneath the fs so that IO to bad
> regions returns EIO) already happens :)  See pmem_do_bvec() in
> drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
> read. I'm wondering if this can be improved..
>

The pmem_do_bvec() read logic is like this:

pmem_do_bvec()
    if (is_bad_pmem())
        return -EIO;
    else
        memcpy_from_pmem();

Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
that even if a block is not in the badblock list, it still can be bad
and causes MCE? Does the badblock list get changed during file system
running? If that is the case, should the file system get a
notification when it gets changed? If a block is good when I first
read it, can I still trust it to be good for the second access?

Thanks,
Andiry

>>
>> (Sleeeeep...)
>>
>> --D
>>
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:15             ` Andiry Xu
@ 2017-01-17 22:37               ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 22:37 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On 01/17, Andiry Xu wrote:
> Hi,
> 
> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> > On 01/16, Darrick J. Wong wrote:
> >> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> >> > On 01/14, Slava Dubeyko wrote:
> >> > >
> >> > > ---- Original Message ----
> >> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> >> > > Sent: Jan 13, 2017 1:40 PM
> >> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> >> > > To: lsf-pc@lists.linux-foundation.org
> >> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> >> > >
> >> > > > The current implementation of badblocks, where we consult the badblocks
> >> > > > list for every IO in the block driver works, and is a last option
> >> > > > failsafe, but from a user perspective, it isn't the easiest interface to
> >> > > > work with.
> >> > >
> >> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
> >> > > (physical sectors) table. I believe that this table was used for the case of
> >> > > floppy media. But, finally, this table becomes to be the completely obsolete
> >> > > artefact because mostly storage devices are reliably enough. Why do you need
> >>
> >> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
> >> doesn't support(??) extents or 64-bit filesystems, and might just be a
> >> vestigial organ at this point.  XFS doesn't have anything to track bad
> >> blocks currently....
> >>
> >> > > in exposing the bad blocks on the file system level?  Do you expect that next
> >> > > generation of NVM memory will be so unreliable that file system needs to manage
> >> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> >> > > from the bad block issue?
> >> > >
> >> > > Usually, we are using LBAs and it is the responsibility of storage device to map
> >> > > a bad physical block/page/sector into valid one. Do you mean that we have
> >> > > access to physical NVM memory address directly? But it looks like that we can
> >> > > have a "bad block" issue even we will access data into page cache's memory
> >> > > page (if we will use NVM memory for page cache, of course). So, what do you
> >> > > imply by "bad block" issue?
> >> >
> >> > We don't have direct physical access to the device's address space, in
> >> > the sense the device is still free to perform remapping of chunks of NVM
> >> > underneath us. The problem is that when a block or address range (as
> >> > small as a cache line) goes bad, the device maintains a poison bit for
> >> > every affected cache line. Behind the scenes, it may have already
> >> > remapped the range, but the cache line poison has to be kept so that
> >> > there is a notification to the user/owner of the data that something has
> >> > been lost. Since NVM is byte addressable memory sitting on the memory
> >> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
> >> > Compared to tradational storage where an app will get nice and friendly
> >> > (relatively speaking..) -EIOs. The whole badblocks implementation was
> >> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
> >> > locations, and short-circuit them with an EIO. If the driver doesn't
> >> > catch these, the reads will turn into a memory bus access, and the
> >> > poison will cause a SIGBUS.
> >>
> >> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> >> look kind of like a traditional block device? :)
> >
> > Yes, the thing that makes pmem look like a block device :) --
> > drivers/nvdimm/pmem.c
> >
> >>
> >> > This effort is to try and make this badblock checking smarter - and try
> >> > and reduce the penalty on every IO to a smaller range, which only the
> >> > filesystem can do.
> >>
> >> Though... now that XFS merged the reverse mapping support, I've been
> >> wondering if there'll be a resubmission of the device errors callback?
> >> It still would be useful to be able to inform the user that part of
> >> their fs has gone bad, or, better yet, if the buffer is still in memory
> >> someplace else, just write it back out.
> >>
> >> Or I suppose if we had some kind of raid1 set up between memories we
> >> could read one of the other copies and rewrite it into the failing
> >> region immediately.
> >
> > Yes, that is kind of what I was hoping to accomplish via this
> > discussion. How much would filesystems want to be involved in this sort
> > of badblocks handling, if at all. I can refresh my patches that provide
> > the fs notification, but that's the easy bit, and a starting point.
> >
> 
> I have some questions. Why moving badblock handling to file system
> level avoid the checking phase? In file system level for each I/O I
> still have to check the badblock list, right? Do you mean during mount
> it can go through the pmem device and locates all the data structures
> mangled by badblocks and handle them accordingly, so that during
> normal running the badblocks will never be accessed? Or, if there is
> replicataion/snapshot support, use a copy to recover the badblocks?
> 
> How about operations bypass the file system, i.e. mmap?

Hi Andiry,

I do mean that in the filesystem, for every IO, the badblocks will be
checked. Currently, the pmem driver does this, and the hope is that the
filesystem can do a better job at it. The driver unconditionally checks
every IO for badblocks on the whole device. Depending on how the
badblocks are represented in the filesystem, we might be able to quickly
tell if a file/range has existing badblocks, and error out the IO
accordingly.

At mount the the fs would read the existing badblocks on the block
device, and build its own representation of them. Then during normal
use, if the underlying badblocks change, the fs would get a notification
that would allow it to also update its own representation.

Yes, if there is replication etc support in the filesystem, we could try
to recover using that, but I haven't put much thought in that direction.

Like I said in a previous reply, mmap can be a tricky case, and other
than handling the machine check exception, there may not be anything
else we can do.. 
If the range we're faulting on has known errors in badblocks, the fault
will fail with SIGBUS (see where pmem_direct_access() fails due to
badblocks). For latent errors that are not known in badblocks, if the
platform has MCE recovery, there is an MCE handler for pmem currently,
that will add that address to badblocks. If MCE recovery is absent, then
the system will crash/reboot, and the next time the driver populates
badblocks, that address will appear in it.
> 
> >>
> >> > > > A while back, Dave Chinner had suggested a move towards smarter
> >> > > > handling, and I posted initial RFC patches [1], but since then the topic
> >> > > > hasn't really moved forward.
> >> > > >
> >> > > > I'd like to propose and have a discussion about the following new
> >> > > > functionality:
> >> > > >
> >> > > > 1. Filesystems develop a native representation of badblocks. For
> >> > > > example, in xfs, this would (presumably) be linked to the reverse
> >> > > > mapping btree. The filesystem representation has the potential to be
> >> > > > more efficient than the block driver doing the check, as the fs can
> >> > > > check the IO happening on a file against just that file's range.
> >>
> >> OTOH that means we'd have to check /every/ file IO request against the
> >> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
> >> preferable just to let the underlying pmem driver throw an error at us.
> >>
> >> (Or possibly just cache the bad extents in memory.)
> >
> > Interesting - this would be a good discussion to have. My motivation for
> > this was the reasoning that the pmem driver has to check every single IO
> > against badblocks, and maybe the fs can do a better job. But if you
> > think the fs will actually be slower, we should try to somehow benchmark
> > that!
> >
> >>
> >> > > What do you mean by "file system can check the IO happening on a file"?
> >> > > Do you mean read or write operation? What's about metadata?
> >> >
> >> > For the purpose described above, i.e. returning early EIOs when
> >> > possible, this will be limited to reads and metadata reads. If we're
> >> > about to do a metadata read, and realize the block(s) about to be read
> >> > are on the badblocks list, then we do the same thing as when we discover
> >> > other kinds of metadata corruption.
> >>
> >> ...fail and shut down? :)
> >>
> >> Actually, for metadata either we look at the xfs_bufs to see if it's in
> >> memory (XFS doesn't directly access metadata) and write it back out; or
> >> we could fire up the online repair tool to rebuild the metadata.
> >
> > Agreed, I was just stressing that this scenario does not change from
> > status quo, and really recovering from corruption isn't the problem
> > we're trying to solve here :)
> >
> >>
> >> > > If we are talking about the discovering a bad block on read operation then
> >> > > rare modern file system is able to survive as for the case of metadata as
> >> > > for the case of user data. Let's imagine that we have really mature file
> >> > > system driver then what does it mean to encounter a bad block? The failure
> >> > > to read a logical block of some metadata (bad block) means that we are
> >> > > unable to extract some part of a metadata structure. From file system
> >> > > driver point of view, it looks like that our file system is corrupted, we need
> >> > > to stop the file system operations and, finally, to check and recover file
> >> > > system volume by means of fsck tool. If we find a bad block for some
> >> > > user file then, again, it looks like an issue. Some file systems simply
> >> > > return "unrecovered read error". Another one, theoretically, is able
> >> > > to survive because of snapshots, for example. But, anyway, it will look
> >> > > like as Read-Only mount state and the user will need to resolve such
> >> > > trouble by hands.
> >> >
> >> > As far as I can tell, all of these things remain the same. The goal here
> >> > isn't to survive more NVM badblocks than we would've before, and lost
> >> > data or lost metadata will continue to have the same consequences as
> >> > before, and will need the same recovery actions/intervention as before.
> >> > The goal is to make the failure model similar to what users expect
> >> > today, and as much as possible make recovery actions too similarly
> >> > intuitive.
> >> >
> >> > >
> >> > > If we are talking about discovering a bad block during write operation then,
> >> > > again, we are in trouble. Usually, we are using asynchronous model
> >> > > of write/flush operation. We are preparing the consistent state of all our
> >> > > metadata structures in the memory, at first. The flush operations for metadata
> >> > > and user data can be done in different times. And what should be done if we
> >> > > discover bad block for any piece of metadata or user data? Simple tracking of
> >> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> >> > > write some file's block successfully then we have two ways: (1) forget about
> >> > > this piece of data; (2) try to change the associated LBA for this piece of data.
> >> > > The operation of re-allocation LBA number for discovered bad block
> >> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
> >> > > that track the location of this part of file. And it sounds as practically
> >> > > impossible operation, for the case of LFS file system, for example.
> >> > > If we have trouble with flushing any part of metadata then it sounds as
> >> > > complete disaster for any file system.
> >> >
> >> > Writes can get more complicated in certain cases. If it is a regular
> >> > page cache writeback, or any aligned write that goes through the block
> >> > driver, that is completely fine. The block driver will check that the
> >> > block was previously marked as bad, do a "clear poison" operation
> >> > (defined in the ACPI spec), which tells the firmware that the poison bit
> >> > is not OK to be cleared, and writes the new data. This also removes the
> >> > block from the badblocks list, and in this scheme, triggers a
> >> > notification to the filesystem that it too can remove the block from its
> >> > accounting. mmap writes and DAX can get more complicated, and at times
> >> > they will just trigger a SIGBUS, and there's no way around that.
> >> >
> >> > >
> >> > > Are you really sure that file system should process bad block issue?
> >> > >
> >> > > >In contrast, today, the block driver checks against the whole block device
> >> > > > range for every IO. On encountering badblocks, the filesystem can
> >> > > > generate a better notification/error message that points the user to
> >> > > > (file, offset) as opposed to the block driver, which can only provide
> >> > > > (block-device, sector).
> >>
> >> <shrug> We can do the translation with the backref info...
> >
> > Yes we should at least do that. I'm guessing this would happen in XFS
> > when it gets an EIO from an IO submission? The bio submission path in
> > the fs is probably not synchronous (correct?), but whenever it gets the
> > EIO, I'm guessing we just print a loud error message after doing the
> > backref lookup..
> >
> >>
> >> > > > 2. The block layer adds a notifier to badblock addition/removal
> >> > > > operations, which the filesystem subscribes to, and uses to maintain its
> >> > > > badblocks accounting. (This part is implemented as a proof of concept in
> >> > > > the RFC mentioned above [1]).
> >> > >
> >> > > I am not sure that any bad block notification during/after IO operation
> >> > > is valuable for file system. Maybe, it could help if file system simply will
> >> > > know about bad block beforehand the operation of logical block allocation.
> >> > > But what subsystem will discover bad blocks before any IO operations?
> >> > > How file system will receive information or some bad block table?
> >> >
> >> > The driver populates its badblocks lists whenever an Address Range Scrub
> >> > is started (also via ACPI methods). This is always done at
> >> > initialization time, so that it can build an in-memory representation of
> >> > the badblocks. Additionally, this can also be triggered manually. And
> >> > finally badblocks can also get populated for new latent errors when a
> >> > machine check exception occurs. All of these can trigger notification to
> >> > the file system without actual user reads happening.
> >> >
> >> > > I am not convinced that suggested badblocks approach is really feasible.
> >> > > Also I am not sure that file system should see the bad blocks at all.
> >> > > Why hardware cannot manage this issue for us?
> >> >
> >> > Hardware does manage the actual badblocks issue for us in the sense that
> >> > when it discovers a badblock it will do the remapping. But since this is
> >> > on the memory bus, and has different error signatures than applications
> >> > are used to, we want to make the error handling similar to the existing
> >> > storage model.
> >>
> >> Yes please and thank you, to the "error handling similar to the existing
> >> storage model".  Even better if this just gets added to a layer
> >> underneath the fs so that IO to bad regions returns EIO. 8-)
> >
> > This (if this just gets added to a layer underneath the fs so that IO to bad
> > regions returns EIO) already happens :)  See pmem_do_bvec() in
> > drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
> > read. I'm wondering if this can be improved..
> >
> 
> The pmem_do_bvec() read logic is like this:
> 
> pmem_do_bvec()
>     if (is_bad_pmem())
>         return -EIO;
>     else
>         memcpy_from_pmem();
> 
> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> that even if a block is not in the badblock list, it still can be bad
> and causes MCE? Does the badblock list get changed during file system
> running? If that is the case, should the file system get a
> notification when it gets changed? If a block is good when I first
> read it, can I still trust it to be good for the second access?

Yes, if a block is not in the badblocks list, it can still cause an
MCE. This is the latent error case I described above. For a simple read()
via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
an MCE is inevitable.

Yes the badblocks list may change while a filesystem is running. The RFC
patches[1] I linked to add a notification for the filesystem when this
happens.

No, if the media, for some reason, 'dvelops' a bad cell, a second
consecutive read does have a chance of being bad. Once a location has
been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
been called to mark it as clean.

[1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html


> 
> Thanks,
> Andiry
> 
> >>
> >> (Sleeeeep...)
> >>
> >> --D
> >>
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 22:37               ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 22:37 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

On 01/17, Andiry Xu wrote:
> Hi,
> 
> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> > On 01/16, Darrick J. Wong wrote:
> >> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> >> > On 01/14, Slava Dubeyko wrote:
> >> > >
> >> > > ---- Original Message ----
> >> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
> >> > > Sent: Jan 13, 2017 1:40 PM
> >> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> >> > > To: lsf-pc@lists.linux-foundation.org
> >> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
> >> > >
> >> > > > The current implementation of badblocks, where we consult the badblocks
> >> > > > list for every IO in the block driver works, and is a last option
> >> > > > failsafe, but from a user perspective, it isn't the easiest interface to
> >> > > > work with.
> >> > >
> >> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
> >> > > (physical sectors) table. I believe that this table was used for the case of
> >> > > floppy media. But, finally, this table becomes to be the completely obsolete
> >> > > artefact because mostly storage devices are reliably enough. Why do you need
> >>
> >> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
> >> doesn't support(??) extents or 64-bit filesystems, and might just be a
> >> vestigial organ at this point.  XFS doesn't have anything to track bad
> >> blocks currently....
> >>
> >> > > in exposing the bad blocks on the file system level?  Do you expect that next
> >> > > generation of NVM memory will be so unreliable that file system needs to manage
> >> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
> >> > > from the bad block issue?
> >> > >
> >> > > Usually, we are using LBAs and it is the responsibility of storage device to map
> >> > > a bad physical block/page/sector into valid one. Do you mean that we have
> >> > > access to physical NVM memory address directly? But it looks like that we can
> >> > > have a "bad block" issue even we will access data into page cache's memory
> >> > > page (if we will use NVM memory for page cache, of course). So, what do you
> >> > > imply by "bad block" issue?
> >> >
> >> > We don't have direct physical access to the device's address space, in
> >> > the sense the device is still free to perform remapping of chunks of NVM
> >> > underneath us. The problem is that when a block or address range (as
> >> > small as a cache line) goes bad, the device maintains a poison bit for
> >> > every affected cache line. Behind the scenes, it may have already
> >> > remapped the range, but the cache line poison has to be kept so that
> >> > there is a notification to the user/owner of the data that something has
> >> > been lost. Since NVM is byte addressable memory sitting on the memory
> >> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
> >> > Compared to tradational storage where an app will get nice and friendly
> >> > (relatively speaking..) -EIOs. The whole badblocks implementation was
> >> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
> >> > locations, and short-circuit them with an EIO. If the driver doesn't
> >> > catch these, the reads will turn into a memory bus access, and the
> >> > poison will cause a SIGBUS.
> >>
> >> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> >> look kind of like a traditional block device? :)
> >
> > Yes, the thing that makes pmem look like a block device :) --
> > drivers/nvdimm/pmem.c
> >
> >>
> >> > This effort is to try and make this badblock checking smarter - and try
> >> > and reduce the penalty on every IO to a smaller range, which only the
> >> > filesystem can do.
> >>
> >> Though... now that XFS merged the reverse mapping support, I've been
> >> wondering if there'll be a resubmission of the device errors callback?
> >> It still would be useful to be able to inform the user that part of
> >> their fs has gone bad, or, better yet, if the buffer is still in memory
> >> someplace else, just write it back out.
> >>
> >> Or I suppose if we had some kind of raid1 set up between memories we
> >> could read one of the other copies and rewrite it into the failing
> >> region immediately.
> >
> > Yes, that is kind of what I was hoping to accomplish via this
> > discussion. How much would filesystems want to be involved in this sort
> > of badblocks handling, if at all. I can refresh my patches that provide
> > the fs notification, but that's the easy bit, and a starting point.
> >
> 
> I have some questions. Why moving badblock handling to file system
> level avoid the checking phase? In file system level for each I/O I
> still have to check the badblock list, right? Do you mean during mount
> it can go through the pmem device and locates all the data structures
> mangled by badblocks and handle them accordingly, so that during
> normal running the badblocks will never be accessed? Or, if there is
> replicataion/snapshot support, use a copy to recover the badblocks?
> 
> How about operations bypass the file system, i.e. mmap?

Hi Andiry,

I do mean that in the filesystem, for every IO, the badblocks will be
checked. Currently, the pmem driver does this, and the hope is that the
filesystem can do a better job at it. The driver unconditionally checks
every IO for badblocks on the whole device. Depending on how the
badblocks are represented in the filesystem, we might be able to quickly
tell if a file/range has existing badblocks, and error out the IO
accordingly.

At mount the the fs would read the existing badblocks on the block
device, and build its own representation of them. Then during normal
use, if the underlying badblocks change, the fs would get a notification
that would allow it to also update its own representation.

Yes, if there is replication etc support in the filesystem, we could try
to recover using that, but I haven't put much thought in that direction.

Like I said in a previous reply, mmap can be a tricky case, and other
than handling the machine check exception, there may not be anything
else we can do.. 
If the range we're faulting on has known errors in badblocks, the fault
will fail with SIGBUS (see where pmem_direct_access() fails due to
badblocks). For latent errors that are not known in badblocks, if the
platform has MCE recovery, there is an MCE handler for pmem currently,
that will add that address to badblocks. If MCE recovery is absent, then
the system will crash/reboot, and the next time the driver populates
badblocks, that address will appear in it.
> 
> >>
> >> > > > A while back, Dave Chinner had suggested a move towards smarter
> >> > > > handling, and I posted initial RFC patches [1], but since then the topic
> >> > > > hasn't really moved forward.
> >> > > >
> >> > > > I'd like to propose and have a discussion about the following new
> >> > > > functionality:
> >> > > >
> >> > > > 1. Filesystems develop a native representation of badblocks. For
> >> > > > example, in xfs, this would (presumably) be linked to the reverse
> >> > > > mapping btree. The filesystem representation has the potential to be
> >> > > > more efficient than the block driver doing the check, as the fs can
> >> > > > check the IO happening on a file against just that file's range.
> >>
> >> OTOH that means we'd have to check /every/ file IO request against the
> >> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
> >> preferable just to let the underlying pmem driver throw an error at us.
> >>
> >> (Or possibly just cache the bad extents in memory.)
> >
> > Interesting - this would be a good discussion to have. My motivation for
> > this was the reasoning that the pmem driver has to check every single IO
> > against badblocks, and maybe the fs can do a better job. But if you
> > think the fs will actually be slower, we should try to somehow benchmark
> > that!
> >
> >>
> >> > > What do you mean by "file system can check the IO happening on a file"?
> >> > > Do you mean read or write operation? What's about metadata?
> >> >
> >> > For the purpose described above, i.e. returning early EIOs when
> >> > possible, this will be limited to reads and metadata reads. If we're
> >> > about to do a metadata read, and realize the block(s) about to be read
> >> > are on the badblocks list, then we do the same thing as when we discover
> >> > other kinds of metadata corruption.
> >>
> >> ...fail and shut down? :)
> >>
> >> Actually, for metadata either we look at the xfs_bufs to see if it's in
> >> memory (XFS doesn't directly access metadata) and write it back out; or
> >> we could fire up the online repair tool to rebuild the metadata.
> >
> > Agreed, I was just stressing that this scenario does not change from
> > status quo, and really recovering from corruption isn't the problem
> > we're trying to solve here :)
> >
> >>
> >> > > If we are talking about the discovering a bad block on read operation then
> >> > > rare modern file system is able to survive as for the case of metadata as
> >> > > for the case of user data. Let's imagine that we have really mature file
> >> > > system driver then what does it mean to encounter a bad block? The failure
> >> > > to read a logical block of some metadata (bad block) means that we are
> >> > > unable to extract some part of a metadata structure. From file system
> >> > > driver point of view, it looks like that our file system is corrupted, we need
> >> > > to stop the file system operations and, finally, to check and recover file
> >> > > system volume by means of fsck tool. If we find a bad block for some
> >> > > user file then, again, it looks like an issue. Some file systems simply
> >> > > return "unrecovered read error". Another one, theoretically, is able
> >> > > to survive because of snapshots, for example. But, anyway, it will look
> >> > > like as Read-Only mount state and the user will need to resolve such
> >> > > trouble by hands.
> >> >
> >> > As far as I can tell, all of these things remain the same. The goal here
> >> > isn't to survive more NVM badblocks than we would've before, and lost
> >> > data or lost metadata will continue to have the same consequences as
> >> > before, and will need the same recovery actions/intervention as before.
> >> > The goal is to make the failure model similar to what users expect
> >> > today, and as much as possible make recovery actions too similarly
> >> > intuitive.
> >> >
> >> > >
> >> > > If we are talking about discovering a bad block during write operation then,
> >> > > again, we are in trouble. Usually, we are using asynchronous model
> >> > > of write/flush operation. We are preparing the consistent state of all our
> >> > > metadata structures in the memory, at first. The flush operations for metadata
> >> > > and user data can be done in different times. And what should be done if we
> >> > > discover bad block for any piece of metadata or user data? Simple tracking of
> >> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
> >> > > write some file's block successfully then we have two ways: (1) forget about
> >> > > this piece of data; (2) try to change the associated LBA for this piece of data.
> >> > > The operation of re-allocation LBA number for discovered bad block
> >> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
> >> > > that track the location of this part of file. And it sounds as practically
> >> > > impossible operation, for the case of LFS file system, for example.
> >> > > If we have trouble with flushing any part of metadata then it sounds as
> >> > > complete disaster for any file system.
> >> >
> >> > Writes can get more complicated in certain cases. If it is a regular
> >> > page cache writeback, or any aligned write that goes through the block
> >> > driver, that is completely fine. The block driver will check that the
> >> > block was previously marked as bad, do a "clear poison" operation
> >> > (defined in the ACPI spec), which tells the firmware that the poison bit
> >> > is not OK to be cleared, and writes the new data. This also removes the
> >> > block from the badblocks list, and in this scheme, triggers a
> >> > notification to the filesystem that it too can remove the block from its
> >> > accounting. mmap writes and DAX can get more complicated, and at times
> >> > they will just trigger a SIGBUS, and there's no way around that.
> >> >
> >> > >
> >> > > Are you really sure that file system should process bad block issue?
> >> > >
> >> > > >In contrast, today, the block driver checks against the whole block device
> >> > > > range for every IO. On encountering badblocks, the filesystem can
> >> > > > generate a better notification/error message that points the user to
> >> > > > (file, offset) as opposed to the block driver, which can only provide
> >> > > > (block-device, sector).
> >>
> >> <shrug> We can do the translation with the backref info...
> >
> > Yes we should at least do that. I'm guessing this would happen in XFS
> > when it gets an EIO from an IO submission? The bio submission path in
> > the fs is probably not synchronous (correct?), but whenever it gets the
> > EIO, I'm guessing we just print a loud error message after doing the
> > backref lookup..
> >
> >>
> >> > > > 2. The block layer adds a notifier to badblock addition/removal
> >> > > > operations, which the filesystem subscribes to, and uses to maintain its
> >> > > > badblocks accounting. (This part is implemented as a proof of concept in
> >> > > > the RFC mentioned above [1]).
> >> > >
> >> > > I am not sure that any bad block notification during/after IO operation
> >> > > is valuable for file system. Maybe, it could help if file system simply will
> >> > > know about bad block beforehand the operation of logical block allocation.
> >> > > But what subsystem will discover bad blocks before any IO operations?
> >> > > How file system will receive information or some bad block table?
> >> >
> >> > The driver populates its badblocks lists whenever an Address Range Scrub
> >> > is started (also via ACPI methods). This is always done at
> >> > initialization time, so that it can build an in-memory representation of
> >> > the badblocks. Additionally, this can also be triggered manually. And
> >> > finally badblocks can also get populated for new latent errors when a
> >> > machine check exception occurs. All of these can trigger notification to
> >> > the file system without actual user reads happening.
> >> >
> >> > > I am not convinced that suggested badblocks approach is really feasible.
> >> > > Also I am not sure that file system should see the bad blocks at all.
> >> > > Why hardware cannot manage this issue for us?
> >> >
> >> > Hardware does manage the actual badblocks issue for us in the sense that
> >> > when it discovers a badblock it will do the remapping. But since this is
> >> > on the memory bus, and has different error signatures than applications
> >> > are used to, we want to make the error handling similar to the existing
> >> > storage model.
> >>
> >> Yes please and thank you, to the "error handling similar to the existing
> >> storage model".  Even better if this just gets added to a layer
> >> underneath the fs so that IO to bad regions returns EIO. 8-)
> >
> > This (if this just gets added to a layer underneath the fs so that IO to bad
> > regions returns EIO) already happens :)  See pmem_do_bvec() in
> > drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
> > read. I'm wondering if this can be improved..
> >
> 
> The pmem_do_bvec() read logic is like this:
> 
> pmem_do_bvec()
>     if (is_bad_pmem())
>         return -EIO;
>     else
>         memcpy_from_pmem();
> 
> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> that even if a block is not in the badblock list, it still can be bad
> and causes MCE? Does the badblock list get changed during file system
> running? If that is the case, should the file system get a
> notification when it gets changed? If a block is good when I first
> read it, can I still trust it to be good for the second access?

Yes, if a block is not in the badblocks list, it can still cause an
MCE. This is the latent error case I described above. For a simple read()
via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
an MCE is inevitable.

Yes the badblocks list may change while a filesystem is running. The RFC
patches[1] I linked to add a notification for the filesystem when this
happens.

No, if the media, for some reason, 'dvelops' a bad cell, a second
consecutive read does have a chance of being bad. Once a location has
been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
been called to mark it as clean.

[1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html


> 
> Thanks,
> Andiry
> 
> >>
> >> (Sleeeeep...)
> >>
> >> --D
> >>
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 14:37           ` Jan Kara
  (?)
@ 2017-01-17 23:15             ` Slava Dubeyko
  -1 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-17 23:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc


-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Tuesday, January 17, 2017 6:37 AM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; lsf-pc@lists.linux-foundation.org; Viacheslav Dubeyko <slava@dubeyko.com>; linux-nvdimm@lists.01.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

> > > We don't have direct physical access to the device's address space, 
> > > in the sense the device is still free to perform remapping of chunks of NVM underneath us.
> > > The problem is that when a block or address range (as small as a 
> > > cache line) goes bad, the device maintains a poison bit for every 
> > > affected cache line. Behind the scenes, it may have already remapped 
> > > the range, but the cache line poison has to be kept so that there is a notification to the user/owner of the data that something has been lost.
> > > Since NVM is byte addressable memory sitting on the memory bus, such 
> > > a poisoned cache line results in memory errors and SIGBUSes.
> > > Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > > The whole badblocks implementation was done so that the driver can 
> > > intercept IO (i.e. reads) to _known_ bad locations, and 
> > > short-circuit them with an EIO. If the driver doesn't catch these, the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> > >
> > > This effort is to try and make this badblock checking smarter - and 
> > > try and reduce the penalty on every IO to a smaller range, which only the filesystem can do.

> Well, the situation with NVM is more like with DRAM AFAIU. It is quite reliable
> but given the size the probability *some* cell has degraded is quite high.
> And similar to DRAM you'll get MCE (Machine Check Exception) when you try
> to read such cell. As Vishal wrote, the hardware does some background scrubbing
> and relocates stuff early if needed but nothing is 100%.

My understanding that hardware does the remapping the affected address
range (64 bytes, for example) but it doesn't move/migrate the stored data in this address
range. So, it sounds slightly weird. Because it means that no guarantee to retrieve the stored
data. It sounds that file system should be aware about this and has to be heavily protected
by some replication or erasure coding scheme. Otherwise, if the hardware does everything for us
(remap the affected address region and move data into a new address region) then why
does file system need to know about the affected address regions?

> The reason why we play games with badblocks is to avoid those MCEs
> (i.e., even trying to read the data we know that are bad). Even if it would
> be rare event, MCE may mean the machine just immediately reboots
> (although I find such platforms hardly usable with NVM then) and that
> is no good. And even on hardware platforms that allow for more graceful
> recovery from MCE it is asynchronous in its nature and our error handling
> around IO is all synchronous so it is difficult to join these two models together.
>
> But I think it is a good question to ask whether we cannot improve on MCE handling
> instead of trying to avoid them and pushing around responsibility for handling
> bad blocks. Actually I thought someone was working on that.
> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
> well identified anyway so that we can consult the badblocks list) so that it MCE
> happens during these accesses, we note it somewhere and at the end of the magic
> block we will just pick up the errors and report them back?

Let's imagine that the affected address range will equal to 64 bytes. It sounds for me
that for the case of block device it will affect the whole logical block (4 KB). If the failure
rate of address ranges could be significant then it would affect a lot of logical blocks.
It looks like a complete nightmare for the file system. Especially, if we discover such
issue on the read operation. Again, LBA means logical block address. It sounds for me
that this guy should be valid always. Otherwise, we crash the whole concept.

The situation is more critical for the case of DAX approach. Correct me if I wrong but
my understanding is the goal of DAX is to provide the direct access to file's memory
pages with minimal file system overhead. So, it looks like that raising bad block issue
on file system level will affect a user-space application. Because, finally, user-space
application will need to process such trouble (bad block issue). It sounds for me as really
weird situation. What can protect a user-space application from encountering the issue
with partially incorrect memory page?

> > OK. Let's imagine that NVM memory device hasn't any internal error 
> > correction hardware-based scheme. Next level of defense could be any 
> > erasure coding scheme on device driver level. So, any piece of data 
> > can be protected by parities. And device driver will be responsible 
> > for management of erasure coding scheme. It will increase latency of 
> > read operation for the case of necessity to recover the affected memory page.
> > But, finally, all recovering activity will be behind the scene and 
> > file system will be unaware about such recovering activity.
>
> Note that your options are limited by the byte addressability and
> the direct CPU access to the memory. But even with these limitations
> it is not that error rate would but unusually high, it is just not zero.
 
Even for the case of byte addressability, I cannot see any troubles
with using some error correction or erasure coding schemes
inside of the memory chip. Especially, for the rare case of such issue
the latency of device operations will be pretty OK.

> > If you are going not to provide any erasure coding or error correction 
> > scheme then it's really bad case. The fsck tool is not regular case 
> > tool but the last resort. If you are going to rely on the fsck tool 
> > then simply forget about using your hardware. Some file systems 
> > haven't the fsck tool at all. Some guys really believe that file 
> > system has to work without support of the fsck tool.  Even if a mature 
> > file system has reliable fsck tool then the probability of file system 
> > recovering is very low in the case of serious metadata corruptions. 
> > So, it means that you are trying to suggest the technique when we will 
> > lose the whole file system volumes on regular basis without any hope 
> > to recover data. Even if file system has snapshots then, again, we 
> > haven't hope because we can suffer from read error and for operation with snapshot.
>
> I hope I have cleared out that this is not about higher error rate
> of persistent memory above. As a side note, XFS guys are working on automatic
> background scrubbing and online filesystem checking. Not specifically for persistent
> memory but simply because with growing size of the filesystem the likelihood of
> some problem somewhere is growing. 
 
I see your point but even for low error rate you cannot predict what logical
block can be affected by such issue. Even the online file system checking subsystem
cannot prevent from file system corruption. Because, for example, if you find
during a read operation that your btree's root node is corrupted then you
can lose the whole btree.

Thanks,
Vyacheslav Dubeyko.
 
Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 23:15             ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-17 23:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vishal Verma, linux-block, Linux FS Devel, lsf-pc,
	Viacheslav Dubeyko, linux-nvdimm


-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz]=20
Sent: Tuesday, January 17, 2017 6:37 AM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>; linux-block@vger.kernel.org; L=
inux FS Devel <linux-fsdevel@vger.kernel.org>; lsf-pc@lists.linux-foundatio=
n.org; Viacheslav Dubeyko <slava@dubeyko.com>; linux-nvdimm@lists.01.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in f=
ilesystems

> > > We don't have direct physical access to the device's address space,=20
> > > in the sense the device is still free to perform remapping of chunks =
of NVM underneath us.
> > > The problem is that when a block or address range (as small as a=20
> > > cache line) goes bad, the device maintains a poison bit for every=20
> > > affected cache line. Behind the scenes, it may have already remapped=
=20
> > > the range, but the cache line poison has to be kept so that there is =
a notification to the user/owner of the data that something has been lost.
> > > Since NVM is byte addressable memory sitting on the memory bus, such=
=20
> > > a poisoned cache line results in memory errors and SIGBUSes.
> > > Compared to tradational storage where an app will get nice and friend=
ly (relatively speaking..) -EIOs.
> > > The whole badblocks implementation was done so that the driver can=20
> > > intercept IO (i.e. reads) to _known_ bad locations, and=20
> > > short-circuit them with an EIO. If the driver doesn't catch these, th=
e reads will turn into a memory bus access, and the poison will cause a SIG=
BUS.
> > >
> > > This effort is to try and make this badblock checking smarter - and=20
> > > try and reduce the penalty on every IO to a smaller range, which only=
 the filesystem can do.

> Well, the situation with NVM is more like with DRAM AFAIU. It is quite re=
liable
> but given the size the probability *some* cell has degraded is quite high=
.
> And similar to DRAM you'll get MCE (Machine Check Exception) when you try
> to read such cell. As Vishal wrote, the hardware does some background scr=
ubbing
> and relocates stuff early if needed but nothing is 100%.

My understanding that hardware does the remapping the affected address
range (64 bytes, for example) but it doesn't move/migrate the stored data i=
n this address
range. So, it sounds slightly weird. Because it means that no guarantee to =
retrieve the stored
data. It sounds that file system should be aware about this and has to be h=
eavily protected
by some replication or erasure coding scheme. Otherwise, if the hardware do=
es everything for us
(remap the affected address region and move data into a new address region)=
 then why
does file system need to know about the affected address regions?

> The reason why we play games with badblocks is to avoid those MCEs
> (i.e., even trying to read the data we know that are bad). Even if it wou=
ld
> be rare event, MCE may mean the machine just immediately reboots
> (although I find such platforms hardly usable with NVM then) and that
> is no good. And even on hardware platforms that allow for more graceful
> recovery from MCE it is asynchronous in its nature and our error handling
> around IO is all synchronous so it is difficult to join these two models =
together.
>
> But I think it is a good question to ask whether we cannot improve on MCE=
 handling
> instead of trying to avoid them and pushing around responsibility for han=
dling
> bad blocks. Actually I thought someone was working on that.
> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are no=
w
> well identified anyway so that we can consult the badblocks list) so that=
 it MCE
> happens during these accesses, we note it somewhere and at the end of the=
 magic
> block we will just pick up the errors and report them back?

Let's imagine that the affected address range will equal to 64 bytes. It so=
unds for me
that for the case of block device it will affect the whole logical block (4=
 KB). If the failure
rate of address ranges could be significant then it would affect a lot of l=
ogical blocks.
It looks like a complete nightmare for the file system. Especially, if we d=
iscover such
issue on the read operation. Again, LBA means logical block address. It sou=
nds for me
that this guy should be valid always. Otherwise, we crash the whole concept=
.

The situation is more critical for the case of DAX approach. Correct me if =
I wrong but
my understanding is the goal of DAX is to provide the direct access to file=
's memory
pages with minimal file system overhead. So, it looks like that raising bad=
 block issue
on file system level will affect a user-space application. Because, finally=
, user-space
application will need to process such trouble (bad block issue). It sounds =
for me as really
weird situation. What can protect a user-space application from encounterin=
g the issue
with partially incorrect memory page?

> > OK. Let's imagine that NVM memory device hasn't any internal error=20
> > correction hardware-based scheme. Next level of defense could be any=20
> > erasure coding scheme on device driver level. So, any piece of data=20
> > can be protected by parities. And device driver will be responsible=20
> > for management of erasure coding scheme. It will increase latency of=20
> > read operation for the case of necessity to recover the affected memory=
 page.
> > But, finally, all recovering activity will be behind the scene and=20
> > file system will be unaware about such recovering activity.
>
> Note that your options are limited by the byte addressability and
> the direct CPU access to the memory. But even with these limitations
> it is not that error rate would but unusually high, it is just not zero.
=20
Even for the case of byte addressability, I cannot see any troubles
with using some error correction or erasure coding schemes
inside of the memory chip. Especially, for the rare case of such issue
the latency of device operations will be pretty OK.

> > If you are going not to provide any erasure coding or error correction=
=20
> > scheme then it's really bad case. The fsck tool is not regular case=20
> > tool but the last resort. If you are going to rely on the fsck tool=20
> > then simply forget about using your hardware. Some file systems=20
> > haven't the fsck tool at all. Some guys really believe that file=20
> > system has to work without support of the fsck tool.  Even if a mature=
=20
> > file system has reliable fsck tool then the probability of file system=
=20
> > recovering is very low in the case of serious metadata corruptions.=20
> > So, it means that you are trying to suggest the technique when we will=
=20
> > lose the whole file system volumes on regular basis without any hope=20
> > to recover data. Even if file system has snapshots then, again, we=20
> > haven't hope because we can suffer from read error and for operation wi=
th snapshot.
>
> I hope I have cleared out that this is not about higher error rate
> of persistent memory above. As a side note, XFS guys are working on autom=
atic
> background scrubbing and online filesystem checking. Not specifically for=
 persistent
> memory but simply because with growing size of the filesystem the likelih=
ood of
> some problem somewhere is growing.=20
=20
I see your point but even for low error rate you cannot predict what logica=
l
block can be affected by such issue. Even the online file system checking s=
ubsystem
cannot prevent from file system corruption. Because, for example, if you fi=
nd
during a read operation that your btree's root node is corrupted then you
can lose the whole btree.

Thanks,
Vyacheslav Dubeyko.
=20

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 23:15             ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-17 23:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vishal Verma, linux-block, Linux FS Devel, lsf-pc,
	Viacheslav Dubeyko, linux-nvdimm


-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Tuesday, January 17, 2017 6:37 AM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>; linux-block@vger.kernel.org; Linux FS Devel <linux-fsdevel@vger.kernel.org>; lsf-pc@lists.linux-foundation.org; Viacheslav Dubeyko <slava@dubeyko.com>; linux-nvdimm@lists.01.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

> > > We don't have direct physical access to the device's address space, 
> > > in the sense the device is still free to perform remapping of chunks of NVM underneath us.
> > > The problem is that when a block or address range (as small as a 
> > > cache line) goes bad, the device maintains a poison bit for every 
> > > affected cache line. Behind the scenes, it may have already remapped 
> > > the range, but the cache line poison has to be kept so that there is a notification to the user/owner of the data that something has been lost.
> > > Since NVM is byte addressable memory sitting on the memory bus, such 
> > > a poisoned cache line results in memory errors and SIGBUSes.
> > > Compared to tradational storage where an app will get nice and friendly (relatively speaking..) -EIOs.
> > > The whole badblocks implementation was done so that the driver can 
> > > intercept IO (i.e. reads) to _known_ bad locations, and 
> > > short-circuit them with an EIO. If the driver doesn't catch these, the reads will turn into a memory bus access, and the poison will cause a SIGBUS.
> > >
> > > This effort is to try and make this badblock checking smarter - and 
> > > try and reduce the penalty on every IO to a smaller range, which only the filesystem can do.

> Well, the situation with NVM is more like with DRAM AFAIU. It is quite reliable
> but given the size the probability *some* cell has degraded is quite high.
> And similar to DRAM you'll get MCE (Machine Check Exception) when you try
> to read such cell. As Vishal wrote, the hardware does some background scrubbing
> and relocates stuff early if needed but nothing is 100%.

My understanding that hardware does the remapping the affected address
range (64 bytes, for example) but it doesn't move/migrate the stored data in this address
range. So, it sounds slightly weird. Because it means that no guarantee to retrieve the stored
data. It sounds that file system should be aware about this and has to be heavily protected
by some replication or erasure coding scheme. Otherwise, if the hardware does everything for us
(remap the affected address region and move data into a new address region) then why
does file system need to know about the affected address regions?

> The reason why we play games with badblocks is to avoid those MCEs
> (i.e., even trying to read the data we know that are bad). Even if it would
> be rare event, MCE may mean the machine just immediately reboots
> (although I find such platforms hardly usable with NVM then) and that
> is no good. And even on hardware platforms that allow for more graceful
> recovery from MCE it is asynchronous in its nature and our error handling
> around IO is all synchronous so it is difficult to join these two models together.
>
> But I think it is a good question to ask whether we cannot improve on MCE handling
> instead of trying to avoid them and pushing around responsibility for handling
> bad blocks. Actually I thought someone was working on that.
> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
> well identified anyway so that we can consult the badblocks list) so that it MCE
> happens during these accesses, we note it somewhere and at the end of the magic
> block we will just pick up the errors and report them back?

Let's imagine that the affected address range will equal to 64 bytes. It sounds for me
that for the case of block device it will affect the whole logical block (4 KB). If the failure
rate of address ranges could be significant then it would affect a lot of logical blocks.
It looks like a complete nightmare for the file system. Especially, if we discover such
issue on the read operation. Again, LBA means logical block address. It sounds for me
that this guy should be valid always. Otherwise, we crash the whole concept.

The situation is more critical for the case of DAX approach. Correct me if I wrong but
my understanding is the goal of DAX is to provide the direct access to file's memory
pages with minimal file system overhead. So, it looks like that raising bad block issue
on file system level will affect a user-space application. Because, finally, user-space
application will need to process such trouble (bad block issue). It sounds for me as really
weird situation. What can protect a user-space application from encountering the issue
with partially incorrect memory page?

> > OK. Let's imagine that NVM memory device hasn't any internal error 
> > correction hardware-based scheme. Next level of defense could be any 
> > erasure coding scheme on device driver level. So, any piece of data 
> > can be protected by parities. And device driver will be responsible 
> > for management of erasure coding scheme. It will increase latency of 
> > read operation for the case of necessity to recover the affected memory page.
> > But, finally, all recovering activity will be behind the scene and 
> > file system will be unaware about such recovering activity.
>
> Note that your options are limited by the byte addressability and
> the direct CPU access to the memory. But even with these limitations
> it is not that error rate would but unusually high, it is just not zero.
 
Even for the case of byte addressability, I cannot see any troubles
with using some error correction or erasure coding schemes
inside of the memory chip. Especially, for the rare case of such issue
the latency of device operations will be pretty OK.

> > If you are going not to provide any erasure coding or error correction 
> > scheme then it's really bad case. The fsck tool is not regular case 
> > tool but the last resort. If you are going to rely on the fsck tool 
> > then simply forget about using your hardware. Some file systems 
> > haven't the fsck tool at all. Some guys really believe that file 
> > system has to work without support of the fsck tool.  Even if a mature 
> > file system has reliable fsck tool then the probability of file system 
> > recovering is very low in the case of serious metadata corruptions. 
> > So, it means that you are trying to suggest the technique when we will 
> > lose the whole file system volumes on regular basis without any hope 
> > to recover data. Even if file system has snapshots then, again, we 
> > haven't hope because we can suffer from read error and for operation with snapshot.
>
> I hope I have cleared out that this is not about higher error rate
> of persistent memory above. As a side note, XFS guys are working on automatic
> background scrubbing and online filesystem checking. Not specifically for persistent
> memory but simply because with growing size of the filesystem the likelihood of
> some problem somewhere is growing. 
 
I see your point but even for low error rate you cannot predict what logical
block can be affected by such issue. Even the online file system checking subsystem
cannot prevent from file system corruption. Because, for example, if you find
during a read operation that your btree's root node is corrupted then you
can lose the whole btree.

Thanks,
Vyacheslav Dubeyko.
 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:37               ` Vishal Verma
@ 2017-01-17 23:20                 ` Andiry Xu
  -1 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-17 23:20 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Tue, Jan 17, 2017 at 2:37 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/17, Andiry Xu wrote:
>> Hi,
>>
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>> > On 01/16, Darrick J. Wong wrote:
>> >> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>> >> > On 01/14, Slava Dubeyko wrote:
>> >> > >
>> >> > > ---- Original Message ----
>> >> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>> >> > > Sent: Jan 13, 2017 1:40 PM
>> >> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>> >> > > To: lsf-pc@lists.linux-foundation.org
>> >> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>> >> > >
>> >> > > > The current implementation of badblocks, where we consult the badblocks
>> >> > > > list for every IO in the block driver works, and is a last option
>> >> > > > failsafe, but from a user perspective, it isn't the easiest interface to
>> >> > > > work with.
>> >> > >
>> >> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
>> >> > > (physical sectors) table. I believe that this table was used for the case of
>> >> > > floppy media. But, finally, this table becomes to be the completely obsolete
>> >> > > artefact because mostly storage devices are reliably enough. Why do you need
>> >>
>> >> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>> >> doesn't support(??) extents or 64-bit filesystems, and might just be a
>> >> vestigial organ at this point.  XFS doesn't have anything to track bad
>> >> blocks currently....
>> >>
>> >> > > in exposing the bad blocks on the file system level?  Do you expect that next
>> >> > > generation of NVM memory will be so unreliable that file system needs to manage
>> >> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>> >> > > from the bad block issue?
>> >> > >
>> >> > > Usually, we are using LBAs and it is the responsibility of storage device to map
>> >> > > a bad physical block/page/sector into valid one. Do you mean that we have
>> >> > > access to physical NVM memory address directly? But it looks like that we can
>> >> > > have a "bad block" issue even we will access data into page cache's memory
>> >> > > page (if we will use NVM memory for page cache, of course). So, what do you
>> >> > > imply by "bad block" issue?
>> >> >
>> >> > We don't have direct physical access to the device's address space, in
>> >> > the sense the device is still free to perform remapping of chunks of NVM
>> >> > underneath us. The problem is that when a block or address range (as
>> >> > small as a cache line) goes bad, the device maintains a poison bit for
>> >> > every affected cache line. Behind the scenes, it may have already
>> >> > remapped the range, but the cache line poison has to be kept so that
>> >> > there is a notification to the user/owner of the data that something has
>> >> > been lost. Since NVM is byte addressable memory sitting on the memory
>> >> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
>> >> > Compared to tradational storage where an app will get nice and friendly
>> >> > (relatively speaking..) -EIOs. The whole badblocks implementation was
>> >> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
>> >> > locations, and short-circuit them with an EIO. If the driver doesn't
>> >> > catch these, the reads will turn into a memory bus access, and the
>> >> > poison will cause a SIGBUS.
>> >>
>> >> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>> >> look kind of like a traditional block device? :)
>> >
>> > Yes, the thing that makes pmem look like a block device :) --
>> > drivers/nvdimm/pmem.c
>> >
>> >>
>> >> > This effort is to try and make this badblock checking smarter - and try
>> >> > and reduce the penalty on every IO to a smaller range, which only the
>> >> > filesystem can do.
>> >>
>> >> Though... now that XFS merged the reverse mapping support, I've been
>> >> wondering if there'll be a resubmission of the device errors callback?
>> >> It still would be useful to be able to inform the user that part of
>> >> their fs has gone bad, or, better yet, if the buffer is still in memory
>> >> someplace else, just write it back out.
>> >>
>> >> Or I suppose if we had some kind of raid1 set up between memories we
>> >> could read one of the other copies and rewrite it into the failing
>> >> region immediately.
>> >
>> > Yes, that is kind of what I was hoping to accomplish via this
>> > discussion. How much would filesystems want to be involved in this sort
>> > of badblocks handling, if at all. I can refresh my patches that provide
>> > the fs notification, but that's the easy bit, and a starting point.
>> >
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replicataion/snapshot support, use a copy to recover the badblocks?
>>
>> How about operations bypass the file system, i.e. mmap?
>
> Hi Andiry,
>
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
>
> At mount the the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.
>
> Yes, if there is replication etc support in the filesystem, we could try
> to recover using that, but I haven't put much thought in that direction.
>
> Like I said in a previous reply, mmap can be a tricky case, and other
> than handling the machine check exception, there may not be anything
> else we can do..
> If the range we're faulting on has known errors in badblocks, the fault
> will fail with SIGBUS (see where pmem_direct_access() fails due to
> badblocks). For latent errors that are not known in badblocks, if the
> platform has MCE recovery, there is an MCE handler for pmem currently,
> that will add that address to badblocks. If MCE recovery is absent, then
> the system will crash/reboot, and the next time the driver populates
> badblocks, that address will appear in it.

Thank you for reply. That is very clear.

>>
>> >>
>> >> > > > A while back, Dave Chinner had suggested a move towards smarter
>> >> > > > handling, and I posted initial RFC patches [1], but since then the topic
>> >> > > > hasn't really moved forward.
>> >> > > >
>> >> > > > I'd like to propose and have a discussion about the following new
>> >> > > > functionality:
>> >> > > >
>> >> > > > 1. Filesystems develop a native representation of badblocks. For
>> >> > > > example, in xfs, this would (presumably) be linked to the reverse
>> >> > > > mapping btree. The filesystem representation has the potential to be
>> >> > > > more efficient than the block driver doing the check, as the fs can
>> >> > > > check the IO happening on a file against just that file's range.
>> >>
>> >> OTOH that means we'd have to check /every/ file IO request against the
>> >> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
>> >> preferable just to let the underlying pmem driver throw an error at us.
>> >>
>> >> (Or possibly just cache the bad extents in memory.)
>> >
>> > Interesting - this would be a good discussion to have. My motivation for
>> > this was the reasoning that the pmem driver has to check every single IO
>> > against badblocks, and maybe the fs can do a better job. But if you
>> > think the fs will actually be slower, we should try to somehow benchmark
>> > that!
>> >
>> >>
>> >> > > What do you mean by "file system can check the IO happening on a file"?
>> >> > > Do you mean read or write operation? What's about metadata?
>> >> >
>> >> > For the purpose described above, i.e. returning early EIOs when
>> >> > possible, this will be limited to reads and metadata reads. If we're
>> >> > about to do a metadata read, and realize the block(s) about to be read
>> >> > are on the badblocks list, then we do the same thing as when we discover
>> >> > other kinds of metadata corruption.
>> >>
>> >> ...fail and shut down? :)
>> >>
>> >> Actually, for metadata either we look at the xfs_bufs to see if it's in
>> >> memory (XFS doesn't directly access metadata) and write it back out; or
>> >> we could fire up the online repair tool to rebuild the metadata.
>> >
>> > Agreed, I was just stressing that this scenario does not change from
>> > status quo, and really recovering from corruption isn't the problem
>> > we're trying to solve here :)
>> >
>> >>
>> >> > > If we are talking about the discovering a bad block on read operation then
>> >> > > rare modern file system is able to survive as for the case of metadata as
>> >> > > for the case of user data. Let's imagine that we have really mature file
>> >> > > system driver then what does it mean to encounter a bad block? The failure
>> >> > > to read a logical block of some metadata (bad block) means that we are
>> >> > > unable to extract some part of a metadata structure. From file system
>> >> > > driver point of view, it looks like that our file system is corrupted, we need
>> >> > > to stop the file system operations and, finally, to check and recover file
>> >> > > system volume by means of fsck tool. If we find a bad block for some
>> >> > > user file then, again, it looks like an issue. Some file systems simply
>> >> > > return "unrecovered read error". Another one, theoretically, is able
>> >> > > to survive because of snapshots, for example. But, anyway, it will look
>> >> > > like as Read-Only mount state and the user will need to resolve such
>> >> > > trouble by hands.
>> >> >
>> >> > As far as I can tell, all of these things remain the same. The goal here
>> >> > isn't to survive more NVM badblocks than we would've before, and lost
>> >> > data or lost metadata will continue to have the same consequences as
>> >> > before, and will need the same recovery actions/intervention as before.
>> >> > The goal is to make the failure model similar to what users expect
>> >> > today, and as much as possible make recovery actions too similarly
>> >> > intuitive.
>> >> >
>> >> > >
>> >> > > If we are talking about discovering a bad block during write operation then,
>> >> > > again, we are in trouble. Usually, we are using asynchronous model
>> >> > > of write/flush operation. We are preparing the consistent state of all our
>> >> > > metadata structures in the memory, at first. The flush operations for metadata
>> >> > > and user data can be done in different times. And what should be done if we
>> >> > > discover bad block for any piece of metadata or user data? Simple tracking of
>> >> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
>> >> > > write some file's block successfully then we have two ways: (1) forget about
>> >> > > this piece of data; (2) try to change the associated LBA for this piece of data.
>> >> > > The operation of re-allocation LBA number for discovered bad block
>> >> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
>> >> > > that track the location of this part of file. And it sounds as practically
>> >> > > impossible operation, for the case of LFS file system, for example.
>> >> > > If we have trouble with flushing any part of metadata then it sounds as
>> >> > > complete disaster for any file system.
>> >> >
>> >> > Writes can get more complicated in certain cases. If it is a regular
>> >> > page cache writeback, or any aligned write that goes through the block
>> >> > driver, that is completely fine. The block driver will check that the
>> >> > block was previously marked as bad, do a "clear poison" operation
>> >> > (defined in the ACPI spec), which tells the firmware that the poison bit
>> >> > is not OK to be cleared, and writes the new data. This also removes the
>> >> > block from the badblocks list, and in this scheme, triggers a
>> >> > notification to the filesystem that it too can remove the block from its
>> >> > accounting. mmap writes and DAX can get more complicated, and at times
>> >> > they will just trigger a SIGBUS, and there's no way around that.
>> >> >
>> >> > >
>> >> > > Are you really sure that file system should process bad block issue?
>> >> > >
>> >> > > >In contrast, today, the block driver checks against the whole block device
>> >> > > > range for every IO. On encountering badblocks, the filesystem can
>> >> > > > generate a better notification/error message that points the user to
>> >> > > > (file, offset) as opposed to the block driver, which can only provide
>> >> > > > (block-device, sector).
>> >>
>> >> <shrug> We can do the translation with the backref info...
>> >
>> > Yes we should at least do that. I'm guessing this would happen in XFS
>> > when it gets an EIO from an IO submission? The bio submission path in
>> > the fs is probably not synchronous (correct?), but whenever it gets the
>> > EIO, I'm guessing we just print a loud error message after doing the
>> > backref lookup..
>> >
>> >>
>> >> > > > 2. The block layer adds a notifier to badblock addition/removal
>> >> > > > operations, which the filesystem subscribes to, and uses to maintain its
>> >> > > > badblocks accounting. (This part is implemented as a proof of concept in
>> >> > > > the RFC mentioned above [1]).
>> >> > >
>> >> > > I am not sure that any bad block notification during/after IO operation
>> >> > > is valuable for file system. Maybe, it could help if file system simply will
>> >> > > know about bad block beforehand the operation of logical block allocation.
>> >> > > But what subsystem will discover bad blocks before any IO operations?
>> >> > > How file system will receive information or some bad block table?
>> >> >
>> >> > The driver populates its badblocks lists whenever an Address Range Scrub
>> >> > is started (also via ACPI methods). This is always done at
>> >> > initialization time, so that it can build an in-memory representation of
>> >> > the badblocks. Additionally, this can also be triggered manually. And
>> >> > finally badblocks can also get populated for new latent errors when a
>> >> > machine check exception occurs. All of these can trigger notification to
>> >> > the file system without actual user reads happening.
>> >> >
>> >> > > I am not convinced that suggested badblocks approach is really feasible.
>> >> > > Also I am not sure that file system should see the bad blocks at all.
>> >> > > Why hardware cannot manage this issue for us?
>> >> >
>> >> > Hardware does manage the actual badblocks issue for us in the sense that
>> >> > when it discovers a badblock it will do the remapping. But since this is
>> >> > on the memory bus, and has different error signatures than applications
>> >> > are used to, we want to make the error handling similar to the existing
>> >> > storage model.
>> >>
>> >> Yes please and thank you, to the "error handling similar to the existing
>> >> storage model".  Even better if this just gets added to a layer
>> >> underneath the fs so that IO to bad regions returns EIO. 8-)
>> >
>> > This (if this just gets added to a layer underneath the fs so that IO to bad
>> > regions returns EIO) already happens :)  See pmem_do_bvec() in
>> > drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
>> > read. I'm wondering if this can be improved..
>> >
>>
>> The pmem_do_bvec() read logic is like this:
>>
>> pmem_do_bvec()
>>     if (is_bad_pmem())
>>         return -EIO;
>>     else
>>         memcpy_from_pmem();
>>
>> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> that even if a block is not in the badblock list, it still can be bad
>> and causes MCE? Does the badblock list get changed during file system
>> running? If that is the case, should the file system get a
>> notification when it gets changed? If a block is good when I first
>> read it, can I still trust it to be good for the second access?
>
> Yes, if a block is not in the badblocks list, it can still cause an
> MCE. This is the latent error case I described above. For a simple read()
> via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> an MCE is inevitable.
>
> Yes the badblocks list may change while a filesystem is running. The RFC
> patches[1] I linked to add a notification for the filesystem when this
> happens.
>

This is really bad and it makes file system implementation much more
complicated. And badblock notification does not help very much,
because any block can be bad potentially, no matter it is in badblock
list or not. And file system has to perform checking for every read,
using memcpy_mcsafe. This is disaster for file system like NOVA, which
uses pointer de-reference to access data structures on pmem. Now if I
want to read a field in an inode on pmem, I have to copy it to DRAM
first and make sure memcpy_mcsafe() does not report anything wrong.

> No, if the media, for some reason, 'dvelops' a bad cell, a second
> consecutive read does have a chance of being bad. Once a location has
> been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> been called to mark it as clean.
>

I wonder what happens to write in this case? If a block is bad but not
reported in badblock list. Now I write to it without reading first. Do
I clear the poison with the write? Or still require a ACPI DSM?

> [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>

Thank you for the patchset. I will look into it.

Thanks,
Andiry

>
>>
>> Thanks,
>> Andiry
>>
>> >>
>> >> (Sleeeeep...)
>> >>
>> >> --D
>> >>
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 23:20                 ` Andiry Xu
  0 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-17 23:20 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

On Tue, Jan 17, 2017 at 2:37 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/17, Andiry Xu wrote:
>> Hi,
>>
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>> > On 01/16, Darrick J. Wong wrote:
>> >> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>> >> > On 01/14, Slava Dubeyko wrote:
>> >> > >
>> >> > > ---- Original Message ----
>> >> > > Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>> >> > > Sent: Jan 13, 2017 1:40 PM
>> >> > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>> >> > > To: lsf-pc@lists.linux-foundation.org
>> >> > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>> >> > >
>> >> > > > The current implementation of badblocks, where we consult the badblocks
>> >> > > > list for every IO in the block driver works, and is a last option
>> >> > > > failsafe, but from a user perspective, it isn't the easiest interface to
>> >> > > > work with.
>> >> > >
>> >> > > As I remember, FAT and HFS+ specifications contain description of bad blocks
>> >> > > (physical sectors) table. I believe that this table was used for the case of
>> >> > > floppy media. But, finally, this table becomes to be the completely obsolete
>> >> > > artefact because mostly storage devices are reliably enough. Why do you need
>> >>
>> >> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>> >> doesn't support(??) extents or 64-bit filesystems, and might just be a
>> >> vestigial organ at this point.  XFS doesn't have anything to track bad
>> >> blocks currently....
>> >>
>> >> > > in exposing the bad blocks on the file system level?  Do you expect that next
>> >> > > generation of NVM memory will be so unreliable that file system needs to manage
>> >> > > bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>> >> > > from the bad block issue?
>> >> > >
>> >> > > Usually, we are using LBAs and it is the responsibility of storage device to map
>> >> > > a bad physical block/page/sector into valid one. Do you mean that we have
>> >> > > access to physical NVM memory address directly? But it looks like that we can
>> >> > > have a "bad block" issue even we will access data into page cache's memory
>> >> > > page (if we will use NVM memory for page cache, of course). So, what do you
>> >> > > imply by "bad block" issue?
>> >> >
>> >> > We don't have direct physical access to the device's address space, in
>> >> > the sense the device is still free to perform remapping of chunks of NVM
>> >> > underneath us. The problem is that when a block or address range (as
>> >> > small as a cache line) goes bad, the device maintains a poison bit for
>> >> > every affected cache line. Behind the scenes, it may have already
>> >> > remapped the range, but the cache line poison has to be kept so that
>> >> > there is a notification to the user/owner of the data that something has
>> >> > been lost. Since NVM is byte addressable memory sitting on the memory
>> >> > bus, such a poisoned cache line results in memory errors and SIGBUSes.
>> >> > Compared to tradational storage where an app will get nice and friendly
>> >> > (relatively speaking..) -EIOs. The whole badblocks implementation was
>> >> > done so that the driver can intercept IO (i.e. reads) to _known_ bad
>> >> > locations, and short-circuit them with an EIO. If the driver doesn't
>> >> > catch these, the reads will turn into a memory bus access, and the
>> >> > poison will cause a SIGBUS.
>> >>
>> >> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>> >> look kind of like a traditional block device? :)
>> >
>> > Yes, the thing that makes pmem look like a block device :) --
>> > drivers/nvdimm/pmem.c
>> >
>> >>
>> >> > This effort is to try and make this badblock checking smarter - and try
>> >> > and reduce the penalty on every IO to a smaller range, which only the
>> >> > filesystem can do.
>> >>
>> >> Though... now that XFS merged the reverse mapping support, I've been
>> >> wondering if there'll be a resubmission of the device errors callback?
>> >> It still would be useful to be able to inform the user that part of
>> >> their fs has gone bad, or, better yet, if the buffer is still in memory
>> >> someplace else, just write it back out.
>> >>
>> >> Or I suppose if we had some kind of raid1 set up between memories we
>> >> could read one of the other copies and rewrite it into the failing
>> >> region immediately.
>> >
>> > Yes, that is kind of what I was hoping to accomplish via this
>> > discussion. How much would filesystems want to be involved in this sort
>> > of badblocks handling, if at all. I can refresh my patches that provide
>> > the fs notification, but that's the easy bit, and a starting point.
>> >
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replicataion/snapshot support, use a copy to recover the badblocks?
>>
>> How about operations bypass the file system, i.e. mmap?
>
> Hi Andiry,
>
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
>
> At mount the the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.
>
> Yes, if there is replication etc support in the filesystem, we could try
> to recover using that, but I haven't put much thought in that direction.
>
> Like I said in a previous reply, mmap can be a tricky case, and other
> than handling the machine check exception, there may not be anything
> else we can do..
> If the range we're faulting on has known errors in badblocks, the fault
> will fail with SIGBUS (see where pmem_direct_access() fails due to
> badblocks). For latent errors that are not known in badblocks, if the
> platform has MCE recovery, there is an MCE handler for pmem currently,
> that will add that address to badblocks. If MCE recovery is absent, then
> the system will crash/reboot, and the next time the driver populates
> badblocks, that address will appear in it.

Thank you for reply. That is very clear.

>>
>> >>
>> >> > > > A while back, Dave Chinner had suggested a move towards smarter
>> >> > > > handling, and I posted initial RFC patches [1], but since then the topic
>> >> > > > hasn't really moved forward.
>> >> > > >
>> >> > > > I'd like to propose and have a discussion about the following new
>> >> > > > functionality:
>> >> > > >
>> >> > > > 1. Filesystems develop a native representation of badblocks. For
>> >> > > > example, in xfs, this would (presumably) be linked to the reverse
>> >> > > > mapping btree. The filesystem representation has the potential to be
>> >> > > > more efficient than the block driver doing the check, as the fs can
>> >> > > > check the IO happening on a file against just that file's range.
>> >>
>> >> OTOH that means we'd have to check /every/ file IO request against the
>> >> rmapbt, which will make things reaaaaaally slow.  I suspect it might be
>> >> preferable just to let the underlying pmem driver throw an error at us.
>> >>
>> >> (Or possibly just cache the bad extents in memory.)
>> >
>> > Interesting - this would be a good discussion to have. My motivation for
>> > this was the reasoning that the pmem driver has to check every single IO
>> > against badblocks, and maybe the fs can do a better job. But if you
>> > think the fs will actually be slower, we should try to somehow benchmark
>> > that!
>> >
>> >>
>> >> > > What do you mean by "file system can check the IO happening on a file"?
>> >> > > Do you mean read or write operation? What's about metadata?
>> >> >
>> >> > For the purpose described above, i.e. returning early EIOs when
>> >> > possible, this will be limited to reads and metadata reads. If we're
>> >> > about to do a metadata read, and realize the block(s) about to be read
>> >> > are on the badblocks list, then we do the same thing as when we discover
>> >> > other kinds of metadata corruption.
>> >>
>> >> ...fail and shut down? :)
>> >>
>> >> Actually, for metadata either we look at the xfs_bufs to see if it's in
>> >> memory (XFS doesn't directly access metadata) and write it back out; or
>> >> we could fire up the online repair tool to rebuild the metadata.
>> >
>> > Agreed, I was just stressing that this scenario does not change from
>> > status quo, and really recovering from corruption isn't the problem
>> > we're trying to solve here :)
>> >
>> >>
>> >> > > If we are talking about the discovering a bad block on read operation then
>> >> > > rare modern file system is able to survive as for the case of metadata as
>> >> > > for the case of user data. Let's imagine that we have really mature file
>> >> > > system driver then what does it mean to encounter a bad block? The failure
>> >> > > to read a logical block of some metadata (bad block) means that we are
>> >> > > unable to extract some part of a metadata structure. From file system
>> >> > > driver point of view, it looks like that our file system is corrupted, we need
>> >> > > to stop the file system operations and, finally, to check and recover file
>> >> > > system volume by means of fsck tool. If we find a bad block for some
>> >> > > user file then, again, it looks like an issue. Some file systems simply
>> >> > > return "unrecovered read error". Another one, theoretically, is able
>> >> > > to survive because of snapshots, for example. But, anyway, it will look
>> >> > > like as Read-Only mount state and the user will need to resolve such
>> >> > > trouble by hands.
>> >> >
>> >> > As far as I can tell, all of these things remain the same. The goal here
>> >> > isn't to survive more NVM badblocks than we would've before, and lost
>> >> > data or lost metadata will continue to have the same consequences as
>> >> > before, and will need the same recovery actions/intervention as before.
>> >> > The goal is to make the failure model similar to what users expect
>> >> > today, and as much as possible make recovery actions too similarly
>> >> > intuitive.
>> >> >
>> >> > >
>> >> > > If we are talking about discovering a bad block during write operation then,
>> >> > > again, we are in trouble. Usually, we are using asynchronous model
>> >> > > of write/flush operation. We are preparing the consistent state of all our
>> >> > > metadata structures in the memory, at first. The flush operations for metadata
>> >> > > and user data can be done in different times. And what should be done if we
>> >> > > discover bad block for any piece of metadata or user data? Simple tracking of
>> >> > > bad blocks is not enough at all. Let's consider user data, at first. If we cannot
>> >> > > write some file's block successfully then we have two ways: (1) forget about
>> >> > > this piece of data; (2) try to change the associated LBA for this piece of data.
>> >> > > The operation of re-allocation LBA number for discovered bad block
>> >> > > (user data case) sounds as real pain. Because you need to rebuild the metadata
>> >> > > that track the location of this part of file. And it sounds as practically
>> >> > > impossible operation, for the case of LFS file system, for example.
>> >> > > If we have trouble with flushing any part of metadata then it sounds as
>> >> > > complete disaster for any file system.
>> >> >
>> >> > Writes can get more complicated in certain cases. If it is a regular
>> >> > page cache writeback, or any aligned write that goes through the block
>> >> > driver, that is completely fine. The block driver will check that the
>> >> > block was previously marked as bad, do a "clear poison" operation
>> >> > (defined in the ACPI spec), which tells the firmware that the poison bit
>> >> > is not OK to be cleared, and writes the new data. This also removes the
>> >> > block from the badblocks list, and in this scheme, triggers a
>> >> > notification to the filesystem that it too can remove the block from its
>> >> > accounting. mmap writes and DAX can get more complicated, and at times
>> >> > they will just trigger a SIGBUS, and there's no way around that.
>> >> >
>> >> > >
>> >> > > Are you really sure that file system should process bad block issue?
>> >> > >
>> >> > > >In contrast, today, the block driver checks against the whole block device
>> >> > > > range for every IO. On encountering badblocks, the filesystem can
>> >> > > > generate a better notification/error message that points the user to
>> >> > > > (file, offset) as opposed to the block driver, which can only provide
>> >> > > > (block-device, sector).
>> >>
>> >> <shrug> We can do the translation with the backref info...
>> >
>> > Yes we should at least do that. I'm guessing this would happen in XFS
>> > when it gets an EIO from an IO submission? The bio submission path in
>> > the fs is probably not synchronous (correct?), but whenever it gets the
>> > EIO, I'm guessing we just print a loud error message after doing the
>> > backref lookup..
>> >
>> >>
>> >> > > > 2. The block layer adds a notifier to badblock addition/removal
>> >> > > > operations, which the filesystem subscribes to, and uses to maintain its
>> >> > > > badblocks accounting. (This part is implemented as a proof of concept in
>> >> > > > the RFC mentioned above [1]).
>> >> > >
>> >> > > I am not sure that any bad block notification during/after IO operation
>> >> > > is valuable for file system. Maybe, it could help if file system simply will
>> >> > > know about bad block beforehand the operation of logical block allocation.
>> >> > > But what subsystem will discover bad blocks before any IO operations?
>> >> > > How file system will receive information or some bad block table?
>> >> >
>> >> > The driver populates its badblocks lists whenever an Address Range Scrub
>> >> > is started (also via ACPI methods). This is always done at
>> >> > initialization time, so that it can build an in-memory representation of
>> >> > the badblocks. Additionally, this can also be triggered manually. And
>> >> > finally badblocks can also get populated for new latent errors when a
>> >> > machine check exception occurs. All of these can trigger notification to
>> >> > the file system without actual user reads happening.
>> >> >
>> >> > > I am not convinced that suggested badblocks approach is really feasible.
>> >> > > Also I am not sure that file system should see the bad blocks at all.
>> >> > > Why hardware cannot manage this issue for us?
>> >> >
>> >> > Hardware does manage the actual badblocks issue for us in the sense that
>> >> > when it discovers a badblock it will do the remapping. But since this is
>> >> > on the memory bus, and has different error signatures than applications
>> >> > are used to, we want to make the error handling similar to the existing
>> >> > storage model.
>> >>
>> >> Yes please and thank you, to the "error handling similar to the existing
>> >> storage model".  Even better if this just gets added to a layer
>> >> underneath the fs so that IO to bad regions returns EIO. 8-)
>> >
>> > This (if this just gets added to a layer underneath the fs so that IO to bad
>> > regions returns EIO) already happens :)  See pmem_do_bvec() in
>> > drivers/nvdimm/pmem.c, where we return EIO for a known badblock on a
>> > read. I'm wondering if this can be improved..
>> >
>>
>> The pmem_do_bvec() read logic is like this:
>>
>> pmem_do_bvec()
>>     if (is_bad_pmem())
>>         return -EIO;
>>     else
>>         memcpy_from_pmem();
>>
>> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> that even if a block is not in the badblock list, it still can be bad
>> and causes MCE? Does the badblock list get changed during file system
>> running? If that is the case, should the file system get a
>> notification when it gets changed? If a block is good when I first
>> read it, can I still trust it to be good for the second access?
>
> Yes, if a block is not in the badblocks list, it can still cause an
> MCE. This is the latent error case I described above. For a simple read()
> via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> an MCE is inevitable.
>
> Yes the badblocks list may change while a filesystem is running. The RFC
> patches[1] I linked to add a notification for the filesystem when this
> happens.
>

This is really bad and it makes file system implementation much more
complicated. And badblock notification does not help very much,
because any block can be bad potentially, no matter it is in badblock
list or not. And file system has to perform checking for every read,
using memcpy_mcsafe. This is disaster for file system like NOVA, which
uses pointer de-reference to access data structures on pmem. Now if I
want to read a field in an inode on pmem, I have to copy it to DRAM
first and make sure memcpy_mcsafe() does not report anything wrong.

> No, if the media, for some reason, 'dvelops' a bad cell, a second
> consecutive read does have a chance of being bad. Once a location has
> been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> been called to mark it as clean.
>

I wonder what happens to write in this case? If a block is bad but not
reported in badblock list. Now I write to it without reading first. Do
I clear the poison with the write? Or still require a ACPI DSM?

> [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>

Thank you for the patchset. I will look into it.

Thanks,
Andiry

>
>>
>> Thanks,
>> Andiry
>>
>> >>
>> >> (Sleeeeep...)
>> >>
>> >> --D
>> >>
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 23:20                 ` Andiry Xu
@ 2017-01-17 23:51                   ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 23:51 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On 01/17, Andiry Xu wrote:

<snip>

> >>
> >> The pmem_do_bvec() read logic is like this:
> >>
> >> pmem_do_bvec()
> >>     if (is_bad_pmem())
> >>         return -EIO;
> >>     else
> >>         memcpy_from_pmem();
> >>
> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> >> that even if a block is not in the badblock list, it still can be bad
> >> and causes MCE? Does the badblock list get changed during file system
> >> running? If that is the case, should the file system get a
> >> notification when it gets changed? If a block is good when I first
> >> read it, can I still trust it to be good for the second access?
> >
> > Yes, if a block is not in the badblocks list, it can still cause an
> > MCE. This is the latent error case I described above. For a simple read()
> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> > an MCE is inevitable.
> >
> > Yes the badblocks list may change while a filesystem is running. The RFC
> > patches[1] I linked to add a notification for the filesystem when this
> > happens.
> >
> 
> This is really bad and it makes file system implementation much more
> complicated. And badblock notification does not help very much,
> because any block can be bad potentially, no matter it is in badblock
> list or not. And file system has to perform checking for every read,
> using memcpy_mcsafe. This is disaster for file system like NOVA, which
> uses pointer de-reference to access data structures on pmem. Now if I
> want to read a field in an inode on pmem, I have to copy it to DRAM
> first and make sure memcpy_mcsafe() does not report anything wrong.

You have a good point, and I don't know if I have an answer for this..
Assuming a system with MCE recovery, maybe NOVA can add a mce handler
similar to nfit_handle_mce(), and handle errors as they happen, but I'm
being very hand-wavey here and don't know how much/how well that might
work..

> 
> > No, if the media, for some reason, 'dvelops' a bad cell, a second
> > consecutive read does have a chance of being bad. Once a location has
> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> > been called to mark it as clean.
> >
> 
> I wonder what happens to write in this case? If a block is bad but not
> reported in badblock list. Now I write to it without reading first. Do
> I clear the poison with the write? Or still require a ACPI DSM?

With writes, my understanding is there is still a possibility that an
internal read-modify-write can happen, and cause a MCE (this is the same
as writing to a bad DRAM cell, which can also cause an MCE). You can't
really use the ACPI DSM preemptively because you don't know whether the
location was bad. The error flow will be something like write causes the
MCE, a badblock gets added (either through the mce handler or after the
next reboot), and the recovery path is now the same as a regular badblock.

> 
> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> >
> 
> Thank you for the patchset. I will look into it.
> 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-17 23:51                   ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-17 23:51 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

On 01/17, Andiry Xu wrote:

<snip>

> >>
> >> The pmem_do_bvec() read logic is like this:
> >>
> >> pmem_do_bvec()
> >>     if (is_bad_pmem())
> >>         return -EIO;
> >>     else
> >>         memcpy_from_pmem();
> >>
> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
> >> that even if a block is not in the badblock list, it still can be bad
> >> and causes MCE? Does the badblock list get changed during file system
> >> running? If that is the case, should the file system get a
> >> notification when it gets changed? If a block is good when I first
> >> read it, can I still trust it to be good for the second access?
> >
> > Yes, if a block is not in the badblocks list, it can still cause an
> > MCE. This is the latent error case I described above. For a simple read()
> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
> > an MCE is inevitable.
> >
> > Yes the badblocks list may change while a filesystem is running. The RFC
> > patches[1] I linked to add a notification for the filesystem when this
> > happens.
> >
> 
> This is really bad and it makes file system implementation much more
> complicated. And badblock notification does not help very much,
> because any block can be bad potentially, no matter it is in badblock
> list or not. And file system has to perform checking for every read,
> using memcpy_mcsafe. This is disaster for file system like NOVA, which
> uses pointer de-reference to access data structures on pmem. Now if I
> want to read a field in an inode on pmem, I have to copy it to DRAM
> first and make sure memcpy_mcsafe() does not report anything wrong.

You have a good point, and I don't know if I have an answer for this..
Assuming a system with MCE recovery, maybe NOVA can add a mce handler
similar to nfit_handle_mce(), and handle errors as they happen, but I'm
being very hand-wavey here and don't know how much/how well that might
work..

> 
> > No, if the media, for some reason, 'dvelops' a bad cell, a second
> > consecutive read does have a chance of being bad. Once a location has
> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
> > been called to mark it as clean.
> >
> 
> I wonder what happens to write in this case? If a block is bad but not
> reported in badblock list. Now I write to it without reading first. Do
> I clear the poison with the write? Or still require a ACPI DSM?

With writes, my understanding is there is still a possibility that an
internal read-modify-write can happen, and cause a MCE (this is the same
as writing to a bad DRAM cell, which can also cause an MCE). You can't
really use the ACPI DSM preemptively because you don't know whether the
location was bad. The error flow will be something like write causes the
MCE, a badblock gets added (either through the mce handler or after the
next reboot), and the recovery path is now the same as a regular badblock.

> 
> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> >
> 
> Thank you for the patchset. I will look into it.
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:15             ` Andiry Xu
  (?)
  (?)
@ 2017-01-18  0:16             ` Andreas Dilger
  2017-01-18  2:01                 ` Andiry Xu
  -1 siblings, 1 reply; 89+ messages in thread
From: Andreas Dilger @ 2017-01-18  0:16 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Vishal Verma, Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

[-- Attachment #1: Type: text/plain, Size: 5863 bytes --]

On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>> On 01/16, Darrick J. Wong wrote:
>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>> On 01/14, Slava Dubeyko wrote:
>>>>> 
>>>>> ---- Original Message ----
>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>>>>> 
>>>>>> The current implementation of badblocks, where we consult the
>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>> easiest interface to work with.
>>>>> 
>>>>> As I remember, FAT and HFS+ specifications contain description of bad blocks
>>>>> (physical sectors) table. I believe that this table was used for the case of
>>>>> floppy media. But, finally, this table becomes to be the completely obsolete
>>>>> artefact because mostly storage devices are reliably enough. Why do you need
>>> 
>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>> vestigial organ at this point.  XFS doesn't have anything to track bad
>>> blocks currently....
>>> 
>>>>> in exposing the bad blocks on the file system level?  Do you expect that next
>>>>> generation of NVM memory will be so unreliable that file system needs to manage
>>>>> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>>>>> from the bad block issue?
>>>>> 
>>>>> Usually, we are using LBAs and it is the responsibility of storage device to map
>>>>> a bad physical block/page/sector into valid one. Do you mean that we have
>>>>> access to physical NVM memory address directly? But it looks like that we can
>>>>> have a "bad block" issue even we will access data into page cache's memory
>>>>> page (if we will use NVM memory for page cache, of course). So, what do you
>>>>> imply by "bad block" issue?
>>>> 
>>>> We don't have direct physical access to the device's address space, in
>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>> underneath us. The problem is that when a block or address range (as
>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>> every affected cache line. Behind the scenes, it may have already
>>>> remapped the range, but the cache line poison has to be kept so that
>>>> there is a notification to the user/owner of the data that something has
>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes.
>>>> Compared to tradational storage where an app will get nice and friendly
>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>> catch these, the reads will turn into a memory bus access, and the
>>>> poison will cause a SIGBUS.
>>> 
>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>>> look kind of like a traditional block device? :)
>> 
>> Yes, the thing that makes pmem look like a block device :) --
>> drivers/nvdimm/pmem.c
>> 
>>> 
>>>> This effort is to try and make this badblock checking smarter - and try
>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>> filesystem can do.
>>> 
>>> Though... now that XFS merged the reverse mapping support, I've been
>>> wondering if there'll be a resubmission of the device errors callback?
>>> It still would be useful to be able to inform the user that part of
>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>> someplace else, just write it back out.
>>> 
>>> Or I suppose if we had some kind of raid1 set up between memories we
>>> could read one of the other copies and rewrite it into the failing
>>> region immediately.
>> 
>> Yes, that is kind of what I was hoping to accomplish via this
>> discussion. How much would filesystems want to be involved in this sort
>> of badblocks handling, if at all. I can refresh my patches that provide
>> the fs notification, but that's the easy bit, and a starting point.
>> 
> 
> I have some questions. Why moving badblock handling to file system
> level avoid the checking phase? In file system level for each I/O I
> still have to check the badblock list, right? Do you mean during mount
> it can go through the pmem device and locates all the data structures
> mangled by badblocks and handle them accordingly, so that during
> normal running the badblocks will never be accessed? Or, if there is
> replicataion/snapshot support, use a copy to recover the badblocks?

With ext4 badblocks, the main outcome is that the bad blocks would be
pemanently marked in the allocation bitmap as being used, and they would
never be allocated to a file, so they should never be accessed unless
doing a full device scan (which ext4 and e2fsck never do).  That would
avoid the need to check every I/O against the bad blocks list, if the
driver knows that the filesystem will handle this.

The one caveat is that ext4 only allows 32-bit block numbers in the
badblocks list, since this feature hasn't been used in a long time.
This is good for up to 16TB filesystems, but if there was a demand to
use this feature again it would be possible allow 64-bit block numbers.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 23:51                   ` Vishal Verma
@ 2017-01-18  1:58                     ` Andiry Xu
  -1 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-18  1:58 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/17, Andiry Xu wrote:
>
> <snip>
>
>> >>
>> >> The pmem_do_bvec() read logic is like this:
>> >>
>> >> pmem_do_bvec()
>> >>     if (is_bad_pmem())
>> >>         return -EIO;
>> >>     else
>> >>         memcpy_from_pmem();
>> >>
>> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> >> that even if a block is not in the badblock list, it still can be bad
>> >> and causes MCE? Does the badblock list get changed during file system
>> >> running? If that is the case, should the file system get a
>> >> notification when it gets changed? If a block is good when I first
>> >> read it, can I still trust it to be good for the second access?
>> >
>> > Yes, if a block is not in the badblocks list, it can still cause an
>> > MCE. This is the latent error case I described above. For a simple read()
>> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
>> > an MCE is inevitable.
>> >
>> > Yes the badblocks list may change while a filesystem is running. The RFC
>> > patches[1] I linked to add a notification for the filesystem when this
>> > happens.
>> >
>>
>> This is really bad and it makes file system implementation much more
>> complicated. And badblock notification does not help very much,
>> because any block can be bad potentially, no matter it is in badblock
>> list or not. And file system has to perform checking for every read,
>> using memcpy_mcsafe. This is disaster for file system like NOVA, which
>> uses pointer de-reference to access data structures on pmem. Now if I
>> want to read a field in an inode on pmem, I have to copy it to DRAM
>> first and make sure memcpy_mcsafe() does not report anything wrong.
>
> You have a good point, and I don't know if I have an answer for this..
> Assuming a system with MCE recovery, maybe NOVA can add a mce handler
> similar to nfit_handle_mce(), and handle errors as they happen, but I'm
> being very hand-wavey here and don't know how much/how well that might
> work..
>
>>
>> > No, if the media, for some reason, 'dvelops' a bad cell, a second
>> > consecutive read does have a chance of being bad. Once a location has
>> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
>> > been called to mark it as clean.
>> >
>>
>> I wonder what happens to write in this case? If a block is bad but not
>> reported in badblock list. Now I write to it without reading first. Do
>> I clear the poison with the write? Or still require a ACPI DSM?
>
> With writes, my understanding is there is still a possibility that an
> internal read-modify-write can happen, and cause a MCE (this is the same
> as writing to a bad DRAM cell, which can also cause an MCE). You can't
> really use the ACPI DSM preemptively because you don't know whether the
> location was bad. The error flow will be something like write causes the
> MCE, a badblock gets added (either through the mce handler or after the
> next reboot), and the recovery path is now the same as a regular badblock.
>

This is different from my understanding. Right now write_pmem() in
pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
clears poison and writes to pmem again. Seems to me writing to bad
blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

Thanks,
Andiry

>>
>> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>> >
>>
>> Thank you for the patchset. I will look into it.
>>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18  1:58                     ` Andiry Xu
  0 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-18  1:58 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/17, Andiry Xu wrote:
>
> <snip>
>
>> >>
>> >> The pmem_do_bvec() read logic is like this:
>> >>
>> >> pmem_do_bvec()
>> >>     if (is_bad_pmem())
>> >>         return -EIO;
>> >>     else
>> >>         memcpy_from_pmem();
>> >>
>> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply
>> >> that even if a block is not in the badblock list, it still can be bad
>> >> and causes MCE? Does the badblock list get changed during file system
>> >> running? If that is the case, should the file system get a
>> >> notification when it gets changed? If a block is good when I first
>> >> read it, can I still trust it to be good for the second access?
>> >
>> > Yes, if a block is not in the badblocks list, it can still cause an
>> > MCE. This is the latent error case I described above. For a simple read()
>> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap,
>> > an MCE is inevitable.
>> >
>> > Yes the badblocks list may change while a filesystem is running. The RFC
>> > patches[1] I linked to add a notification for the filesystem when this
>> > happens.
>> >
>>
>> This is really bad and it makes file system implementation much more
>> complicated. And badblock notification does not help very much,
>> because any block can be bad potentially, no matter it is in badblock
>> list or not. And file system has to perform checking for every read,
>> using memcpy_mcsafe. This is disaster for file system like NOVA, which
>> uses pointer de-reference to access data structures on pmem. Now if I
>> want to read a field in an inode on pmem, I have to copy it to DRAM
>> first and make sure memcpy_mcsafe() does not report anything wrong.
>
> You have a good point, and I don't know if I have an answer for this..
> Assuming a system with MCE recovery, maybe NOVA can add a mce handler
> similar to nfit_handle_mce(), and handle errors as they happen, but I'm
> being very hand-wavey here and don't know how much/how well that might
> work..
>
>>
>> > No, if the media, for some reason, 'dvelops' a bad cell, a second
>> > consecutive read does have a chance of being bad. Once a location has
>> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has
>> > been called to mark it as clean.
>> >
>>
>> I wonder what happens to write in this case? If a block is bad but not
>> reported in badblock list. Now I write to it without reading first. Do
>> I clear the poison with the write? Or still require a ACPI DSM?
>
> With writes, my understanding is there is still a possibility that an
> internal read-modify-write can happen, and cause a MCE (this is the same
> as writing to a bad DRAM cell, which can also cause an MCE). You can't
> really use the ACPI DSM preemptively because you don't know whether the
> location was bad. The error flow will be something like write causes the
> MCE, a badblock gets added (either through the mce handler or after the
> next reboot), and the recovery path is now the same as a regular badblock.
>

This is different from my understanding. Right now write_pmem() in
pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
clears poison and writes to pmem again. Seems to me writing to bad
blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

Thanks,
Andiry

>>
>> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
>> >
>>
>> Thank you for the patchset. I will look into it.
>>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  0:16             ` Andreas Dilger
@ 2017-01-18  2:01                 ` Andiry Xu
  0 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-18  2:01 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, FAT and HFS+ specifications contain description of bad blocks
>>>>>> (physical sectors) table. I believe that this table was used for the case of
>>>>>> floppy media. But, finally, this table becomes to be the completely obsolete
>>>>>> artefact because mostly storage devices are reliably enough. Why do you need
>>>>
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point.  XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> in exposing the bad blocks on the file system level?  Do you expect that next
>>>>>> generation of NVM memory will be so unreliable that file system needs to manage
>>>>>> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>>>>>> from the bad block issue?
>>>>>>
>>>>>> Usually, we are using LBAs and it is the responsibility of storage device to map
>>>>>> a bad physical block/page/sector into valid one. Do you mean that we have
>>>>>> access to physical NVM memory address directly? But it looks like that we can
>>>>>> have a "bad block" issue even we will access data into page cache's memory
>>>>>> page (if we will use NVM memory for page cache, of course). So, what do you
>>>>>> imply by "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space, in
>>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>>> underneath us. The problem is that when a block or address range (as
>>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>>> every affected cache line. Behind the scenes, it may have already
>>>>> remapped the range, but the cache line poison has to be kept so that
>>>>> there is a notification to the user/owner of the data that something has
>>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes.
>>>>> Compared to tradational storage where an app will get nice and friendly
>>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
>>>>
>>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
>>>
>>>>
>>>>> This effort is to try and make this badblock checking smarter - and try
>>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>>> filesystem can do.
>>>>
>>>> Though... now that XFS merged the reverse mapping support, I've been
>>>> wondering if there'll be a resubmission of the device errors callback?
>>>> It still would be useful to be able to inform the user that part of
>>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>>> someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion. How much would filesystems want to be involved in this sort
>>> of badblocks handling, if at all. I can refresh my patches that provide
>>> the fs notification, but that's the easy bit, and a starting point.
>>>
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replicataion/snapshot support, use a copy to recover the badblocks?
>
> With ext4 badblocks, the main outcome is that the bad blocks would be
> pemanently marked in the allocation bitmap as being used, and they would
> never be allocated to a file, so they should never be accessed unless
> doing a full device scan (which ext4 and e2fsck never do).  That would
> avoid the need to check every I/O against the bad blocks list, if the
> driver knows that the filesystem will handle this.
>

Thank you for explanation. However this only works for free blocks,
right? What about allocated blocks, like file data and metadata?

Thanks,
Andiry

> The one caveat is that ext4 only allows 32-bit block numbers in the
> badblocks list, since this feature hasn't been used in a long time.
> This is good for up to 16TB filesystems, but if there was a demand to
> use this feature again it would be possible allow 64-bit block numbers.
>
> Cheers, Andreas
>
>
>
>
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18  2:01                 ` Andiry Xu
  0 siblings, 0 replies; 89+ messages in thread
From: Andiry Xu @ 2017-01-18  2:01 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Vishal Verma, Darrick J. Wong, Slava Dubeyko, lsf-pc,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko

On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
>>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, FAT and HFS+ specifications contain description of bad blocks
>>>>>> (physical sectors) table. I believe that this table was used for the case of
>>>>>> floppy media. But, finally, this table becomes to be the completely obsolete
>>>>>> artefact because mostly storage devices are reliably enough. Why do you need
>>>>
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point.  XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> in exposing the bad blocks on the file system level?  Do you expect that next
>>>>>> generation of NVM memory will be so unreliable that file system needs to manage
>>>>>> bad blocks? What's about erasure coding schemes? Do file system really need to suffer
>>>>>> from the bad block issue?
>>>>>>
>>>>>> Usually, we are using LBAs and it is the responsibility of storage device to map
>>>>>> a bad physical block/page/sector into valid one. Do you mean that we have
>>>>>> access to physical NVM memory address directly? But it looks like that we can
>>>>>> have a "bad block" issue even we will access data into page cache's memory
>>>>>> page (if we will use NVM memory for page cache, of course). So, what do you
>>>>>> imply by "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space, in
>>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>>> underneath us. The problem is that when a block or address range (as
>>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>>> every affected cache line. Behind the scenes, it may have already
>>>>> remapped the range, but the cache line poison has to be kept so that
>>>>> there is a notification to the user/owner of the data that something has
>>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes.
>>>>> Compared to tradational storage where an app will get nice and friendly
>>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
>>>>
>>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
>>>
>>>>
>>>>> This effort is to try and make this badblock checking smarter - and try
>>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>>> filesystem can do.
>>>>
>>>> Though... now that XFS merged the reverse mapping support, I've been
>>>> wondering if there'll be a resubmission of the device errors callback?
>>>> It still would be useful to be able to inform the user that part of
>>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>>> someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion. How much would filesystems want to be involved in this sort
>>> of badblocks handling, if at all. I can refresh my patches that provide
>>> the fs notification, but that's the easy bit, and a starting point.
>>>
>>
>> I have some questions. Why moving badblock handling to file system
>> level avoid the checking phase? In file system level for each I/O I
>> still have to check the badblock list, right? Do you mean during mount
>> it can go through the pmem device and locates all the data structures
>> mangled by badblocks and handle them accordingly, so that during
>> normal running the badblocks will never be accessed? Or, if there is
>> replicataion/snapshot support, use a copy to recover the badblocks?
>
> With ext4 badblocks, the main outcome is that the bad blocks would be
> pemanently marked in the allocation bitmap as being used, and they would
> never be allocated to a file, so they should never be accessed unless
> doing a full device scan (which ext4 and e2fsck never do).  That would
> avoid the need to check every I/O against the bad blocks list, if the
> driver knows that the filesystem will handle this.
>

Thank you for explanation. However this only works for free blocks,
right? What about allocated blocks, like file data and metadata?

Thanks,
Andiry

> The one caveat is that ext4 only allows 32-bit block numbers in the
> badblocks list, since this feature hasn't been used in a long time.
> This is good for up to 16TB filesystems, but if there was a demand to
> use this feature again it would be possible allow 64-bit block numbers.
>
> Cheers, Andreas
>
>
>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18  3:08                   ` Lu Zhang
  0 siblings, 0 replies; 89+ messages in thread
From: Lu Zhang @ 2017-01-18  3:08 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

I'm curious about the fault model and corresponding hardware ECC mechanisms
for NVDIMMs. In my understanding for memory accesses to trigger MCE, it
means the memory controller finds a detectable but uncorrectable error
(DUE). So if there is no hardware ECC support the media errors won't even
be noticed, not to mention badblocks or machine checks.

Current hardware ECC support for DRAM usually employs (72, 64) single-bit
error correction mechanism, and for advanced ECCs there are techniques like
Chipkill or SDDC which can tolerate a single DRAM chip failure. What is the
expected ECC mode for NVDIMMs, assuming that PCM or 3dXpoint based
technology might have higher error rates?

If DUE does happen and is flagged to the file system via MCE (somehow...),
and the fs finds that the error corrupts its allocated data page, or
metadata, now if the fs wants to recover its data the intuition is that
there needs to be a stronger error correction mechanism to correct the
hardware-uncorrectable errors. So knowing the hardware ECC baseline is
helpful for the file system to understand how severe are the faults in
badblocks, and develop its recovery methods.

Regards,
Lu

On Tue, Jan 17, 2017 at 6:01 PM, Andiry Xu <andiry@gmail.com> wrote:

> On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> >> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com>
> wrote:
> >>> On 01/16, Darrick J. Wong wrote:
> >>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> >>>>> On 01/14, Slava Dubeyko wrote:
> >>>>>>
> >>>>>> ---- Original Message ----
> >>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in
> filesystems
> >>>>>> Sent: Jan 13, 2017 1:40 PM
> >>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> >>>>>> To: lsf-pc@lists.linux-foundation.org
> >>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
> linux-fsdevel@vger.kernel.org
> >>>>>>
> >>>>>>> The current implementation of badblocks, where we consult the
> >>>>>>> badblocks list for every IO in the block driver works, and is a
> >>>>>>> last option failsafe, but from a user perspective, it isn't the
> >>>>>>> easiest interface to work with.
> >>>>>>
> >>>>>> As I remember, FAT and HFS+ specifications contain description of
> bad blocks
> >>>>>> (physical sectors) table. I believe that this table was used for
> the case of
> >>>>>> floppy media. But, finally, this table becomes to be the completely
> obsolete
> >>>>>> artefact because mostly storage devices are reliably enough. Why do
> you need
> >>>>
> >>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR
> it
> >>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
> >>>> vestigial organ at this point.  XFS doesn't have anything to track bad
> >>>> blocks currently....
> >>>>
> >>>>>> in exposing the bad blocks on the file system level?  Do you expect
> that next
> >>>>>> generation of NVM memory will be so unreliable that file system
> needs to manage
> >>>>>> bad blocks? What's about erasure coding schemes? Do file system
> really need to suffer
> >>>>>> from the bad block issue?
> >>>>>>
> >>>>>> Usually, we are using LBAs and it is the responsibility of storage
> device to map
> >>>>>> a bad physical block/page/sector into valid one. Do you mean that
> we have
> >>>>>> access to physical NVM memory address directly? But it looks like
> that we can
> >>>>>> have a "bad block" issue even we will access data into page cache's
> memory
> >>>>>> page (if we will use NVM memory for page cache, of course). So,
> what do you
> >>>>>> imply by "bad block" issue?
> >>>>>
> >>>>> We don't have direct physical access to the device's address space,
> in
> >>>>> the sense the device is still free to perform remapping of chunks of
> NVM
> >>>>> underneath us. The problem is that when a block or address range (as
> >>>>> small as a cache line) goes bad, the device maintains a poison bit
> for
> >>>>> every affected cache line. Behind the scenes, it may have already
> >>>>> remapped the range, but the cache line poison has to be kept so that
> >>>>> there is a notification to the user/owner of the data that something
> has
> >>>>> been lost. Since NVM is byte addressable memory sitting on the memory
> >>>>> bus, such a poisoned cache line results in memory errors and
> SIGBUSes.
> >>>>> Compared to tradational storage where an app will get nice and
> friendly
> >>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
> >>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> >>>>> locations, and short-circuit them with an EIO. If the driver doesn't
> >>>>> catch these, the reads will turn into a memory bus access, and the
> >>>>> poison will cause a SIGBUS.
> >>>>
> >>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> >>>> look kind of like a traditional block device? :)
> >>>
> >>> Yes, the thing that makes pmem look like a block device :) --
> >>> drivers/nvdimm/pmem.c
> >>>
> >>>>
> >>>>> This effort is to try and make this badblock checking smarter - and
> try
> >>>>> and reduce the penalty on every IO to a smaller range, which only the
> >>>>> filesystem can do.
> >>>>
> >>>> Though... now that XFS merged the reverse mapping support, I've been
> >>>> wondering if there'll be a resubmission of the device errors callback?
> >>>> It still would be useful to be able to inform the user that part of
> >>>> their fs has gone bad, or, better yet, if the buffer is still in
> memory
> >>>> someplace else, just write it back out.
> >>>>
> >>>> Or I suppose if we had some kind of raid1 set up between memories we
> >>>> could read one of the other copies and rewrite it into the failing
> >>>> region immediately.
> >>>
> >>> Yes, that is kind of what I was hoping to accomplish via this
> >>> discussion. How much would filesystems want to be involved in this sort
> >>> of badblocks handling, if at all. I can refresh my patches that provide
> >>> the fs notification, but that's the easy bit, and a starting point.
> >>>
> >>
> >> I have some questions. Why moving badblock handling to file system
> >> level avoid the checking phase? In file system level for each I/O I
> >> still have to check the badblock list, right? Do you mean during mount
> >> it can go through the pmem device and locates all the data structures
> >> mangled by badblocks and handle them accordingly, so that during
> >> normal running the badblocks will never be accessed? Or, if there is
> >> replicataion/snapshot support, use a copy to recover the badblocks?
> >
> > With ext4 badblocks, the main outcome is that the bad blocks would be
> > pemanently marked in the allocation bitmap as being used, and they would
> > never be allocated to a file, so they should never be accessed unless
> > doing a full device scan (which ext4 and e2fsck never do).  That would
> > avoid the need to check every I/O against the bad blocks list, if the
> > driver knows that the filesystem will handle this.
> >
>
> Thank you for explanation. However this only works for free blocks,
> right? What about allocated blocks, like file data and metadata?
>
> Thanks,
> Andiry
>
> > The one caveat is that ext4 only allows 32-bit block numbers in the
> > badblocks list, since this feature hasn't been used in a long time.
> > This is good for up to 16TB filesystems, but if there was a demand to
> > use this feature again it would be possible allow 64-bit block numbers.
> >
> > Cheers, Andreas
> >
> >
> >
> >
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18  3:08                   ` Lu Zhang
  0 siblings, 0 replies; 89+ messages in thread
From: Lu Zhang @ 2017-01-18  3:08 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Viacheslav Dubeyko,
	Linux FS Devel,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

I'm curious about the fault model and corresponding hardware ECC mechanisms
for NVDIMMs. In my understanding for memory accesses to trigger MCE, it
means the memory controller finds a detectable but uncorrectable error
(DUE). So if there is no hardware ECC support the media errors won't even
be noticed, not to mention badblocks or machine checks.

Current hardware ECC support for DRAM usually employs (72, 64) single-bit
error correction mechanism, and for advanced ECCs there are techniques like
Chipkill or SDDC which can tolerate a single DRAM chip failure. What is the
expected ECC mode for NVDIMMs, assuming that PCM or 3dXpoint based
technology might have higher error rates?

If DUE does happen and is flagged to the file system via MCE (somehow...),
and the fs finds that the error corrupts its allocated data page, or
metadata, now if the fs wants to recover its data the intuition is that
there needs to be a stronger error correction mechanism to correct the
hardware-uncorrectable errors. So knowing the hardware ECC baseline is
helpful for the file system to understand how severe are the faults in
badblocks, and develop its recovery methods.

Regards,
Lu

On Tue, Jan 17, 2017 at 6:01 PM, Andiry Xu <andiry-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org> wrote:
> > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> wrote:
> >>> On 01/16, Darrick J. Wong wrote:
> >>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> >>>>> On 01/14, Slava Dubeyko wrote:
> >>>>>>
> >>>>>> ---- Original Message ----
> >>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in
> filesystems
> >>>>>> Sent: Jan 13, 2017 1:40 PM
> >>>>>> From: "Verma, Vishal L" <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >>>>>> To: lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> >>>>>> Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org, linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
> linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>>>>
> >>>>>>> The current implementation of badblocks, where we consult the
> >>>>>>> badblocks list for every IO in the block driver works, and is a
> >>>>>>> last option failsafe, but from a user perspective, it isn't the
> >>>>>>> easiest interface to work with.
> >>>>>>
> >>>>>> As I remember, FAT and HFS+ specifications contain description of
> bad blocks
> >>>>>> (physical sectors) table. I believe that this table was used for
> the case of
> >>>>>> floppy media. But, finally, this table becomes to be the completely
> obsolete
> >>>>>> artefact because mostly storage devices are reliably enough. Why do
> you need
> >>>>
> >>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR
> it
> >>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
> >>>> vestigial organ at this point.  XFS doesn't have anything to track bad
> >>>> blocks currently....
> >>>>
> >>>>>> in exposing the bad blocks on the file system level?  Do you expect
> that next
> >>>>>> generation of NVM memory will be so unreliable that file system
> needs to manage
> >>>>>> bad blocks? What's about erasure coding schemes? Do file system
> really need to suffer
> >>>>>> from the bad block issue?
> >>>>>>
> >>>>>> Usually, we are using LBAs and it is the responsibility of storage
> device to map
> >>>>>> a bad physical block/page/sector into valid one. Do you mean that
> we have
> >>>>>> access to physical NVM memory address directly? But it looks like
> that we can
> >>>>>> have a "bad block" issue even we will access data into page cache's
> memory
> >>>>>> page (if we will use NVM memory for page cache, of course). So,
> what do you
> >>>>>> imply by "bad block" issue?
> >>>>>
> >>>>> We don't have direct physical access to the device's address space,
> in
> >>>>> the sense the device is still free to perform remapping of chunks of
> NVM
> >>>>> underneath us. The problem is that when a block or address range (as
> >>>>> small as a cache line) goes bad, the device maintains a poison bit
> for
> >>>>> every affected cache line. Behind the scenes, it may have already
> >>>>> remapped the range, but the cache line poison has to be kept so that
> >>>>> there is a notification to the user/owner of the data that something
> has
> >>>>> been lost. Since NVM is byte addressable memory sitting on the memory
> >>>>> bus, such a poisoned cache line results in memory errors and
> SIGBUSes.
> >>>>> Compared to tradational storage where an app will get nice and
> friendly
> >>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
> >>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> >>>>> locations, and short-circuit them with an EIO. If the driver doesn't
> >>>>> catch these, the reads will turn into a memory bus access, and the
> >>>>> poison will cause a SIGBUS.
> >>>>
> >>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> >>>> look kind of like a traditional block device? :)
> >>>
> >>> Yes, the thing that makes pmem look like a block device :) --
> >>> drivers/nvdimm/pmem.c
> >>>
> >>>>
> >>>>> This effort is to try and make this badblock checking smarter - and
> try
> >>>>> and reduce the penalty on every IO to a smaller range, which only the
> >>>>> filesystem can do.
> >>>>
> >>>> Though... now that XFS merged the reverse mapping support, I've been
> >>>> wondering if there'll be a resubmission of the device errors callback?
> >>>> It still would be useful to be able to inform the user that part of
> >>>> their fs has gone bad, or, better yet, if the buffer is still in
> memory
> >>>> someplace else, just write it back out.
> >>>>
> >>>> Or I suppose if we had some kind of raid1 set up between memories we
> >>>> could read one of the other copies and rewrite it into the failing
> >>>> region immediately.
> >>>
> >>> Yes, that is kind of what I was hoping to accomplish via this
> >>> discussion. How much would filesystems want to be involved in this sort
> >>> of badblocks handling, if at all. I can refresh my patches that provide
> >>> the fs notification, but that's the easy bit, and a starting point.
> >>>
> >>
> >> I have some questions. Why moving badblock handling to file system
> >> level avoid the checking phase? In file system level for each I/O I
> >> still have to check the badblock list, right? Do you mean during mount
> >> it can go through the pmem device and locates all the data structures
> >> mangled by badblocks and handle them accordingly, so that during
> >> normal running the badblocks will never be accessed? Or, if there is
> >> replicataion/snapshot support, use a copy to recover the badblocks?
> >
> > With ext4 badblocks, the main outcome is that the bad blocks would be
> > pemanently marked in the allocation bitmap as being used, and they would
> > never be allocated to a file, so they should never be accessed unless
> > doing a full device scan (which ext4 and e2fsck never do).  That would
> > avoid the need to check every I/O against the bad blocks list, if the
> > driver knows that the filesystem will handle this.
> >
>
> Thank you for explanation. However this only works for free blocks,
> right? What about allocated blocks, like file data and metadata?
>
> Thanks,
> Andiry
>
> > The one caveat is that ext4 only allows 32-bit block numbers in the
> > badblocks list, since this feature hasn't been used in a long time.
> > This is good for up to 16TB filesystems, but if there was a demand to
> > use this feature again it would be possible allow 64-bit block numbers.
> >
> > Cheers, Andreas
> >
> >
> >
> >
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:37               ` Vishal Verma
@ 2017-01-18  9:38                 ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-18  9:38 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Linux FS Devel, Viacheslav Dubeyko, Andiry Xu,
	lsf-pc

On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
> 
> At mount the the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.

So I believe we have to distinguish three cases so that we are on the same
page.

1) PMEM is exposed only via a block interface for legacy filesystems to
use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
Looking from outside, the IO either returns with EIO or succeeds. As a
result you cannot ever ger rid of bad blocks handling in the NVDIMM driver.

2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
mostly interested in. We could possibly do something more efficient than
what NVDIMM driver does however the complexity would be relatively high and
frankly I'm far from convinced this is really worth it. If there are so
many badblocks this would matter, the HW has IMHO bigger problems than
performance.

3) PMEM filesystem - there things are even more difficult as was already
noted elsewhere in the thread. But for now I'd like to leave those aside
not to complicate things too much.

Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
if the platform can recover from MCE, we can just always access persistent
memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
seems to already happen so we just need to make sure all places handle
returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
No need for bad blocks list at all, no slow down unless we hit a bad cell
and in that case who cares about performance when the data is gone...

For platforms that cannot recover from MCE - just buy better hardware ;).
Seriously, I have doubts people can seriously use a machine that will
unavoidably randomly reboot (as there is always a risk you hit error that
has not been uncovered by background scrub). But maybe for big cloud providers
the cost savings may offset for the inconvenience, I don't know. But still
for that case a bad blocks handling in NVDIMM code like we do now looks
good enough?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18  9:38                 ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-18  9:38 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Andiry Xu, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
> 
> At mount the the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.

So I believe we have to distinguish three cases so that we are on the same
page.

1) PMEM is exposed only via a block interface for legacy filesystems to
use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
Looking from outside, the IO either returns with EIO or succeeds. As a
result you cannot ever ger rid of bad blocks handling in the NVDIMM driver.

2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
mostly interested in. We could possibly do something more efficient than
what NVDIMM driver does however the complexity would be relatively high and
frankly I'm far from convinced this is really worth it. If there are so
many badblocks this would matter, the HW has IMHO bigger problems than
performance.

3) PMEM filesystem - there things are even more difficult as was already
noted elsewhere in the thread. But for now I'd like to leave those aside
not to complicate things too much.

Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
if the platform can recover from MCE, we can just always access persistent
memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
seems to already happen so we just need to make sure all places handle
returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
No need for bad blocks list at all, no slow down unless we hit a bad cell
and in that case who cares about performance when the data is gone...

For platforms that cannot recover from MCE - just buy better hardware ;).
Seriously, I have doubts people can seriously use a machine that will
unavoidably randomly reboot (as there is always a risk you hit error that
has not been uncovered by background scrub). But maybe for big cloud providers
the cost savings may offset for the inconvenience, I don't know. But still
for that case a bad blocks handling in NVDIMM code like we do now looks
good enough?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 22:14             ` Vishal Verma
@ 2017-01-18 10:16               ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-18 10:16 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Jan Kara, Slava Dubeyko, linux-nvdimm, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> Your note on the online repair does raise another tangentially related
> topic. Currently, if there are badblocks, writes via the bio submission
> path will clear the error (if the hardware is able to remap the bad
> locations). However, if the filesystem is mounted eith DAX, even
> non-mmap operations - read() and write() will go through the dax paths
> (dax_do_io()). We haven't found a good/agreeable way to perform
> error-clearing in this case. So currently, if a dax mounted filesystem
> has badblocks, the only way to clear those badblocks is to mount it
> without DAX, and overwrite/zero the bad locations. This is a pretty
> terrible user experience, and I'm hoping this can be solved in a better
> way.

Please remind me, what is the problem with DAX code doing necessary work to
clear the error when it gets EIO from memcpy on write?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 10:16               ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-18 10:16 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Jan Kara, Slava Dubeyko, linux-block, Linux FS Devel, lsf-pc,
	Viacheslav Dubeyko, linux-nvdimm

On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> Your note on the online repair does raise another tangentially related
> topic. Currently, if there are badblocks, writes via the bio submission
> path will clear the error (if the hardware is able to remap the bad
> locations). However, if the filesystem is mounted eith DAX, even
> non-mmap operations - read() and write() will go through the dax paths
> (dax_do_io()). We haven't found a good/agreeable way to perform
> error-clearing in this case. So currently, if a dax mounted filesystem
> has badblocks, the only way to clear those badblocks is to mount it
> without DAX, and overwrite/zero the bad locations. This is a pretty
> terrible user experience, and I'm hoping this can be solved in a better
> way.

Please remind me, what is the problem with DAX code doing necessary work to
clear the error when it gets EIO from memcpy on write?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18 10:16               ` Jan Kara
@ 2017-01-18 20:39                 ` Jeff Moyer
  -1 siblings, 0 replies; 89+ messages in thread
From: Jeff Moyer @ 2017-01-18 20:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, linux-block, Viacheslav Dubeyko,
	linux-nvdimm@lists.01.org, Linux FS Devel, lsf-pc

Jan Kara <jack@suse.cz> writes:

> On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> Your note on the online repair does raise another tangentially related
>> topic. Currently, if there are badblocks, writes via the bio submission
>> path will clear the error (if the hardware is able to remap the bad
>> locations). However, if the filesystem is mounted eith DAX, even
>> non-mmap operations - read() and write() will go through the dax paths
>> (dax_do_io()). We haven't found a good/agreeable way to perform
>> error-clearing in this case. So currently, if a dax mounted filesystem
>> has badblocks, the only way to clear those badblocks is to mount it
>> without DAX, and overwrite/zero the bad locations. This is a pretty
>> terrible user experience, and I'm hoping this can be solved in a better
>> way.
>
> Please remind me, what is the problem with DAX code doing necessary work to
> clear the error when it gets EIO from memcpy on write?

You won't get an MCE for a store;  only loads generate them.

Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Cheers,
Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 20:39                 ` Jeff Moyer
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff Moyer @ 2017-01-18 20:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vishal Verma, Slava Dubeyko, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Jan Kara <jack@suse.cz> writes:

> On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> Your note on the online repair does raise another tangentially related
>> topic. Currently, if there are badblocks, writes via the bio submission
>> path will clear the error (if the hardware is able to remap the bad
>> locations). However, if the filesystem is mounted eith DAX, even
>> non-mmap operations - read() and write() will go through the dax paths
>> (dax_do_io()). We haven't found a good/agreeable way to perform
>> error-clearing in this case. So currently, if a dax mounted filesystem
>> has badblocks, the only way to clear those badblocks is to mount it
>> without DAX, and overwrite/zero the bad locations. This is a pretty
>> terrible user experience, and I'm hoping this can be solved in a better
>> way.
>
> Please remind me, what is the problem with DAX code doing necessary work to
> clear the error when it gets EIO from memcpy on write?

You won't get an MCE for a store;  only loads generate them.

Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-17 23:15             ` Slava Dubeyko
@ 2017-01-18 20:47               ` Jeff Moyer
  -1 siblings, 0 replies; 89+ messages in thread
From: Jeff Moyer @ 2017-01-18 20:47 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com> writes:

>> Well, the situation with NVM is more like with DRAM AFAIU. It is quite reliable
>> but given the size the probability *some* cell has degraded is quite high.
>> And similar to DRAM you'll get MCE (Machine Check Exception) when you try
>> to read such cell. As Vishal wrote, the hardware does some background scrubbing
>> and relocates stuff early if needed but nothing is 100%.
>
> My understanding that hardware does the remapping the affected address
> range (64 bytes, for example) but it doesn't move/migrate the stored
> data in this address range. So, it sounds slightly weird. Because it
> means that no guarantee to retrieve the stored data. It sounds that
> file system should be aware about this and has to be heavily protected
> by some replication or erasure coding scheme. Otherwise, if the
> hardware does everything for us (remap the affected address region and
> move data into a new address region) then why does file system need to
> know about the affected address regions?

The data is lost, that's why you're getting an ECC.  It's tantamount to
-EIO for a disk block access.

>> The reason why we play games with badblocks is to avoid those MCEs
>> (i.e., even trying to read the data we know that are bad). Even if it would
>> be rare event, MCE may mean the machine just immediately reboots
>> (although I find such platforms hardly usable with NVM then) and that
>> is no good. And even on hardware platforms that allow for more graceful
>> recovery from MCE it is asynchronous in its nature and our error handling
>> around IO is all synchronous so it is difficult to join these two models together.
>>
>> But I think it is a good question to ask whether we cannot improve on MCE handling
>> instead of trying to avoid them and pushing around responsibility for handling
>> bad blocks. Actually I thought someone was working on that.
>> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
>> well identified anyway so that we can consult the badblocks list) so that it MCE
>> happens during these accesses, we note it somewhere and at the end of the magic
>> block we will just pick up the errors and report them back?
>
> Let's imagine that the affected address range will equal to 64 bytes. It sounds for me
> that for the case of block device it will affect the whole logical
> block (4 KB).

512 bytes, and yes, that's the granularity at which we track errors in
the block layer, so that's the minimum amount of data you lose.

> If the failure rate of address ranges could be significant then it
> would affect a lot of logical blocks.

Who would buy hardware like that?

> The situation is more critical for the case of DAX approach. Correct
> me if I wrong but my understanding is the goal of DAX is to provide
> the direct access to file's memory pages with minimal file system
> overhead. So, it looks like that raising bad block issue on file
> system level will affect a user-space application. Because, finally,
> user-space application will need to process such trouble (bad block
> issue). It sounds for me as really weird situation. What can protect a
> user-space application from encountering the issue with partially
> incorrect memory page?

Applications need to deal with -EIO today.  This is the same sort of
thing.  If an application trips over a bad block during a load from
persistent memory, they will get a signal, and they can either handle it
or not.

Have a read through this specification and see if it clears anything up
for you:
  http://www.snia.org/tech_activities/standards/curr_standards/npm

Cheers,
Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 20:47               ` Jeff Moyer
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff Moyer @ 2017-01-18 20:47 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com> writes:

>> Well, the situation with NVM is more like with DRAM AFAIU. It is quite reliable
>> but given the size the probability *some* cell has degraded is quite high.
>> And similar to DRAM you'll get MCE (Machine Check Exception) when you try
>> to read such cell. As Vishal wrote, the hardware does some background scrubbing
>> and relocates stuff early if needed but nothing is 100%.
>
> My understanding that hardware does the remapping the affected address
> range (64 bytes, for example) but it doesn't move/migrate the stored
> data in this address range. So, it sounds slightly weird. Because it
> means that no guarantee to retrieve the stored data. It sounds that
> file system should be aware about this and has to be heavily protected
> by some replication or erasure coding scheme. Otherwise, if the
> hardware does everything for us (remap the affected address region and
> move data into a new address region) then why does file system need to
> know about the affected address regions?

The data is lost, that's why you're getting an ECC.  It's tantamount to
-EIO for a disk block access.

>> The reason why we play games with badblocks is to avoid those MCEs
>> (i.e., even trying to read the data we know that are bad). Even if it would
>> be rare event, MCE may mean the machine just immediately reboots
>> (although I find such platforms hardly usable with NVM then) and that
>> is no good. And even on hardware platforms that allow for more graceful
>> recovery from MCE it is asynchronous in its nature and our error handling
>> around IO is all synchronous so it is difficult to join these two models together.
>>
>> But I think it is a good question to ask whether we cannot improve on MCE handling
>> instead of trying to avoid them and pushing around responsibility for handling
>> bad blocks. Actually I thought someone was working on that.
>> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
>> well identified anyway so that we can consult the badblocks list) so that it MCE
>> happens during these accesses, we note it somewhere and at the end of the magic
>> block we will just pick up the errors and report them back?
>
> Let's imagine that the affected address range will equal to 64 bytes. It sounds for me
> that for the case of block device it will affect the whole logical
> block (4 KB).

512 bytes, and yes, that's the granularity at which we track errors in
the block layer, so that's the minimum amount of data you lose.

> If the failure rate of address ranges could be significant then it
> would affect a lot of logical blocks.

Who would buy hardware like that?

> The situation is more critical for the case of DAX approach. Correct
> me if I wrong but my understanding is the goal of DAX is to provide
> the direct access to file's memory pages with minimal file system
> overhead. So, it looks like that raising bad block issue on file
> system level will affect a user-space application. Because, finally,
> user-space application will need to process such trouble (bad block
> issue). It sounds for me as really weird situation. What can protect a
> user-space application from encountering the issue with partially
> incorrect memory page?

Applications need to deal with -EIO today.  This is the same sort of
thing.  If an application trips over a bad block during a load from
persistent memory, they will get a signal, and they can either handle it
or not.

Have a read through this specification and see if it clears anything up
for you:
  http://www.snia.org/tech_activities/standards/curr_standards/npm

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18 20:39                 ` Jeff Moyer
@ 2017-01-18 21:02                   ` Darrick J. Wong
  -1 siblings, 0 replies; 89+ messages in thread
From: Darrick J. Wong @ 2017-01-18 21:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Slava Dubeyko, linux-block, Viacheslav Dubeyko,
	linux-nvdimm@lists.01.org, Linux FS Devel, lsf-pc

On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> Your note on the online repair does raise another tangentially related
> >> topic. Currently, if there are badblocks, writes via the bio submission
> >> path will clear the error (if the hardware is able to remap the bad
> >> locations). However, if the filesystem is mounted eith DAX, even
> >> non-mmap operations - read() and write() will go through the dax paths
> >> (dax_do_io()). We haven't found a good/agreeable way to perform
> >> error-clearing in this case. So currently, if a dax mounted filesystem
> >> has badblocks, the only way to clear those badblocks is to mount it
> >> without DAX, and overwrite/zero the bad locations. This is a pretty
> >> terrible user experience, and I'm hoping this can be solved in a better
> >> way.
> >
> > Please remind me, what is the problem with DAX code doing necessary work to
> > clear the error when it gets EIO from memcpy on write?
> 
> You won't get an MCE for a store;  only loads generate them.
> 
> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Not necessarily; XFS usually implements this by punching out the range
and then reallocating it as unwritten blocks.

--D

> 
> Cheers,
> Jeff
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 21:02                   ` Darrick J. Wong
  0 siblings, 0 replies; 89+ messages in thread
From: Darrick J. Wong @ 2017-01-18 21:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Vishal Verma, Slava Dubeyko, linux-nvdimm@lists.01.org,
	linux-block, Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> Your note on the online repair does raise another tangentially related
> >> topic. Currently, if there are badblocks, writes via the bio submission
> >> path will clear the error (if the hardware is able to remap the bad
> >> locations). However, if the filesystem is mounted eith DAX, even
> >> non-mmap operations - read() and write() will go through the dax paths
> >> (dax_do_io()). We haven't found a good/agreeable way to perform
> >> error-clearing in this case. So currently, if a dax mounted filesystem
> >> has badblocks, the only way to clear those badblocks is to mount it
> >> without DAX, and overwrite/zero the bad locations. This is a pretty
> >> terrible user experience, and I'm hoping this can be solved in a better
> >> way.
> >
> > Please remind me, what is the problem with DAX code doing necessary work to
> > clear the error when it gets EIO from memcpy on write?
> 
> You won't get an MCE for a store;  only loads generate them.
> 
> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Not necessarily; XFS usually implements this by punching out the range
and then reallocating it as unwritten blocks.

--D

> 
> Cheers,
> Jeff
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18 21:02                   ` Darrick J. Wong
@ 2017-01-18 21:32                     ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2017-01-18 21:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Slava Dubeyko, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> >> Your note on the online repair does raise another tangentially related
>> >> topic. Currently, if there are badblocks, writes via the bio submission
>> >> path will clear the error (if the hardware is able to remap the bad
>> >> locations). However, if the filesystem is mounted eith DAX, even
>> >> non-mmap operations - read() and write() will go through the dax paths
>> >> (dax_do_io()). We haven't found a good/agreeable way to perform
>> >> error-clearing in this case. So currently, if a dax mounted filesystem
>> >> has badblocks, the only way to clear those badblocks is to mount it
>> >> without DAX, and overwrite/zero the bad locations. This is a pretty
>> >> terrible user experience, and I'm hoping this can be solved in a better
>> >> way.
>> >
>> > Please remind me, what is the problem with DAX code doing necessary work to
>> > clear the error when it gets EIO from memcpy on write?
>>
>> You won't get an MCE for a store;  only loads generate them.
>>
>> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?
>
> Not necessarily; XFS usually implements this by punching out the range
> and then reallocating it as unwritten blocks.
>

That does clear the error because the unwritten blocks are zeroed and
errors cleared when they become allocated again.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 21:32                     ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2017-01-18 21:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Moyer, Jan Kara, Slava Dubeyko, linux-block,
	Viacheslav Dubeyko, linux-nvdimm@lists.01.org, Linux FS Devel,
	lsf-pc

On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> >> Your note on the online repair does raise another tangentially related
>> >> topic. Currently, if there are badblocks, writes via the bio submission
>> >> path will clear the error (if the hardware is able to remap the bad
>> >> locations). However, if the filesystem is mounted eith DAX, even
>> >> non-mmap operations - read() and write() will go through the dax paths
>> >> (dax_do_io()). We haven't found a good/agreeable way to perform
>> >> error-clearing in this case. So currently, if a dax mounted filesystem
>> >> has badblocks, the only way to clear those badblocks is to mount it
>> >> without DAX, and overwrite/zero the bad locations. This is a pretty
>> >> terrible user experience, and I'm hoping this can be solved in a better
>> >> way.
>> >
>> > Please remind me, what is the problem with DAX code doing necessary work to
>> > clear the error when it gets EIO from memcpy on write?
>>
>> You won't get an MCE for a store;  only loads generate them.
>>
>> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?
>
> Not necessarily; XFS usually implements this by punching out the range
> and then reallocating it as unwritten blocks.
>

That does clear the error because the unwritten blocks are zeroed and
errors cleared when they become allocated again.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18 21:32                     ` Dan Williams
  (?)
@ 2017-01-18 21:56                         ` Verma, Vishal L
  -1 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-18 21:56 UTC (permalink / raw)
  To: Williams, Dan J, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: jack-AlSwsSmVLrQ, Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > Jan Kara <jack@suse.cz> writes:
> > > 
> > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > Your note on the online repair does raise another tangentially
> > > > > related
> > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > submission
> > > > > path will clear the error (if the hardware is able to remap
> > > > > the bad
> > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > even
> > > > > non-mmap operations - read() and write() will go through the
> > > > > dax paths
> > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > perform
> > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > filesystem
> > > > > has badblocks, the only way to clear those badblocks is to
> > > > > mount it
> > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > pretty
> > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > a better
> > > > > way.
> > > > 
> > > > Please remind me, what is the problem with DAX code doing
> > > > necessary work to
> > > > clear the error when it gets EIO from memcpy on write?
> > > 
> > > You won't get an MCE for a store;  only loads generate them.
> > > 
> > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > -o dax?
> > 
> > Not necessarily; XFS usually implements this by punching out the
> > range
> > and then reallocating it as unwritten blocks.
> > 
> 
> That does clear the error because the unwritten blocks are zeroed and
> errors cleared when they become allocated again.

Yes, the problem was that writes won't clear errors. zeroing through
either hole-punch, truncate, unlinking the file should all work
(assuming the hole-punch or truncate ranges wholly contain the
'badblock' sector).


> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 21:56                         ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-18 21:56 UTC (permalink / raw)
  To: Williams, Dan J, darrick.wong
  Cc: Vyacheslav.Dubeyko, jack, linux-block, linux-fsdevel, lsf-pc,
	linux-nvdimm, slava

T24gV2VkLCAyMDE3LTAxLTE4IGF0IDEzOjMyIC0wODAwLCBEYW4gV2lsbGlhbXMgd3JvdGU6DQo+
IE9uIFdlZCwgSmFuIDE4LCAyMDE3IGF0IDE6MDIgUE0sIERhcnJpY2sgSi4gV29uZw0KPiA8ZGFy
cmljay53b25nQG9yYWNsZS5jb20+IHdyb3RlOg0KPiA+IE9uIFdlZCwgSmFuIDE4LCAyMDE3IGF0
IDAzOjM5OjE3UE0gLTA1MDAsIEplZmYgTW95ZXIgd3JvdGU6DQo+ID4gPiBKYW4gS2FyYSA8amFj
a0BzdXNlLmN6PiB3cml0ZXM6DQo+ID4gPiANCj4gPiA+ID4gT24gVHVlIDE3LTAxLTE3IDE1OjE0
OjIxLCBWaXNoYWwgVmVybWEgd3JvdGU6DQo+ID4gPiA+ID4gWW91ciBub3RlIG9uIHRoZSBvbmxp
bmUgcmVwYWlyIGRvZXMgcmFpc2UgYW5vdGhlciB0YW5nZW50aWFsbHkNCj4gPiA+ID4gPiByZWxh
dGVkDQo+ID4gPiA+ID4gdG9waWMuIEN1cnJlbnRseSwgaWYgdGhlcmUgYXJlIGJhZGJsb2Nrcywg
d3JpdGVzIHZpYSB0aGUgYmlvDQo+ID4gPiA+ID4gc3VibWlzc2lvbg0KPiA+ID4gPiA+IHBhdGgg
d2lsbCBjbGVhciB0aGUgZXJyb3IgKGlmIHRoZSBoYXJkd2FyZSBpcyBhYmxlIHRvIHJlbWFwDQo+
ID4gPiA+ID4gdGhlIGJhZA0KPiA+ID4gPiA+IGxvY2F0aW9ucykuIEhvd2V2ZXIsIGlmIHRoZSBm
aWxlc3lzdGVtIGlzIG1vdW50ZWQgZWl0aCBEQVgsDQo+ID4gPiA+ID4gZXZlbg0KPiA+ID4gPiA+
IG5vbi1tbWFwIG9wZXJhdGlvbnMgLSByZWFkKCkgYW5kIHdyaXRlKCkgd2lsbCBnbyB0aHJvdWdo
IHRoZQ0KPiA+ID4gPiA+IGRheCBwYXRocw0KPiA+ID4gPiA+IChkYXhfZG9faW8oKSkuIFdlIGhh
dmVuJ3QgZm91bmQgYSBnb29kL2FncmVlYWJsZSB3YXkgdG8NCj4gPiA+ID4gPiBwZXJmb3JtDQo+
ID4gPiA+ID4gZXJyb3ItY2xlYXJpbmcgaW4gdGhpcyBjYXNlLiBTbyBjdXJyZW50bHksIGlmIGEg
ZGF4IG1vdW50ZWQNCj4gPiA+ID4gPiBmaWxlc3lzdGVtDQo+ID4gPiA+ID4gaGFzIGJhZGJsb2Nr
cywgdGhlIG9ubHkgd2F5IHRvIGNsZWFyIHRob3NlIGJhZGJsb2NrcyBpcyB0bw0KPiA+ID4gPiA+
IG1vdW50IGl0DQo+ID4gPiA+ID4gd2l0aG91dCBEQVgsIGFuZCBvdmVyd3JpdGUvemVybyB0aGUg
YmFkIGxvY2F0aW9ucy4gVGhpcyBpcyBhDQo+ID4gPiA+ID4gcHJldHR5DQo+ID4gPiA+ID4gdGVy
cmlibGUgdXNlciBleHBlcmllbmNlLCBhbmQgSSdtIGhvcGluZyB0aGlzIGNhbiBiZSBzb2x2ZWQg
aW4NCj4gPiA+ID4gPiBhIGJldHRlcg0KPiA+ID4gPiA+IHdheS4NCj4gPiA+ID4gDQo+ID4gPiA+
IFBsZWFzZSByZW1pbmQgbWUsIHdoYXQgaXMgdGhlIHByb2JsZW0gd2l0aCBEQVggY29kZSBkb2lu
Zw0KPiA+ID4gPiBuZWNlc3Nhcnkgd29yayB0bw0KPiA+ID4gPiBjbGVhciB0aGUgZXJyb3Igd2hl
biBpdCBnZXRzIEVJTyBmcm9tIG1lbWNweSBvbiB3cml0ZT8NCj4gPiA+IA0KPiA+ID4gWW91IHdv
bid0IGdldCBhbiBNQ0UgZm9yIGEgc3RvcmU7wqDCoG9ubHkgbG9hZHMgZ2VuZXJhdGUgdGhlbS4N
Cj4gPiA+IA0KPiA+ID4gV29uJ3QgZmFsbG9jYXRlIEZMX1pFUk9fUkFOR0UgY2xlYXIgYmFkIGJs
b2NrcyB3aGVuIG1vdW50ZWQgd2l0aA0KPiA+ID4gLW8gZGF4Pw0KPiA+IA0KPiA+IE5vdCBuZWNl
c3NhcmlseTsgWEZTIHVzdWFsbHkgaW1wbGVtZW50cyB0aGlzIGJ5IHB1bmNoaW5nIG91dCB0aGUN
Cj4gPiByYW5nZQ0KPiA+IGFuZCB0aGVuIHJlYWxsb2NhdGluZyBpdCBhcyB1bndyaXR0ZW4gYmxv
Y2tzLg0KPiA+IA0KPiANCj4gVGhhdCBkb2VzIGNsZWFyIHRoZSBlcnJvciBiZWNhdXNlIHRoZSB1
bndyaXR0ZW4gYmxvY2tzIGFyZSB6ZXJvZWQgYW5kDQo+IGVycm9ycyBjbGVhcmVkIHdoZW4gdGhl
eSBiZWNvbWUgYWxsb2NhdGVkIGFnYWluLg0KDQpZZXMsIHRoZSBwcm9ibGVtIHdhcyB0aGF0IHdy
aXRlcyB3b24ndCBjbGVhciBlcnJvcnMuIHplcm9pbmcgdGhyb3VnaA0KZWl0aGVyIGhvbGUtcHVu
Y2gsIHRydW5jYXRlLCB1bmxpbmtpbmcgdGhlIGZpbGUgc2hvdWxkIGFsbCB3b3JrDQooYXNzdW1p
bmcgdGhlIGhvbGUtcHVuY2ggb3IgdHJ1bmNhdGUgcmFuZ2VzIHdob2xseSBjb250YWluIHRoZQ0K
J2JhZGJsb2NrJyBzZWN0b3IpLg0KDQoNCj4gX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f
X19fX19fX19fX19fX19fX18NCj4gTGludXgtbnZkaW1tIG1haWxpbmcgbGlzdA0KPiBMaW51eC1u
dmRpbW1AbGlzdHMuMDEub3JnDQo+IGh0dHBzOi8vbGlzdHMuMDEub3JnL21haWxtYW4vbGlzdGlu
Zm8vbGludXgtbnZkaW1t

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-18 21:56                         ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-18 21:56 UTC (permalink / raw)
  To: Williams, Dan J, darrick.wong
  Cc: Vyacheslav.Dubeyko, jack, linux-block, linux-fsdevel, lsf-pc,
	linux-nvdimm, slava

On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > Jan Kara <jack@suse.cz> writes:
> > > 
> > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > Your note on the online repair does raise another tangentially
> > > > > related
> > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > submission
> > > > > path will clear the error (if the hardware is able to remap
> > > > > the bad
> > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > even
> > > > > non-mmap operations - read() and write() will go through the
> > > > > dax paths
> > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > perform
> > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > filesystem
> > > > > has badblocks, the only way to clear those badblocks is to
> > > > > mount it
> > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > pretty
> > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > a better
> > > > > way.
> > > > 
> > > > Please remind me, what is the problem with DAX code doing
> > > > necessary work to
> > > > clear the error when it gets EIO from memcpy on write?
> > > 
> > > You won't get an MCE for a store;  only loads generate them.
> > > 
> > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > -o dax?
> > 
> > Not necessarily; XFS usually implements this by punching out the
> > range
> > and then reallocating it as unwritten blocks.
> > 
> 
> That does clear the error because the unwritten blocks are zeroed and
> errors cleared when they become allocated again.

Yes, the problem was that writes won't clear errors. zeroing through
either hole-punch, truncate, unlinking the file should all work
(assuming the hole-punch or truncate ranges wholly contain the
'badblock' sector).


> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18 20:47               ` Jeff Moyer
  (?)
@ 2017-01-19  2:56                 ` Slava Dubeyko
  -1 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-19  2:56 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc


-----Original Message-----
From: Jeff Moyer [mailto:jmoyer@redhat.com] 
Sent: Wednesday, January 18, 2017 12:48 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Jan Kara <jack@suse.cz>; linux-nvdimm@lists.01.org <linux-nvdimm@ml01.01.org>; linux-block@vger.kernel.org; Viacheslav Dubeyko <slava@dubeyko.com>; Linux FS Devel <linux-fsdevel@vger.kernel.org>; lsf-pc@lists.linux-foundation.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

>>> Well, the situation with NVM is more like with DRAM AFAIU. It is 
>>> quite reliable but given the size the probability *some* cell has degraded is quite high.
>>> And similar to DRAM you'll get MCE (Machine Check Exception) when you 
>>> try to read such cell. As Vishal wrote, the hardware does some 
>>> background scrubbing and relocates stuff early if needed but nothing is 100%.
>>
>> My understanding that hardware does the remapping the affected address 
>> range (64 bytes, for example) but it doesn't move/migrate the stored 
>> data in this address range. So, it sounds slightly weird. Because it 
>> means that no guarantee to retrieve the stored data. It sounds that 
>> file system should be aware about this and has to be heavily protected 
>> by some replication or erasure coding scheme. Otherwise, if the 
>> hardware does everything for us (remap the affected address region and 
>> move data into a new address region) then why does file system need to 
>> know about the affected address regions?
>
>The data is lost, that's why you're getting an ECC.  It's tantamount to -EIO for a disk block access.

I see the three possible cases here:
(1) bad block has been discovered (no remap, no recovering) -> data is lost; -EIO for a disk block access, block is always bad;
(2) bad block has been discovered and remapped -> data is lost; -EIO for a disk block access.
(3) bad block has been discovered, remapped and recovered -> no data is lost.

>> Let's imagine that the affected address range will equal to 64 bytes. 
>> It sounds for me that for the case of block device it will affect the 
>> whole logical block (4 KB).
>
> 512 bytes, and yes, that's the granularity at which we track errors in the block layer, so that's the minimum amount of data you lose.

I think it depends what granularity hardware supports. It could be 512 bytes, 4 KB, maybe greater.

>> The situation is more critical for the case of DAX approach. Correct 
>> me if I wrong but my understanding is the goal of DAX is to provide 
>> the direct access to file's memory pages with minimal file system 
>> overhead. So, it looks like that raising bad block issue on file 
>> system level will affect a user-space application. Because, finally, 
>> user-space application will need to process such trouble (bad block 
>> issue). It sounds for me as really weird situation. What can protect a 
>> user-space application from encountering the issue with partially 
>> incorrect memory page?
>
> Applications need to deal with -EIO today.  This is the same sort of thing.
> If an application trips over a bad block during a load from persistent memory,
> they will get a signal, and they can either handle it or not.
>
> Have a read through this specification and see if it clears anything up for you:
>  http://www.snia.org/tech_activities/standards/curr_standards/npm

Thank you for sharing this. So, if a user-space application follows to the
NVM Programming Model then it will be able to survive by means of catching
and processing the exceptions. But these applications have to be implemented yet.
Also such applications need in special technique(s) of recovering. It sounds
that legacy user-space applications are unable to survive for the NVM.PM.FILE mode
in the case of load/store operation's failure.

Thanks,
Vyacheslav Dubeyko.

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19  2:56                 ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-19  2:56 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc


-----Original Message-----
From: Jeff Moyer [mailto:jmoyer@redhat.com]=20
Sent: Wednesday, January 18, 2017 12:48 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Jan Kara <jack@suse.cz>; linux-nvdimm@lists.01.org <linux-nvdimm@ml01.0=
1.org>; linux-block@vger.kernel.org; Viacheslav Dubeyko <slava@dubeyko.com>=
; Linux FS Devel <linux-fsdevel@vger.kernel.org>; lsf-pc@lists.linux-founda=
tion.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in f=
ilesystems

>>> Well, the situation with NVM is more like with DRAM AFAIU. It is=20
>>> quite reliable but given the size the probability *some* cell has degra=
ded is quite high.
>>> And similar to DRAM you'll get MCE (Machine Check Exception) when you=20
>>> try to read such cell. As Vishal wrote, the hardware does some=20
>>> background scrubbing and relocates stuff early if needed but nothing is=
 100%.
>>
>> My understanding that hardware does the remapping the affected address=20
>> range (64 bytes, for example) but it doesn't move/migrate the stored=20
>> data in this address range. So, it sounds slightly weird. Because it=20
>> means that no guarantee to retrieve the stored data. It sounds that=20
>> file system should be aware about this and has to be heavily protected=20
>> by some replication or erasure coding scheme. Otherwise, if the=20
>> hardware does everything for us (remap the affected address region and=20
>> move data into a new address region) then why does file system need to=20
>> know about the affected address regions?
>
>The data is lost, that's why you're getting an ECC.  It's tantamount to -E=
IO for a disk block access.

I see the three possible cases here:
(1) bad block has been discovered (no remap, no recovering) -> data is lost=
; -EIO for a disk block access, block is always bad;
(2) bad block has been discovered and remapped -> data is lost; -EIO for a =
disk block access.
(3) bad block has been discovered, remapped and recovered -> no data is los=
t.

>> Let's imagine that the affected address range will equal to 64 bytes.=20
>> It sounds for me that for the case of block device it will affect the=20
>> whole logical block (4 KB).
>
> 512 bytes, and yes, that's the granularity at which we track errors in th=
e block layer, so that's the minimum amount of data you lose.

I think it depends what granularity hardware supports. It could be 512 byte=
s, 4 KB, maybe greater.

>> The situation is more critical for the case of DAX approach. Correct=20
>> me if I wrong but my understanding is the goal of DAX is to provide=20
>> the direct access to file's memory pages with minimal file system=20
>> overhead. So, it looks like that raising bad block issue on file=20
>> system level will affect a user-space application. Because, finally,=20
>> user-space application will need to process such trouble (bad block=20
>> issue). It sounds for me as really weird situation. What can protect a=20
>> user-space application from encountering the issue with partially=20
>> incorrect memory page?
>
> Applications need to deal with -EIO today.  This is the same sort of thin=
g.
> If an application trips over a bad block during a load from persistent me=
mory,
> they will get a signal, and they can either handle it or not.
>
> Have a read through this specification and see if it clears anything up f=
or you:
>  http://www.snia.org/tech_activities/standards/curr_standards/npm

Thank you for sharing this. So, if a user-space application follows to the
NVM Programming Model then it will be able to survive by means of catching
and processing the exceptions. But these applications have to be implemente=
d yet.
Also such applications need in special technique(s) of recovering. It sound=
s
that legacy user-space applications are unable to survive for the NVM.PM.FI=
LE mode
in the case of load/store operation's failure.

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19  2:56                 ` Slava Dubeyko
  0 siblings, 0 replies; 89+ messages in thread
From: Slava Dubeyko @ 2017-01-19  2:56 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc


-----Original Message-----
From: Jeff Moyer [mailto:jmoyer@redhat.com] 
Sent: Wednesday, January 18, 2017 12:48 PM
To: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>
Cc: Jan Kara <jack@suse.cz>; linux-nvdimm@lists.01.org <linux-nvdimm@ml01.01.org>; linux-block@vger.kernel.org; Viacheslav Dubeyko <slava@dubeyko.com>; Linux FS Devel <linux-fsdevel@vger.kernel.org>; lsf-pc@lists.linux-foundation.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

>>> Well, the situation with NVM is more like with DRAM AFAIU. It is 
>>> quite reliable but given the size the probability *some* cell has degraded is quite high.
>>> And similar to DRAM you'll get MCE (Machine Check Exception) when you 
>>> try to read such cell. As Vishal wrote, the hardware does some 
>>> background scrubbing and relocates stuff early if needed but nothing is 100%.
>>
>> My understanding that hardware does the remapping the affected address 
>> range (64 bytes, for example) but it doesn't move/migrate the stored 
>> data in this address range. So, it sounds slightly weird. Because it 
>> means that no guarantee to retrieve the stored data. It sounds that 
>> file system should be aware about this and has to be heavily protected 
>> by some replication or erasure coding scheme. Otherwise, if the 
>> hardware does everything for us (remap the affected address region and 
>> move data into a new address region) then why does file system need to 
>> know about the affected address regions?
>
>The data is lost, that's why you're getting an ECC.  It's tantamount to -EIO for a disk block access.

I see the three possible cases here:
(1) bad block has been discovered (no remap, no recovering) -> data is lost; -EIO for a disk block access, block is always bad;
(2) bad block has been discovered and remapped -> data is lost; -EIO for a disk block access.
(3) bad block has been discovered, remapped and recovered -> no data is lost.

>> Let's imagine that the affected address range will equal to 64 bytes. 
>> It sounds for me that for the case of block device it will affect the 
>> whole logical block (4 KB).
>
> 512 bytes, and yes, that's the granularity at which we track errors in the block layer, so that's the minimum amount of data you lose.

I think it depends what granularity hardware supports. It could be 512 bytes, 4 KB, maybe greater.

>> The situation is more critical for the case of DAX approach. Correct 
>> me if I wrong but my understanding is the goal of DAX is to provide 
>> the direct access to file's memory pages with minimal file system 
>> overhead. So, it looks like that raising bad block issue on file 
>> system level will affect a user-space application. Because, finally, 
>> user-space application will need to process such trouble (bad block 
>> issue). It sounds for me as really weird situation. What can protect a 
>> user-space application from encountering the issue with partially 
>> incorrect memory page?
>
> Applications need to deal with -EIO today.  This is the same sort of thing.
> If an application trips over a bad block during a load from persistent memory,
> they will get a signal, and they can either handle it or not.
>
> Have a read through this specification and see if it clears anything up for you:
>  http://www.snia.org/tech_activities/standards/curr_standards/npm

Thank you for sharing this. So, if a user-space application follows to the
NVM Programming Model then it will be able to survive by means of catching
and processing the exceptions. But these applications have to be implemented yet.
Also such applications need in special technique(s) of recovering. It sounds
that legacy user-space applications are unable to survive for the NVM.PM.FILE mode
in the case of load/store operation's failure.

Thanks,
Vyacheslav Dubeyko.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18 21:56                         ` Verma, Vishal L
@ 2017-01-19  8:10                             ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-19  8:10 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: jack-AlSwsSmVLrQ, Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> > <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> > > > 
> > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > Your note on the online repair does raise another tangentially
> > > > > > related
> > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > submission
> > > > > > path will clear the error (if the hardware is able to remap
> > > > > > the bad
> > > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > > even
> > > > > > non-mmap operations - read() and write() will go through the
> > > > > > dax paths
> > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > perform
> > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > filesystem
> > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > mount it
> > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > pretty
> > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > a better
> > > > > > way.
> > > > > 
> > > > > Please remind me, what is the problem with DAX code doing
> > > > > necessary work to
> > > > > clear the error when it gets EIO from memcpy on write?
> > > > 
> > > > You won't get an MCE for a store;  only loads generate them.
> > > > 
> > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > -o dax?
> > > 
> > > Not necessarily; XFS usually implements this by punching out the
> > > range
> > > and then reallocating it as unwritten blocks.
> > > 
> > 
> > That does clear the error because the unwritten blocks are zeroed and
> > errors cleared when they become allocated again.
> 
> Yes, the problem was that writes won't clear errors. zeroing through
> either hole-punch, truncate, unlinking the file should all work
> (assuming the hole-punch or truncate ranges wholly contain the
> 'badblock' sector).

Let me repeat my question: You have mentioned that if we do IO through DAX,
writes won't clear errors and we should fall back to normal block path to
do write to clear the error. What does prevent us from directly clearing
the error from DAX path?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19  8:10                             ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-19  8:10 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: Williams, Dan J, darrick.wong, jack, Vyacheslav.Dubeyko,
	linux-nvdimm, linux-block, slava, linux-fsdevel, lsf-pc

On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> > <darrick.wong@oracle.com> wrote:
> > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > Jan Kara <jack@suse.cz> writes:
> > > > 
> > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > Your note on the online repair does raise another tangentially
> > > > > > related
> > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > submission
> > > > > > path will clear the error (if the hardware is able to remap
> > > > > > the bad
> > > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > > even
> > > > > > non-mmap operations - read() and write() will go through the
> > > > > > dax paths
> > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > perform
> > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > filesystem
> > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > mount it
> > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > pretty
> > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > a better
> > > > > > way.
> > > > > 
> > > > > Please remind me, what is the problem with DAX code doing
> > > > > necessary work to
> > > > > clear the error when it gets EIO from memcpy on write?
> > > > 
> > > > You won't get an MCE for a store;��only loads generate them.
> > > > 
> > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > -o dax?
> > > 
> > > Not necessarily; XFS usually implements this by punching out the
> > > range
> > > and then reallocating it as unwritten blocks.
> > > 
> > 
> > That does clear the error because the unwritten blocks are zeroed and
> > errors cleared when they become allocated again.
> 
> Yes, the problem was that writes won't clear errors. zeroing through
> either hole-punch, truncate, unlinking the file should all work
> (assuming the hole-punch or truncate ranges wholly contain the
> 'badblock' sector).

Let me repeat my question: You have mentioned that if we do IO through DAX,
writes won't clear errors and we should fall back to normal block path to
do write to clear the error. What does prevent us from directly clearing
the error from DAX path?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-19  8:10                             ` Jan Kara
@ 2017-01-19 18:59                                 ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-19 18:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 01/19, Jan Kara wrote:
> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> > > <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > > Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> > > > > 
> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > > Your note on the online repair does raise another tangentially
> > > > > > > related
> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > > submission
> > > > > > > path will clear the error (if the hardware is able to remap
> > > > > > > the bad
> > > > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > > > even
> > > > > > > non-mmap operations - read() and write() will go through the
> > > > > > > dax paths
> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > > perform
> > > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > > filesystem
> > > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > > mount it
> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > > pretty
> > > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > > a better
> > > > > > > way.
> > > > > > 
> > > > > > Please remind me, what is the problem with DAX code doing
> > > > > > necessary work to
> > > > > > clear the error when it gets EIO from memcpy on write?
> > > > > 
> > > > > You won't get an MCE for a store;  only loads generate them.
> > > > > 
> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > > -o dax?
> > > > 
> > > > Not necessarily; XFS usually implements this by punching out the
> > > > range
> > > > and then reallocating it as unwritten blocks.
> > > > 
> > > 
> > > That does clear the error because the unwritten blocks are zeroed and
> > > errors cleared when they become allocated again.
> > 
> > Yes, the problem was that writes won't clear errors. zeroing through
> > either hole-punch, truncate, unlinking the file should all work
> > (assuming the hole-punch or truncate ranges wholly contain the
> > 'badblock' sector).
> 
> Let me repeat my question: You have mentioned that if we do IO through DAX,
> writes won't clear errors and we should fall back to normal block path to
> do write to clear the error. What does prevent us from directly clearing
> the error from DAX path?
> 
With DAX, all IO goes through DAX paths. There are two cases:
1. mmap and loads/stores: Obviously there is no kernel intervention
here, and no badblocks handling is possible.
2. read() or write() IO: In the absence of dax, this would go through
the bio submission path, through the pmem driver, and that would handle
error clearing. With DAX, this goes through dax_iomap_actor, which also
doesn't go through the pmem driver (it does a dax mapping, followed by
essentially memcpy), and hence cannot handle badblocks.


> 								Honza
> -- 
> Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19 18:59                                 ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-19 18:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Williams, Dan J, darrick.wong, Vyacheslav.Dubeyko, linux-nvdimm,
	linux-block, slava, linux-fsdevel, lsf-pc

On 01/19, Jan Kara wrote:
> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> > > <darrick.wong@oracle.com> wrote:
> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > > Jan Kara <jack@suse.cz> writes:
> > > > > 
> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > > Your note on the online repair does raise another tangentially
> > > > > > > related
> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > > submission
> > > > > > > path will clear the error (if the hardware is able to remap
> > > > > > > the bad
> > > > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > > > even
> > > > > > > non-mmap operations - read() and write() will go through the
> > > > > > > dax paths
> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > > perform
> > > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > > filesystem
> > > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > > mount it
> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > > pretty
> > > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > > a better
> > > > > > > way.
> > > > > > 
> > > > > > Please remind me, what is the problem with DAX code doing
> > > > > > necessary work to
> > > > > > clear the error when it gets EIO from memcpy on write?
> > > > > 
> > > > > You won't get an MCE for a store;��only loads generate them.
> > > > > 
> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > > -o dax?
> > > > 
> > > > Not necessarily; XFS usually implements this by punching out the
> > > > range
> > > > and then reallocating it as unwritten blocks.
> > > > 
> > > 
> > > That does clear the error because the unwritten blocks are zeroed and
> > > errors cleared when they become allocated again.
> > 
> > Yes, the problem was that writes won't clear errors. zeroing through
> > either hole-punch, truncate, unlinking the file should all work
> > (assuming the hole-punch or truncate ranges wholly contain the
> > 'badblock' sector).
> 
> Let me repeat my question: You have mentioned that if we do IO through DAX,
> writes won't clear errors and we should fall back to normal block path to
> do write to clear the error. What does prevent us from directly clearing
> the error from DAX path?
> 
With DAX, all IO goes through DAX paths. There are two cases:
1. mmap and loads/stores: Obviously there is no kernel intervention
here, and no badblocks handling is possible.
2. read() or write() IO: In the absence of dax, this would go through
the bio submission path, through the pmem driver, and that would handle
error clearing. With DAX, this goes through dax_iomap_actor, which also
doesn't go through the pmem driver (it does a dax mapping, followed by
essentially memcpy), and hence cannot handle badblocks.


> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-19 18:59                                 ` Vishal Verma
@ 2017-01-19 19:03                                     ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2017-01-19 19:03 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Jan Kara, Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On 01/19, Jan Kara wrote:
>> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
>> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
>> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
>> > > <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
>> > > > > Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
>> > > > >
>> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> > > > > > > Your note on the online repair does raise another tangentially
>> > > > > > > related
>> > > > > > > topic. Currently, if there are badblocks, writes via the bio
>> > > > > > > submission
>> > > > > > > path will clear the error (if the hardware is able to remap
>> > > > > > > the bad
>> > > > > > > locations). However, if the filesystem is mounted eith DAX,
>> > > > > > > even
>> > > > > > > non-mmap operations - read() and write() will go through the
>> > > > > > > dax paths
>> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
>> > > > > > > perform
>> > > > > > > error-clearing in this case. So currently, if a dax mounted
>> > > > > > > filesystem
>> > > > > > > has badblocks, the only way to clear those badblocks is to
>> > > > > > > mount it
>> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
>> > > > > > > pretty
>> > > > > > > terrible user experience, and I'm hoping this can be solved in
>> > > > > > > a better
>> > > > > > > way.
>> > > > > >
>> > > > > > Please remind me, what is the problem with DAX code doing
>> > > > > > necessary work to
>> > > > > > clear the error when it gets EIO from memcpy on write?
>> > > > >
>> > > > > You won't get an MCE for a store;  only loads generate them.
>> > > > >
>> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
>> > > > > -o dax?
>> > > >
>> > > > Not necessarily; XFS usually implements this by punching out the
>> > > > range
>> > > > and then reallocating it as unwritten blocks.
>> > > >
>> > >
>> > > That does clear the error because the unwritten blocks are zeroed and
>> > > errors cleared when they become allocated again.
>> >
>> > Yes, the problem was that writes won't clear errors. zeroing through
>> > either hole-punch, truncate, unlinking the file should all work
>> > (assuming the hole-punch or truncate ranges wholly contain the
>> > 'badblock' sector).
>>
>> Let me repeat my question: You have mentioned that if we do IO through DAX,
>> writes won't clear errors and we should fall back to normal block path to
>> do write to clear the error. What does prevent us from directly clearing
>> the error from DAX path?
>>
> With DAX, all IO goes through DAX paths. There are two cases:
> 1. mmap and loads/stores: Obviously there is no kernel intervention
> here, and no badblocks handling is possible.
> 2. read() or write() IO: In the absence of dax, this would go through
> the bio submission path, through the pmem driver, and that would handle
> error clearing. With DAX, this goes through dax_iomap_actor, which also
> doesn't go through the pmem driver (it does a dax mapping, followed by
> essentially memcpy), and hence cannot handle badblocks.

Hmm, that may no longer be true after my changes to push dax flushing
to the driver. I.e. we could have a copy_from_iter() implementation
that attempts to clear errors... I'll get that series out and we can
discuss there.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19 19:03                                     ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2017-01-19 19:03 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Jan Kara, darrick.wong, Vyacheslav.Dubeyko, linux-nvdimm,
	linux-block, slava, linux-fsdevel, lsf-pc

On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> On 01/19, Jan Kara wrote:
>> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
>> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
>> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
>> > > <darrick.wong@oracle.com> wrote:
>> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
>> > > > > Jan Kara <jack@suse.cz> writes:
>> > > > >
>> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> > > > > > > Your note on the online repair does raise another tangentially
>> > > > > > > related
>> > > > > > > topic. Currently, if there are badblocks, writes via the bio
>> > > > > > > submission
>> > > > > > > path will clear the error (if the hardware is able to remap
>> > > > > > > the bad
>> > > > > > > locations). However, if the filesystem is mounted eith DAX,
>> > > > > > > even
>> > > > > > > non-mmap operations - read() and write() will go through the
>> > > > > > > dax paths
>> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
>> > > > > > > perform
>> > > > > > > error-clearing in this case. So currently, if a dax mounted
>> > > > > > > filesystem
>> > > > > > > has badblocks, the only way to clear those badblocks is to
>> > > > > > > mount it
>> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
>> > > > > > > pretty
>> > > > > > > terrible user experience, and I'm hoping this can be solved in
>> > > > > > > a better
>> > > > > > > way.
>> > > > > >
>> > > > > > Please remind me, what is the problem with DAX code doing
>> > > > > > necessary work to
>> > > > > > clear the error when it gets EIO from memcpy on write?
>> > > > >
>> > > > > You won't get an MCE for a store;  only loads generate them.
>> > > > >
>> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
>> > > > > -o dax?
>> > > >
>> > > > Not necessarily; XFS usually implements this by punching out the
>> > > > range
>> > > > and then reallocating it as unwritten blocks.
>> > > >
>> > >
>> > > That does clear the error because the unwritten blocks are zeroed and
>> > > errors cleared when they become allocated again.
>> >
>> > Yes, the problem was that writes won't clear errors. zeroing through
>> > either hole-punch, truncate, unlinking the file should all work
>> > (assuming the hole-punch or truncate ranges wholly contain the
>> > 'badblock' sector).
>>
>> Let me repeat my question: You have mentioned that if we do IO through DAX,
>> writes won't clear errors and we should fall back to normal block path to
>> do write to clear the error. What does prevent us from directly clearing
>> the error from DAX path?
>>
> With DAX, all IO goes through DAX paths. There are two cases:
> 1. mmap and loads/stores: Obviously there is no kernel intervention
> here, and no badblocks handling is possible.
> 2. read() or write() IO: In the absence of dax, this would go through
> the bio submission path, through the pmem driver, and that would handle
> error clearing. With DAX, this goes through dax_iomap_actor, which also
> doesn't go through the pmem driver (it does a dax mapping, followed by
> essentially memcpy), and hence cannot handle badblocks.

Hmm, that may no longer be true after my changes to push dax flushing
to the driver. I.e. we could have a copy_from_iter() implementation
that attempts to clear errors... I'll get that series out and we can
discuss there.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-19  2:56                 ` Slava Dubeyko
@ 2017-01-19 19:33                   ` Jeff Moyer
  -1 siblings, 0 replies; 89+ messages in thread
From: Jeff Moyer @ 2017-01-19 19:33 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Hi, Slava,

Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com> writes:

>>The data is lost, that's why you're getting an ECC.  It's tantamount
>>to -EIO for a disk block access.
>
> I see the three possible cases here:
> (1) bad block has been discovered (no remap, no recovering) -> data is
>> lost; -EIO for a disk block access, block is always bad;

This is, of course, a possiblity.  In that case, attempts to clear the
error will not succeed.

> (2) bad block has been discovered and remapped -> data is lost; -EIO
> for a disk block access.

Right, and the error is cleared when new data is provided (i.e. through
a write system call or fallocate).

> (3) bad block has been discovered, remapped and recovered -> no data is lost.

This is transparent to the OS and the application.

>>> Let's imagine that the affected address range will equal to 64 bytes. 
>>> It sounds for me that for the case of block device it will affect the 
>>> whole logical block (4 KB).
>>
>> 512 bytes, and yes, that's the granularity at which we track errors
>> in the block layer, so that's the minimum amount of data you lose.
>
> I think it depends what granularity hardware supports. It could be 512
> bytes, 4 KB, maybe greater.

Of course, though I expect the ECC protection in the NVDIMMs to cover a
range much smaller than a page.

>>> The situation is more critical for the case of DAX approach. Correct 
>>> me if I wrong but my understanding is the goal of DAX is to provide 
>>> the direct access to file's memory pages with minimal file system 
>>> overhead. So, it looks like that raising bad block issue on file 
>>> system level will affect a user-space application. Because, finally, 
>>> user-space application will need to process such trouble (bad block 
>>> issue). It sounds for me as really weird situation. What can protect a 
>>> user-space application from encountering the issue with partially 
>>> incorrect memory page?
>>
>> Applications need to deal with -EIO today.  This is the same sort of thing.
>> If an application trips over a bad block during a load from persistent memory,
>> they will get a signal, and they can either handle it or not.
>>
>> Have a read through this specification and see if it clears anything up for you:
>>  http://www.snia.org/tech_activities/standards/curr_standards/npm
>
> Thank you for sharing this. So, if a user-space application follows to the
> NVM Programming Model then it will be able to survive by means of catching
> and processing the exceptions. But these applications have to be implemented yet.
> Also such applications need in special technique(s) of recovering. It sounds
> that legacy user-space applications are unable to survive for the NVM.PM.FILE mode
> in the case of load/store operation's failure.

By legacy, I assume you mean those applications which mmap file data and
use msync.  Those applications already have to deal with SIGBUS today
when a disk block is bad.  There is no change in behavior.

If you meant legacy applications that use read/write, they also should
see no change in behavior.  Bad blocks are tracked in the block layer,
and any attempt to read from a bad area of memory will get -EIO.

Cheers,
Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19 19:33                   ` Jeff Moyer
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff Moyer @ 2017-01-19 19:33 UTC (permalink / raw)
  To: Slava Dubeyko
  Cc: Jan Kara, linux-nvdimm@lists.01.org, linux-block,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Hi, Slava,

Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com> writes:

>>The data is lost, that's why you're getting an ECC.  It's tantamount
>>to -EIO for a disk block access.
>
> I see the three possible cases here:
> (1) bad block has been discovered (no remap, no recovering) -> data is
>> lost; -EIO for a disk block access, block is always bad;

This is, of course, a possiblity.  In that case, attempts to clear the
error will not succeed.

> (2) bad block has been discovered and remapped -> data is lost; -EIO
> for a disk block access.

Right, and the error is cleared when new data is provided (i.e. through
a write system call or fallocate).

> (3) bad block has been discovered, remapped and recovered -> no data is lost.

This is transparent to the OS and the application.

>>> Let's imagine that the affected address range will equal to 64 bytes. 
>>> It sounds for me that for the case of block device it will affect the 
>>> whole logical block (4 KB).
>>
>> 512 bytes, and yes, that's the granularity at which we track errors
>> in the block layer, so that's the minimum amount of data you lose.
>
> I think it depends what granularity hardware supports. It could be 512
> bytes, 4 KB, maybe greater.

Of course, though I expect the ECC protection in the NVDIMMs to cover a
range much smaller than a page.

>>> The situation is more critical for the case of DAX approach. Correct 
>>> me if I wrong but my understanding is the goal of DAX is to provide 
>>> the direct access to file's memory pages with minimal file system 
>>> overhead. So, it looks like that raising bad block issue on file 
>>> system level will affect a user-space application. Because, finally, 
>>> user-space application will need to process such trouble (bad block 
>>> issue). It sounds for me as really weird situation. What can protect a 
>>> user-space application from encountering the issue with partially 
>>> incorrect memory page?
>>
>> Applications need to deal with -EIO today.  This is the same sort of thing.
>> If an application trips over a bad block during a load from persistent memory,
>> they will get a signal, and they can either handle it or not.
>>
>> Have a read through this specification and see if it clears anything up for you:
>>  http://www.snia.org/tech_activities/standards/curr_standards/npm
>
> Thank you for sharing this. So, if a user-space application follows to the
> NVM Programming Model then it will be able to survive by means of catching
> and processing the exceptions. But these applications have to be implemented yet.
> Also such applications need in special technique(s) of recovering. It sounds
> that legacy user-space applications are unable to survive for the NVM.PM.FILE mode
> in the case of load/store operation's failure.

By legacy, I assume you mean those applications which mmap file data and
use msync.  Those applications already have to deal with SIGBUS today
when a disk block is bad.  There is no change in behavior.

If you meant legacy applications that use read/write, they also should
see no change in behavior.  Bad blocks are tracked in the block layer,
and any attempt to read from a bad area of memory will get -EIO.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  9:38                 ` Jan Kara
@ 2017-01-19 21:17                   ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-19 21:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Linux FS Devel, Viacheslav Dubeyko, Andiry Xu,
	lsf-pc

On 01/18, Jan Kara wrote:
> On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > I do mean that in the filesystem, for every IO, the badblocks will be
> > checked. Currently, the pmem driver does this, and the hope is that the
> > filesystem can do a better job at it. The driver unconditionally checks
> > every IO for badblocks on the whole device. Depending on how the
> > badblocks are represented in the filesystem, we might be able to quickly
> > tell if a file/range has existing badblocks, and error out the IO
> > accordingly.
> > 
> > At mount the the fs would read the existing badblocks on the block
> > device, and build its own representation of them. Then during normal
> > use, if the underlying badblocks change, the fs would get a notification
> > that would allow it to also update its own representation.
> 
> So I believe we have to distinguish three cases so that we are on the same
> page.
> 
> 1) PMEM is exposed only via a block interface for legacy filesystems to
> use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
> Looking from outside, the IO either returns with EIO or succeeds. As a
> result you cannot ever ger rid of bad blocks handling in the NVDIMM driver.

Correct.

> 
> 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> mostly interested in. We could possibly do something more efficient than
> what NVDIMM driver does however the complexity would be relatively high and
> frankly I'm far from convinced this is really worth it. If there are so
> many badblocks this would matter, the HW has IMHO bigger problems than
> performance.

Correct, and Dave was of the opinion that once at least XFS has reverse
mapping support (which it does now), adding badblocks information to
that should not be a hard lift, and should be a better solution. I
suppose should try to benchmark how much of a penalty the current badblock
checking in the NVVDIMM driver imposes. The penalty is not because there
may be a large number of badblocks, but just due to the fact that we
have to do this check for every IO, in fact, every 'bvec' in a bio.

> 
> 3) PMEM filesystem - there things are even more difficult as was already
> noted elsewhere in the thread. But for now I'd like to leave those aside
> not to complicate things too much.

Agreed that that merits consideration and a whole discussion  by itself,
based on the points Audiry raised.

> 
> Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> if the platform can recover from MCE, we can just always access persistent
> memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> seems to already happen so we just need to make sure all places handle
> returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> No need for bad blocks list at all, no slow down unless we hit a bad cell
> and in that case who cares about performance when the data is gone...

Even when we have MCE recovery, we cannot do away with the badblocks
list:
1. My understanding is that the hardware's ability to do MCE recovery is
limited/best-effort, and is not guaranteed. There can be circumstances
that cause a "Processor Context Corrupt" state, which is unrecoverable.
2. We still need to maintain a badblocks list so that we know what
blocks need to be cleared (via the ACPI method) on writes.

> 
> For platforms that cannot recover from MCE - just buy better hardware ;).
> Seriously, I have doubts people can seriously use a machine that will
> unavoidably randomly reboot (as there is always a risk you hit error that
> has not been uncovered by background scrub). But maybe for big cloud providers
> the cost savings may offset for the inconvenience, I don't know. But still
> for that case a bad blocks handling in NVDIMM code like we do now looks
> good enough?

The current handling is good enough for those systems, yes.

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-19 21:17                   ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-19 21:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andiry Xu, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On 01/18, Jan Kara wrote:
> On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > I do mean that in the filesystem, for every IO, the badblocks will be
> > checked. Currently, the pmem driver does this, and the hope is that the
> > filesystem can do a better job at it. The driver unconditionally checks
> > every IO for badblocks on the whole device. Depending on how the
> > badblocks are represented in the filesystem, we might be able to quickly
> > tell if a file/range has existing badblocks, and error out the IO
> > accordingly.
> > 
> > At mount the the fs would read the existing badblocks on the block
> > device, and build its own representation of them. Then during normal
> > use, if the underlying badblocks change, the fs would get a notification
> > that would allow it to also update its own representation.
> 
> So I believe we have to distinguish three cases so that we are on the same
> page.
> 
> 1) PMEM is exposed only via a block interface for legacy filesystems to
> use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
> Looking from outside, the IO either returns with EIO or succeeds. As a
> result you cannot ever ger rid of bad blocks handling in the NVDIMM driver.

Correct.

> 
> 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> mostly interested in. We could possibly do something more efficient than
> what NVDIMM driver does however the complexity would be relatively high and
> frankly I'm far from convinced this is really worth it. If there are so
> many badblocks this would matter, the HW has IMHO bigger problems than
> performance.

Correct, and Dave was of the opinion that once at least XFS has reverse
mapping support (which it does now), adding badblocks information to
that should not be a hard lift, and should be a better solution. I
suppose should try to benchmark how much of a penalty the current badblock
checking in the NVVDIMM driver imposes. The penalty is not because there
may be a large number of badblocks, but just due to the fact that we
have to do this check for every IO, in fact, every 'bvec' in a bio.

> 
> 3) PMEM filesystem - there things are even more difficult as was already
> noted elsewhere in the thread. But for now I'd like to leave those aside
> not to complicate things too much.

Agreed that that merits consideration and a whole discussion  by itself,
based on the points Audiry raised.

> 
> Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> if the platform can recover from MCE, we can just always access persistent
> memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> seems to already happen so we just need to make sure all places handle
> returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> No need for bad blocks list at all, no slow down unless we hit a bad cell
> and in that case who cares about performance when the data is gone...

Even when we have MCE recovery, we cannot do away with the badblocks
list:
1. My understanding is that the hardware's ability to do MCE recovery is
limited/best-effort, and is not guaranteed. There can be circumstances
that cause a "Processor Context Corrupt" state, which is unrecoverable.
2. We still need to maintain a badblocks list so that we know what
blocks need to be cleared (via the ACPI method) on writes.

> 
> For platforms that cannot recover from MCE - just buy better hardware ;).
> Seriously, I have doubts people can seriously use a machine that will
> unavoidably randomly reboot (as there is always a risk you hit error that
> has not been uncovered by background scrub). But maybe for big cloud providers
> the cost savings may offset for the inconvenience, I don't know. But still
> for that case a bad blocks handling in NVDIMM code like we do now looks
> good enough?

The current handling is good enough for those systems, yes.

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  1:58                     ` Andiry Xu
  (?)
@ 2017-01-20  0:32                         ` Verma, Vishal L
  -1 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:32 UTC (permalink / raw)
  To: andiry-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, 2017-01-17 at 17:58 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@intel.co
> m> wrote:
> > On 01/17, Andiry Xu wrote:
> > 
> > <snip>
> > 
> > > > > 
> > > > > The pmem_do_bvec() read logic is like this:
> > > > > 
> > > > > pmem_do_bvec()
> > > > >     if (is_bad_pmem())
> > > > >         return -EIO;
> > > > >     else
> > > > >         memcpy_from_pmem();
> > > > > 
> > > > > Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this
> > > > > imply
> > > > > that even if a block is not in the badblock list, it still can
> > > > > be bad
> > > > > and causes MCE? Does the badblock list get changed during file
> > > > > system
> > > > > running? If that is the case, should the file system get a
> > > > > notification when it gets changed? If a block is good when I
> > > > > first
> > > > > read it, can I still trust it to be good for the second
> > > > > access?
> > > > 
> > > > Yes, if a block is not in the badblocks list, it can still cause
> > > > an
> > > > MCE. This is the latent error case I described above. For a
> > > > simple read()
> > > > via the pmem driver, this will get handled by memcpy_mcsafe. For
> > > > mmap,
> > > > an MCE is inevitable.
> > > > 
> > > > Yes the badblocks list may change while a filesystem is running.
> > > > The RFC
> > > > patches[1] I linked to add a notification for the filesystem
> > > > when this
> > > > happens.
> > > > 
> > > 
> > > This is really bad and it makes file system implementation much
> > > more
> > > complicated. And badblock notification does not help very much,
> > > because any block can be bad potentially, no matter it is in
> > > badblock
> > > list or not. And file system has to perform checking for every
> > > read,
> > > using memcpy_mcsafe. This is disaster for file system like NOVA,
> > > which
> > > uses pointer de-reference to access data structures on pmem. Now
> > > if I
> > > want to read a field in an inode on pmem, I have to copy it to
> > > DRAM
> > > first and make sure memcpy_mcsafe() does not report anything
> > > wrong.
> > 
> > You have a good point, and I don't know if I have an answer for
> > this..
> > Assuming a system with MCE recovery, maybe NOVA can add a mce
> > handler
> > similar to nfit_handle_mce(), and handle errors as they happen, but
> > I'm
> > being very hand-wavey here and don't know how much/how well that
> > might
> > work..
> > 
> > > 
> > > > No, if the media, for some reason, 'dvelops' a bad cell, a
> > > > second
> > > > consecutive read does have a chance of being bad. Once a
> > > > location has
> > > > been marked as bad, it will stay bad till the ACPI clear error
> > > > 'DSM' has
> > > > been called to mark it as clean.
> > > > 
> > > 
> > > I wonder what happens to write in this case? If a block is bad but
> > > not
> > > reported in badblock list. Now I write to it without reading
> > > first. Do
> > > I clear the poison with the write? Or still require a ACPI DSM?
> > 
> > With writes, my understanding is there is still a possibility that
> > an
> > internal read-modify-write can happen, and cause a MCE (this is the
> > same
> > as writing to a bad DRAM cell, which can also cause an MCE). You
> > can't
> > really use the ACPI DSM preemptively because you don't know whether
> > the
> > location was bad. The error flow will be something like write causes
> > the
> > MCE, a badblock gets added (either through the mce handler or after
> > the
> > next reboot), and the recovery path is now the same as a regular
> > badblock.
> > 
> 
> This is different from my understanding. Right now write_pmem() in
> pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
> clears poison and writes to pmem again. Seems to me writing to bad
> blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

You are right, writes don't use memcpy_mcsafe, and will not directly
cause an MCE. However a write can cause an asynchronous 'CMCI' -
corrected machine check interrupt, but this is not critical, and wont be
a memory error as the core didn't consume poison. memcpy_mcsafe cannot
protect against this because the write is 'posted' and the CMCI is not
synchronous. Note that this is only in the latent error or memmap-store
case.

> 
> Thanks,
> Andiry
> 
> > > 
> > > > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> > > > 
> > > 
> > > Thank you for the patchset. I will look into it.
> > > 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  0:32                         ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:32 UTC (permalink / raw)
  To: andiry
  Cc: darrick.wong, Vyacheslav.Dubeyko, linux-block, slava, lsf-pc,
	linux-nvdimm, linux-fsdevel

T24gVHVlLCAyMDE3LTAxLTE3IGF0IDE3OjU4IC0wODAwLCBBbmRpcnkgWHUgd3JvdGU6DQo+IE9u
IFR1ZSwgSmFuIDE3LCAyMDE3IGF0IDM6NTEgUE0sIFZpc2hhbCBWZXJtYSA8dmlzaGFsLmwudmVy
bWFAaW50ZWwuY28NCj4gbT4gd3JvdGU6DQo+ID4gT24gMDEvMTcsIEFuZGlyeSBYdSB3cm90ZToN
Cj4gPiANCj4gPiA8c25pcD4NCj4gPiANCj4gPiA+ID4gPiANCj4gPiA+ID4gPiBUaGUgcG1lbV9k
b19idmVjKCkgcmVhZCBsb2dpYyBpcyBsaWtlIHRoaXM6DQo+ID4gPiA+ID4gDQo+ID4gPiA+ID4g
cG1lbV9kb19idmVjKCkNCj4gPiA+ID4gPiDCoMKgwqDCoGlmIChpc19iYWRfcG1lbSgpKQ0KPiA+
ID4gPiA+IMKgwqDCoMKgwqDCoMKgwqByZXR1cm4gLUVJTzsNCj4gPiA+ID4gPiDCoMKgwqDCoGVs
c2UNCj4gPiA+ID4gPiDCoMKgwqDCoMKgwqDCoMKgbWVtY3B5X2Zyb21fcG1lbSgpOw0KPiA+ID4g
PiA+IA0KPiA+ID4gPiA+IE5vdGUgbWVtY3B5X2Zyb21fcG1lbSgpIGlzIGNhbGxpbmcgbWVtY3B5
X21jc2FmZSgpLiBEb2VzIHRoaXMNCj4gPiA+ID4gPiBpbXBseQ0KPiA+ID4gPiA+IHRoYXQgZXZl
biBpZiBhIGJsb2NrIGlzIG5vdCBpbiB0aGUgYmFkYmxvY2sgbGlzdCwgaXQgc3RpbGwgY2FuDQo+
ID4gPiA+ID4gYmUgYmFkDQo+ID4gPiA+ID4gYW5kIGNhdXNlcyBNQ0U/IERvZXMgdGhlIGJhZGJs
b2NrIGxpc3QgZ2V0IGNoYW5nZWQgZHVyaW5nIGZpbGUNCj4gPiA+ID4gPiBzeXN0ZW0NCj4gPiA+
ID4gPiBydW5uaW5nPyBJZiB0aGF0IGlzIHRoZSBjYXNlLCBzaG91bGQgdGhlIGZpbGUgc3lzdGVt
IGdldCBhDQo+ID4gPiA+ID4gbm90aWZpY2F0aW9uIHdoZW4gaXQgZ2V0cyBjaGFuZ2VkPyBJZiBh
IGJsb2NrIGlzIGdvb2Qgd2hlbiBJDQo+ID4gPiA+ID4gZmlyc3QNCj4gPiA+ID4gPiByZWFkIGl0
LCBjYW4gSSBzdGlsbCB0cnVzdCBpdCB0byBiZSBnb29kIGZvciB0aGUgc2Vjb25kDQo+ID4gPiA+
ID4gYWNjZXNzPw0KPiA+ID4gPiANCj4gPiA+ID4gWWVzLCBpZiBhIGJsb2NrIGlzIG5vdCBpbiB0
aGUgYmFkYmxvY2tzIGxpc3QsIGl0IGNhbiBzdGlsbCBjYXVzZQ0KPiA+ID4gPiBhbg0KPiA+ID4g
PiBNQ0UuIFRoaXMgaXMgdGhlIGxhdGVudCBlcnJvciBjYXNlIEkgZGVzY3JpYmVkIGFib3ZlLiBG
b3IgYQ0KPiA+ID4gPiBzaW1wbGUgcmVhZCgpDQo+ID4gPiA+IHZpYSB0aGUgcG1lbSBkcml2ZXIs
IHRoaXMgd2lsbCBnZXQgaGFuZGxlZCBieSBtZW1jcHlfbWNzYWZlLiBGb3INCj4gPiA+ID4gbW1h
cCwNCj4gPiA+ID4gYW4gTUNFIGlzIGluZXZpdGFibGUuDQo+ID4gPiA+IA0KPiA+ID4gPiBZZXMg
dGhlIGJhZGJsb2NrcyBsaXN0IG1heSBjaGFuZ2Ugd2hpbGUgYSBmaWxlc3lzdGVtIGlzIHJ1bm5p
bmcuDQo+ID4gPiA+IFRoZSBSRkMNCj4gPiA+ID4gcGF0Y2hlc1sxXSBJIGxpbmtlZCB0byBhZGQg
YSBub3RpZmljYXRpb24gZm9yIHRoZSBmaWxlc3lzdGVtDQo+ID4gPiA+IHdoZW4gdGhpcw0KPiA+
ID4gPiBoYXBwZW5zLg0KPiA+ID4gPiANCj4gPiA+IA0KPiA+ID4gVGhpcyBpcyByZWFsbHkgYmFk
IGFuZCBpdCBtYWtlcyBmaWxlIHN5c3RlbSBpbXBsZW1lbnRhdGlvbiBtdWNoDQo+ID4gPiBtb3Jl
DQo+ID4gPiBjb21wbGljYXRlZC4gQW5kIGJhZGJsb2NrIG5vdGlmaWNhdGlvbiBkb2VzIG5vdCBo
ZWxwIHZlcnkgbXVjaCwNCj4gPiA+IGJlY2F1c2UgYW55IGJsb2NrIGNhbiBiZSBiYWQgcG90ZW50
aWFsbHksIG5vIG1hdHRlciBpdCBpcyBpbg0KPiA+ID4gYmFkYmxvY2sNCj4gPiA+IGxpc3Qgb3Ig
bm90LiBBbmQgZmlsZSBzeXN0ZW0gaGFzIHRvIHBlcmZvcm0gY2hlY2tpbmcgZm9yIGV2ZXJ5DQo+
ID4gPiByZWFkLA0KPiA+ID4gdXNpbmcgbWVtY3B5X21jc2FmZS4gVGhpcyBpcyBkaXNhc3RlciBm
b3IgZmlsZSBzeXN0ZW0gbGlrZSBOT1ZBLA0KPiA+ID4gd2hpY2gNCj4gPiA+IHVzZXMgcG9pbnRl
ciBkZS1yZWZlcmVuY2UgdG8gYWNjZXNzIGRhdGEgc3RydWN0dXJlcyBvbiBwbWVtLiBOb3cNCj4g
PiA+IGlmIEkNCj4gPiA+IHdhbnQgdG8gcmVhZCBhIGZpZWxkIGluIGFuIGlub2RlIG9uIHBtZW0s
IEkgaGF2ZSB0byBjb3B5IGl0IHRvDQo+ID4gPiBEUkFNDQo+ID4gPiBmaXJzdCBhbmQgbWFrZSBz
dXJlIG1lbWNweV9tY3NhZmUoKSBkb2VzIG5vdCByZXBvcnQgYW55dGhpbmcNCj4gPiA+IHdyb25n
Lg0KPiA+IA0KPiA+IFlvdSBoYXZlIGEgZ29vZCBwb2ludCwgYW5kIEkgZG9uJ3Qga25vdyBpZiBJ
IGhhdmUgYW4gYW5zd2VyIGZvcg0KPiA+IHRoaXMuLg0KPiA+IEFzc3VtaW5nIGEgc3lzdGVtIHdp
dGggTUNFIHJlY292ZXJ5LCBtYXliZSBOT1ZBIGNhbiBhZGQgYSBtY2UNCj4gPiBoYW5kbGVyDQo+
ID4gc2ltaWxhciB0byBuZml0X2hhbmRsZV9tY2UoKSwgYW5kIGhhbmRsZSBlcnJvcnMgYXMgdGhl
eSBoYXBwZW4sIGJ1dA0KPiA+IEknbQ0KPiA+IGJlaW5nIHZlcnkgaGFuZC13YXZleSBoZXJlIGFu
ZCBkb24ndCBrbm93IGhvdyBtdWNoL2hvdyB3ZWxsIHRoYXQNCj4gPiBtaWdodA0KPiA+IHdvcmsu
Lg0KPiA+IA0KPiA+ID4gDQo+ID4gPiA+IE5vLCBpZiB0aGUgbWVkaWEsIGZvciBzb21lIHJlYXNv
biwgJ2R2ZWxvcHMnIGEgYmFkIGNlbGwsIGENCj4gPiA+ID4gc2Vjb25kDQo+ID4gPiA+IGNvbnNl
Y3V0aXZlIHJlYWQgZG9lcyBoYXZlIGEgY2hhbmNlIG9mIGJlaW5nIGJhZC4gT25jZSBhDQo+ID4g
PiA+IGxvY2F0aW9uIGhhcw0KPiA+ID4gPiBiZWVuIG1hcmtlZCBhcyBiYWQsIGl0IHdpbGwgc3Rh
eSBiYWQgdGlsbCB0aGUgQUNQSSBjbGVhciBlcnJvcg0KPiA+ID4gPiAnRFNNJyBoYXMNCj4gPiA+
ID4gYmVlbiBjYWxsZWQgdG8gbWFyayBpdCBhcyBjbGVhbi4NCj4gPiA+ID4gDQo+ID4gPiANCj4g
PiA+IEkgd29uZGVyIHdoYXQgaGFwcGVucyB0byB3cml0ZSBpbiB0aGlzIGNhc2U/IElmIGEgYmxv
Y2sgaXMgYmFkIGJ1dA0KPiA+ID4gbm90DQo+ID4gPiByZXBvcnRlZCBpbiBiYWRibG9jayBsaXN0
LiBOb3cgSSB3cml0ZSB0byBpdCB3aXRob3V0IHJlYWRpbmcNCj4gPiA+IGZpcnN0LiBEbw0KPiA+
ID4gSSBjbGVhciB0aGUgcG9pc29uIHdpdGggdGhlIHdyaXRlPyBPciBzdGlsbCByZXF1aXJlIGEg
QUNQSSBEU00/DQo+ID4gDQo+ID4gV2l0aCB3cml0ZXMsIG15IHVuZGVyc3RhbmRpbmcgaXMgdGhl
cmUgaXMgc3RpbGwgYSBwb3NzaWJpbGl0eSB0aGF0DQo+ID4gYW4NCj4gPiBpbnRlcm5hbCByZWFk
LW1vZGlmeS13cml0ZSBjYW4gaGFwcGVuLCBhbmQgY2F1c2UgYSBNQ0UgKHRoaXMgaXMgdGhlDQo+
ID4gc2FtZQ0KPiA+IGFzIHdyaXRpbmcgdG8gYSBiYWQgRFJBTSBjZWxsLCB3aGljaCBjYW4gYWxz
byBjYXVzZSBhbiBNQ0UpLiBZb3UNCj4gPiBjYW4ndA0KPiA+IHJlYWxseSB1c2UgdGhlIEFDUEkg
RFNNIHByZWVtcHRpdmVseSBiZWNhdXNlIHlvdSBkb24ndCBrbm93IHdoZXRoZXINCj4gPiB0aGUN
Cj4gPiBsb2NhdGlvbiB3YXMgYmFkLiBUaGUgZXJyb3IgZmxvdyB3aWxsIGJlIHNvbWV0aGluZyBs
aWtlIHdyaXRlIGNhdXNlcw0KPiA+IHRoZQ0KPiA+IE1DRSwgYSBiYWRibG9jayBnZXRzIGFkZGVk
IChlaXRoZXIgdGhyb3VnaCB0aGUgbWNlIGhhbmRsZXIgb3IgYWZ0ZXINCj4gPiB0aGUNCj4gPiBu
ZXh0IHJlYm9vdCksIGFuZCB0aGUgcmVjb3ZlcnkgcGF0aCBpcyBub3cgdGhlIHNhbWUgYXMgYSBy
ZWd1bGFyDQo+ID4gYmFkYmxvY2suDQo+ID4gDQo+IA0KPiBUaGlzIGlzIGRpZmZlcmVudCBmcm9t
IG15IHVuZGVyc3RhbmRpbmcuIFJpZ2h0IG5vdyB3cml0ZV9wbWVtKCkgaW4NCj4gcG1lbV9kb19i
dmVjKCkgZG9lcyBub3QgdXNlIG1lbWNweV9tY3NhZmUoKS4gSWYgdGhlIGJsb2NrIGlzIGJhZCBp
dA0KPiBjbGVhcnMgcG9pc29uIGFuZCB3cml0ZXMgdG8gcG1lbSBhZ2Fpbi4gU2VlbXMgdG8gbWUg
d3JpdGluZyB0byBiYWQNCj4gYmxvY2tzIGRvZXMgbm90IGNhdXNlIE1DRS4gRG8gd2UgbmVlZCBt
ZW1jcHlfbWNzYWZlIGZvciBwbWVtIHN0b3Jlcz8NCg0KWW91IGFyZSByaWdodCwgd3JpdGVzIGRv
bid0IHVzZSBtZW1jcHlfbWNzYWZlLCBhbmQgd2lsbCBub3QgZGlyZWN0bHkNCmNhdXNlIGFuIE1D
RS4gSG93ZXZlciBhIHdyaXRlIGNhbiBjYXVzZSBhbiBhc3luY2hyb25vdXMgJ0NNQ0knIC0NCmNv
cnJlY3RlZCBtYWNoaW5lIGNoZWNrIGludGVycnVwdCwgYnV0IHRoaXMgaXMgbm90IGNyaXRpY2Fs
LCBhbmQgd29udCBiZQ0KYSBtZW1vcnkgZXJyb3IgYXMgdGhlIGNvcmUgZGlkbid0IGNvbnN1bWUg
cG9pc29uLiBtZW1jcHlfbWNzYWZlIGNhbm5vdA0KcHJvdGVjdCBhZ2FpbnN0IHRoaXMgYmVjYXVz
ZSB0aGUgd3JpdGUgaXMgJ3Bvc3RlZCcgYW5kIHRoZSBDTUNJIGlzIG5vdA0Kc3luY2hyb25vdXMu
IE5vdGUgdGhhdCB0aGlzIGlzIG9ubHkgaW4gdGhlIGxhdGVudCBlcnJvciBvciBtZW1tYXAtc3Rv
cmUNCmNhc2UuDQoNCj4gDQo+IFRoYW5rcywNCj4gQW5kaXJ5DQo+IA0KPiA+ID4gDQo+ID4gPiA+
IFsxXTogaHR0cDovL3d3dy5saW51eC5zZ2kuY29tL2FyY2hpdmVzL3hmcy8yMDE2LTA2L21zZzAw
Mjk5Lmh0bWwNCj4gPiA+ID4gDQo+ID4gPiANCj4gPiA+IFRoYW5rIHlvdSBmb3IgdGhlIHBhdGNo
c2V0LiBJIHdpbGwgbG9vayBpbnRvIGl0Lg0KPiA+ID4g

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  0:32                         ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:32 UTC (permalink / raw)
  To: andiry
  Cc: darrick.wong, Vyacheslav.Dubeyko, linux-block, slava, lsf-pc,
	linux-nvdimm, linux-fsdevel

On Tue, 2017-01-17 at 17:58 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma <vishal.l.verma@intel.co
> m> wrote:
> > On 01/17, Andiry Xu wrote:
> > 
> > <snip>
> > 
> > > > > 
> > > > > The pmem_do_bvec() read logic is like this:
> > > > > 
> > > > > pmem_do_bvec()
> > > > >     if (is_bad_pmem())
> > > > >         return -EIO;
> > > > >     else
> > > > >         memcpy_from_pmem();
> > > > > 
> > > > > Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this
> > > > > imply
> > > > > that even if a block is not in the badblock list, it still can
> > > > > be bad
> > > > > and causes MCE? Does the badblock list get changed during file
> > > > > system
> > > > > running? If that is the case, should the file system get a
> > > > > notification when it gets changed? If a block is good when I
> > > > > first
> > > > > read it, can I still trust it to be good for the second
> > > > > access?
> > > > 
> > > > Yes, if a block is not in the badblocks list, it can still cause
> > > > an
> > > > MCE. This is the latent error case I described above. For a
> > > > simple read()
> > > > via the pmem driver, this will get handled by memcpy_mcsafe. For
> > > > mmap,
> > > > an MCE is inevitable.
> > > > 
> > > > Yes the badblocks list may change while a filesystem is running.
> > > > The RFC
> > > > patches[1] I linked to add a notification for the filesystem
> > > > when this
> > > > happens.
> > > > 
> > > 
> > > This is really bad and it makes file system implementation much
> > > more
> > > complicated. And badblock notification does not help very much,
> > > because any block can be bad potentially, no matter it is in
> > > badblock
> > > list or not. And file system has to perform checking for every
> > > read,
> > > using memcpy_mcsafe. This is disaster for file system like NOVA,
> > > which
> > > uses pointer de-reference to access data structures on pmem. Now
> > > if I
> > > want to read a field in an inode on pmem, I have to copy it to
> > > DRAM
> > > first and make sure memcpy_mcsafe() does not report anything
> > > wrong.
> > 
> > You have a good point, and I don't know if I have an answer for
> > this..
> > Assuming a system with MCE recovery, maybe NOVA can add a mce
> > handler
> > similar to nfit_handle_mce(), and handle errors as they happen, but
> > I'm
> > being very hand-wavey here and don't know how much/how well that
> > might
> > work..
> > 
> > > 
> > > > No, if the media, for some reason, 'dvelops' a bad cell, a
> > > > second
> > > > consecutive read does have a chance of being bad. Once a
> > > > location has
> > > > been marked as bad, it will stay bad till the ACPI clear error
> > > > 'DSM' has
> > > > been called to mark it as clean.
> > > > 
> > > 
> > > I wonder what happens to write in this case? If a block is bad but
> > > not
> > > reported in badblock list. Now I write to it without reading
> > > first. Do
> > > I clear the poison with the write? Or still require a ACPI DSM?
> > 
> > With writes, my understanding is there is still a possibility that
> > an
> > internal read-modify-write can happen, and cause a MCE (this is the
> > same
> > as writing to a bad DRAM cell, which can also cause an MCE). You
> > can't
> > really use the ACPI DSM preemptively because you don't know whether
> > the
> > location was bad. The error flow will be something like write causes
> > the
> > MCE, a badblock gets added (either through the mce handler or after
> > the
> > next reboot), and the recovery path is now the same as a regular
> > badblock.
> > 
> 
> This is different from my understanding. Right now write_pmem() in
> pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
> clears poison and writes to pmem again. Seems to me writing to bad
> blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

You are right, writes don't use memcpy_mcsafe, and will not directly
cause an MCE. However a write can cause an asynchronous 'CMCI' -
corrected machine check interrupt, but this is not critical, and wont be
a memory error as the core didn't consume poison. memcpy_mcsafe cannot
protect against this because the write is 'posted' and the CMCI is not
synchronous. Note that this is only in the latent error or memmap-store
case.

> 
> Thanks,
> Andiry
> 
> > > 
> > > > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> > > > 
> > > 
> > > Thank you for the patchset. I will look into it.
> > > 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  3:08                   ` Lu Zhang
@ 2017-01-20  0:46                     ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-20  0:46 UTC (permalink / raw)
  To: Lu Zhang
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko, Andiry Xu, lsf-pc

On 01/17, Lu Zhang wrote:
> I'm curious about the fault model and corresponding hardware ECC mechanisms
> for NVDIMMs. In my understanding for memory accesses to trigger MCE, it
> means the memory controller finds a detectable but uncorrectable error
> (DUE). So if there is no hardware ECC support the media errors won't even
> be noticed, not to mention badblocks or machine checks.
> 
> Current hardware ECC support for DRAM usually employs (72, 64) single-bit
> error correction mechanism, and for advanced ECCs there are techniques like
> Chipkill or SDDC which can tolerate a single DRAM chip failure. What is the
> expected ECC mode for NVDIMMs, assuming that PCM or 3dXpoint based
> technology might have higher error rates?

I'm sure once NVDIMMs start becoming widely available, there will be
more information on how they do ECC..

> 
> If DUE does happen and is flagged to the file system via MCE (somehow...),
> and the fs finds that the error corrupts its allocated data page, or
> metadata, now if the fs wants to recover its data the intuition is that
> there needs to be a stronger error correction mechanism to correct the
> hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> helpful for the file system to understand how severe are the faults in
> badblocks, and develop its recovery methods.

Like mentioned before, this discussion is more about presentation of
errors in a known consumable format, rather than recovering from errors.
While recovering from errors is interesting, we already have layers
like RAID for that, and they are as applicable to NVDIMM backed storage
as they have been for disk/SSD based storage.

> 
> Regards,
> Lu
> 
> On Tue, Jan 17, 2017 at 6:01 PM, Andiry Xu <andiry@gmail.com> wrote:
> 
> > On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> > >> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com>
> > wrote:
> > >>> On 01/16, Darrick J. Wong wrote:
> > >>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > >>>>> On 01/14, Slava Dubeyko wrote:
> > >>>>>>
> > >>>>>> ---- Original Message ----
> > >>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in
> > filesystems
> > >>>>>> Sent: Jan 13, 2017 1:40 PM
> > >>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > >>>>>> To: lsf-pc@lists.linux-foundation.org
> > >>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
> > linux-fsdevel@vger.kernel.org
> > >>>>>>
> > >>>>>>> The current implementation of badblocks, where we consult the
> > >>>>>>> badblocks list for every IO in the block driver works, and is a
> > >>>>>>> last option failsafe, but from a user perspective, it isn't the
> > >>>>>>> easiest interface to work with.
> > >>>>>>
> > >>>>>> As I remember, FAT and HFS+ specifications contain description of
> > bad blocks
> > >>>>>> (physical sectors) table. I believe that this table was used for
> > the case of
> > >>>>>> floppy media. But, finally, this table becomes to be the completely
> > obsolete
> > >>>>>> artefact because mostly storage devices are reliably enough. Why do
> > you need
> > >>>>
> > >>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR
> > it
> > >>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
> > >>>> vestigial organ at this point.  XFS doesn't have anything to track bad
> > >>>> blocks currently....
> > >>>>
> > >>>>>> in exposing the bad blocks on the file system level?  Do you expect
> > that next
> > >>>>>> generation of NVM memory will be so unreliable that file system
> > needs to manage
> > >>>>>> bad blocks? What's about erasure coding schemes? Do file system
> > really need to suffer
> > >>>>>> from the bad block issue?
> > >>>>>>
> > >>>>>> Usually, we are using LBAs and it is the responsibility of storage
> > device to map
> > >>>>>> a bad physical block/page/sector into valid one. Do you mean that
> > we have
> > >>>>>> access to physical NVM memory address directly? But it looks like
> > that we can
> > >>>>>> have a "bad block" issue even we will access data into page cache's
> > memory
> > >>>>>> page (if we will use NVM memory for page cache, of course). So,
> > what do you
> > >>>>>> imply by "bad block" issue?
> > >>>>>
> > >>>>> We don't have direct physical access to the device's address space,
> > in
> > >>>>> the sense the device is still free to perform remapping of chunks of
> > NVM
> > >>>>> underneath us. The problem is that when a block or address range (as
> > >>>>> small as a cache line) goes bad, the device maintains a poison bit
> > for
> > >>>>> every affected cache line. Behind the scenes, it may have already
> > >>>>> remapped the range, but the cache line poison has to be kept so that
> > >>>>> there is a notification to the user/owner of the data that something
> > has
> > >>>>> been lost. Since NVM is byte addressable memory sitting on the memory
> > >>>>> bus, such a poisoned cache line results in memory errors and
> > SIGBUSes.
> > >>>>> Compared to tradational storage where an app will get nice and
> > friendly
> > >>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
> > >>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> > >>>>> locations, and short-circuit them with an EIO. If the driver doesn't
> > >>>>> catch these, the reads will turn into a memory bus access, and the
> > >>>>> poison will cause a SIGBUS.
> > >>>>
> > >>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> > >>>> look kind of like a traditional block device? :)
> > >>>
> > >>> Yes, the thing that makes pmem look like a block device :) --
> > >>> drivers/nvdimm/pmem.c
> > >>>
> > >>>>
> > >>>>> This effort is to try and make this badblock checking smarter - and
> > try
> > >>>>> and reduce the penalty on every IO to a smaller range, which only the
> > >>>>> filesystem can do.
> > >>>>
> > >>>> Though... now that XFS merged the reverse mapping support, I've been
> > >>>> wondering if there'll be a resubmission of the device errors callback?
> > >>>> It still would be useful to be able to inform the user that part of
> > >>>> their fs has gone bad, or, better yet, if the buffer is still in
> > memory
> > >>>> someplace else, just write it back out.
> > >>>>
> > >>>> Or I suppose if we had some kind of raid1 set up between memories we
> > >>>> could read one of the other copies and rewrite it into the failing
> > >>>> region immediately.
> > >>>
> > >>> Yes, that is kind of what I was hoping to accomplish via this
> > >>> discussion. How much would filesystems want to be involved in this sort
> > >>> of badblocks handling, if at all. I can refresh my patches that provide
> > >>> the fs notification, but that's the easy bit, and a starting point.
> > >>>
> > >>
> > >> I have some questions. Why moving badblock handling to file system
> > >> level avoid the checking phase? In file system level for each I/O I
> > >> still have to check the badblock list, right? Do you mean during mount
> > >> it can go through the pmem device and locates all the data structures
> > >> mangled by badblocks and handle them accordingly, so that during
> > >> normal running the badblocks will never be accessed? Or, if there is
> > >> replicataion/snapshot support, use a copy to recover the badblocks?
> > >
> > > With ext4 badblocks, the main outcome is that the bad blocks would be
> > > pemanently marked in the allocation bitmap as being used, and they would
> > > never be allocated to a file, so they should never be accessed unless
> > > doing a full device scan (which ext4 and e2fsck never do).  That would
> > > avoid the need to check every I/O against the bad blocks list, if the
> > > driver knows that the filesystem will handle this.
> > >
> >
> > Thank you for explanation. However this only works for free blocks,
> > right? What about allocated blocks, like file data and metadata?
> >
> > Thanks,
> > Andiry
> >
> > > The one caveat is that ext4 only allows 32-bit block numbers in the
> > > badblocks list, since this feature hasn't been used in a long time.
> > > This is good for up to 16TB filesystems, but if there was a demand to
> > > use this feature again it would be possible allow 64-bit block numbers.
> > >
> > > Cheers, Andreas
> > >
> > >
> > >
> > >
> > >
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm@lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  0:46                     ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-20  0:46 UTC (permalink / raw)
  To: Lu Zhang
  Cc: Andiry Xu, Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On 01/17, Lu Zhang wrote:
> I'm curious about the fault model and corresponding hardware ECC mechanisms
> for NVDIMMs. In my understanding for memory accesses to trigger MCE, it
> means the memory controller finds a detectable but uncorrectable error
> (DUE). So if there is no hardware ECC support the media errors won't even
> be noticed, not to mention badblocks or machine checks.
> 
> Current hardware ECC support for DRAM usually employs (72, 64) single-bit
> error correction mechanism, and for advanced ECCs there are techniques like
> Chipkill or SDDC which can tolerate a single DRAM chip failure. What is the
> expected ECC mode for NVDIMMs, assuming that PCM or 3dXpoint based
> technology might have higher error rates?

I'm sure once NVDIMMs start becoming widely available, there will be
more information on how they do ECC..

> 
> If DUE does happen and is flagged to the file system via MCE (somehow...),
> and the fs finds that the error corrupts its allocated data page, or
> metadata, now if the fs wants to recover its data the intuition is that
> there needs to be a stronger error correction mechanism to correct the
> hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> helpful for the file system to understand how severe are the faults in
> badblocks, and develop its recovery methods.

Like mentioned before, this discussion is more about presentation of
errors in a known consumable format, rather than recovering from errors.
While recovering from errors is interesting, we already have layers
like RAID for that, and they are as applicable to NVDIMM backed storage
as they have been for disk/SSD based storage.

> 
> Regards,
> Lu
> 
> On Tue, Jan 17, 2017 at 6:01 PM, Andiry Xu <andiry@gmail.com> wrote:
> 
> > On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> > >> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@intel.com>
> > wrote:
> > >>> On 01/16, Darrick J. Wong wrote:
> > >>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > >>>>> On 01/14, Slava Dubeyko wrote:
> > >>>>>>
> > >>>>>> ---- Original Message ----
> > >>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in
> > filesystems
> > >>>>>> Sent: Jan 13, 2017 1:40 PM
> > >>>>>> From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > >>>>>> To: lsf-pc@lists.linux-foundation.org
> > >>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
> > linux-fsdevel@vger.kernel.org
> > >>>>>>
> > >>>>>>> The current implementation of badblocks, where we consult the
> > >>>>>>> badblocks list for every IO in the block driver works, and is a
> > >>>>>>> last option failsafe, but from a user perspective, it isn't the
> > >>>>>>> easiest interface to work with.
> > >>>>>>
> > >>>>>> As I remember, FAT and HFS+ specifications contain description of
> > bad blocks
> > >>>>>> (physical sectors) table. I believe that this table was used for
> > the case of
> > >>>>>> floppy media. But, finally, this table becomes to be the completely
> > obsolete
> > >>>>>> artefact because mostly storage devices are reliably enough. Why do
> > you need
> > >>>>
> > >>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR
> > it
> > >>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
> > >>>> vestigial organ at this point.  XFS doesn't have anything to track bad
> > >>>> blocks currently....
> > >>>>
> > >>>>>> in exposing the bad blocks on the file system level?  Do you expect
> > that next
> > >>>>>> generation of NVM memory will be so unreliable that file system
> > needs to manage
> > >>>>>> bad blocks? What's about erasure coding schemes? Do file system
> > really need to suffer
> > >>>>>> from the bad block issue?
> > >>>>>>
> > >>>>>> Usually, we are using LBAs and it is the responsibility of storage
> > device to map
> > >>>>>> a bad physical block/page/sector into valid one. Do you mean that
> > we have
> > >>>>>> access to physical NVM memory address directly? But it looks like
> > that we can
> > >>>>>> have a "bad block" issue even we will access data into page cache's
> > memory
> > >>>>>> page (if we will use NVM memory for page cache, of course). So,
> > what do you
> > >>>>>> imply by "bad block" issue?
> > >>>>>
> > >>>>> We don't have direct physical access to the device's address space,
> > in
> > >>>>> the sense the device is still free to perform remapping of chunks of
> > NVM
> > >>>>> underneath us. The problem is that when a block or address range (as
> > >>>>> small as a cache line) goes bad, the device maintains a poison bit
> > for
> > >>>>> every affected cache line. Behind the scenes, it may have already
> > >>>>> remapped the range, but the cache line poison has to be kept so that
> > >>>>> there is a notification to the user/owner of the data that something
> > has
> > >>>>> been lost. Since NVM is byte addressable memory sitting on the memory
> > >>>>> bus, such a poisoned cache line results in memory errors and
> > SIGBUSes.
> > >>>>> Compared to tradational storage where an app will get nice and
> > friendly
> > >>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
> > >>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
> > >>>>> locations, and short-circuit them with an EIO. If the driver doesn't
> > >>>>> catch these, the reads will turn into a memory bus access, and the
> > >>>>> poison will cause a SIGBUS.
> > >>>>
> > >>>> "driver" ... you mean XFS?  Or do you mean the thing that makes pmem
> > >>>> look kind of like a traditional block device? :)
> > >>>
> > >>> Yes, the thing that makes pmem look like a block device :) --
> > >>> drivers/nvdimm/pmem.c
> > >>>
> > >>>>
> > >>>>> This effort is to try and make this badblock checking smarter - and
> > try
> > >>>>> and reduce the penalty on every IO to a smaller range, which only the
> > >>>>> filesystem can do.
> > >>>>
> > >>>> Though... now that XFS merged the reverse mapping support, I've been
> > >>>> wondering if there'll be a resubmission of the device errors callback?
> > >>>> It still would be useful to be able to inform the user that part of
> > >>>> their fs has gone bad, or, better yet, if the buffer is still in
> > memory
> > >>>> someplace else, just write it back out.
> > >>>>
> > >>>> Or I suppose if we had some kind of raid1 set up between memories we
> > >>>> could read one of the other copies and rewrite it into the failing
> > >>>> region immediately.
> > >>>
> > >>> Yes, that is kind of what I was hoping to accomplish via this
> > >>> discussion. How much would filesystems want to be involved in this sort
> > >>> of badblocks handling, if at all. I can refresh my patches that provide
> > >>> the fs notification, but that's the easy bit, and a starting point.
> > >>>
> > >>
> > >> I have some questions. Why moving badblock handling to file system
> > >> level avoid the checking phase? In file system level for each I/O I
> > >> still have to check the badblock list, right? Do you mean during mount
> > >> it can go through the pmem device and locates all the data structures
> > >> mangled by badblocks and handle them accordingly, so that during
> > >> normal running the badblocks will never be accessed? Or, if there is
> > >> replicataion/snapshot support, use a copy to recover the badblocks?
> > >
> > > With ext4 badblocks, the main outcome is that the bad blocks would be
> > > pemanently marked in the allocation bitmap as being used, and they would
> > > never be allocated to a file, so they should never be accessed unless
> > > doing a full device scan (which ext4 and e2fsck never do).  That would
> > > avoid the need to check every I/O against the bad blocks list, if the
> > > driver knows that the filesystem will handle this.
> > >
> >
> > Thank you for explanation. However this only works for free blocks,
> > right? What about allocated blocks, like file data and metadata?
> >
> > Thanks,
> > Andiry
> >
> > > The one caveat is that ext4 only allows 32-bit block numbers in the
> > > badblocks list, since this feature hasn't been used in a long time.
> > > This is good for up to 16TB filesystems, but if there was a demand to
> > > use this feature again it would be possible allow 64-bit block numbers.
> > >
> > > Cheers, Andreas
> > >
> > >
> > >
> > >
> > >
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm@lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-18  2:01                 ` Andiry Xu
@ 2017-01-20  0:55                   ` Verma, Vishal L
  -1 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:55 UTC (permalink / raw)
  To: andiry, adilger
  Cc: darrick.wong, Vyacheslav.Dubeyko, linux-block, slava, lsf-pc,
	linux-nvdimm, linux-fsdevel

On Tue, 2017-01-17 at 18:01 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger@dilger.ca>
> wrote:
> > On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry@gmail.com> wrote:
> > > On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma@inte
> > > l.com> wrote:
> > > > On 01/16, Darrick J. Wong wrote:
> > > > > On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > > > > > On 01/14, Slava Dubeyko wrote:
> > > > > > > 
> > > > > > > ---- Original Message ----
> > > > > > > Subject: [LSF/MM TOPIC] Badblocks checking/representation
> > > > > > > in filesystems
> > > > > > > Sent: Jan 13, 2017 1:40 PM
> > > > > > > From: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > > > > > > To: lsf-pc@lists.linux-foundation.org
> > > > > > > Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org
> > > > > > > , linux-fsdevel@vger.kernel.org
> > > > > > > 
> > > > > > > > The current implementation of badblocks, where we
> > > > > > > > consult the
> > > > > > > > badblocks list for every IO in the block driver works,
> > > > > > > > and is a
> > > > > > > > last option failsafe, but from a user perspective, it
> > > > > > > > isn't the
> > > > > > > > easiest interface to work with.
> > > > > > > 
> > > > > > > As I remember, FAT and HFS+ specifications contain
> > > > > > > description of bad blocks
> > > > > > > (physical sectors) table. I believe that this table was
> > > > > > > used for the case of
> > > > > > > floppy media. But, finally, this table becomes to be the
> > > > > > > completely obsolete
> > > > > > > artefact because mostly storage devices are reliably
> > > > > > > enough. Why do you need
> > > > > 
> > > > > ext4 has a badblocks inode to own all the bad spots on disk,
> > > > > but ISTR it
> > > > > doesn't support(??) extents or 64-bit filesystems, and might
> > > > > just be a
> > > > > vestigial organ at this point.  XFS doesn't have anything to
> > > > > track bad
> > > > > blocks currently....
> > > > > 
> > > > > > > in exposing the bad blocks on the file system level?  Do
> > > > > > > you expect that next
> > > > > > > generation of NVM memory will be so unreliable that file
> > > > > > > system needs to manage
> > > > > > > bad blocks? What's about erasure coding schemes? Do file
> > > > > > > system really need to suffer
> > > > > > > from the bad block issue?
> > > > > > > 
> > > > > > > Usually, we are using LBAs and it is the responsibility of
> > > > > > > storage device to map
> > > > > > > a bad physical block/page/sector into valid one. Do you
> > > > > > > mean that we have
> > > > > > > access to physical NVM memory address directly? But it
> > > > > > > looks like that we can
> > > > > > > have a "bad block" issue even we will access data into
> > > > > > > page cache's memory
> > > > > > > page (if we will use NVM memory for page cache, of
> > > > > > > course). So, what do you
> > > > > > > imply by "bad block" issue?
> > > > > > 
> > > > > > We don't have direct physical access to the device's address
> > > > > > space, in
> > > > > > the sense the device is still free to perform remapping of
> > > > > > chunks of NVM
> > > > > > underneath us. The problem is that when a block or address
> > > > > > range (as
> > > > > > small as a cache line) goes bad, the device maintains a
> > > > > > poison bit for
> > > > > > every affected cache line. Behind the scenes, it may have
> > > > > > already
> > > > > > remapped the range, but the cache line poison has to be kept
> > > > > > so that
> > > > > > there is a notification to the user/owner of the data that
> > > > > > something has
> > > > > > been lost. Since NVM is byte addressable memory sitting on
> > > > > > the memory
> > > > > > bus, such a poisoned cache line results in memory errors and
> > > > > > SIGBUSes.
> > > > > > Compared to tradational storage where an app will get nice
> > > > > > and friendly
> > > > > > (relatively speaking..) -EIOs. The whole badblocks
> > > > > > implementation was
> > > > > > done so that the driver can intercept IO (i.e. reads) to
> > > > > > _known_ bad
> > > > > > locations, and short-circuit them with an EIO. If the driver
> > > > > > doesn't
> > > > > > catch these, the reads will turn into a memory bus access,
> > > > > > and the
> > > > > > poison will cause a SIGBUS.
> > > > > 
> > > > > "driver" ... you mean XFS?  Or do you mean the thing that
> > > > > makes pmem
> > > > > look kind of like a traditional block device? :)
> > > > 
> > > > Yes, the thing that makes pmem look like a block device :) --
> > > > drivers/nvdimm/pmem.c
> > > > 
> > > > > 
> > > > > > This effort is to try and make this badblock checking
> > > > > > smarter - and try
> > > > > > and reduce the penalty on every IO to a smaller range, which
> > > > > > only the
> > > > > > filesystem can do.
> > > > > 
> > > > > Though... now that XFS merged the reverse mapping support,
> > > > > I've been
> > > > > wondering if there'll be a resubmission of the device errors
> > > > > callback?
> > > > > It still would be useful to be able to inform the user that
> > > > > part of
> > > > > their fs has gone bad, or, better yet, if the buffer is still
> > > > > in memory
> > > > > someplace else, just write it back out.
> > > > > 
> > > > > Or I suppose if we had some kind of raid1 set up between
> > > > > memories we
> > > > > could read one of the other copies and rewrite it into the
> > > > > failing
> > > > > region immediately.
> > > > 
> > > > Yes, that is kind of what I was hoping to accomplish via this
> > > > discussion. How much would filesystems want to be involved in
> > > > this sort
> > > > of badblocks handling, if at all. I can refresh my patches that
> > > > provide
> > > > the fs notification, but that's the easy bit, and a starting
> > > > point.
> > > > 
> > > 
> > > I have some questions. Why moving badblock handling to file system
> > > level avoid the checking phase? In file system level for each I/O
> > > I
> > > still have to check the badblock list, right? Do you mean during
> > > mount
> > > it can go through the pmem device and locates all the data
> > > structures
> > > mangled by badblocks and handle them accordingly, so that during
> > > normal running the badblocks will never be accessed? Or, if there
> > > is
> > > replicataion/snapshot support, use a copy to recover the
> > > badblocks?
> > 
> > With ext4 badblocks, the main outcome is that the bad blocks would
> > be
> > pemanently marked in the allocation bitmap as being used, and they
> > would
> > never be allocated to a file, so they should never be accessed
> > unless
> > doing a full device scan (which ext4 and e2fsck never do).  That
> > would
> > avoid the need to check every I/O against the bad blocks list, if
> > the
> > driver knows that the filesystem will handle this.
> > 
> 
> Thank you for explanation. However this only works for free blocks,
> right? What about allocated blocks, like file data and metadata?
> 
Like Andreas said, the ext4 badblocks feature has not been in use, and
the current block layer badblocks are distinct and unrelated to these.
Can the ext4 badblocks infrastructure be revived and extended if we
decide to add badblocks to filesystems? Maybe - that was one of the
topics I was hoping to discuss/find out more about.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  0:55                   ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-20  0:55 UTC (permalink / raw)
  To: andiry, adilger
  Cc: darrick.wong, Vyacheslav.Dubeyko, linux-block, slava, lsf-pc,
	linux-nvdimm, linux-fsdevel

T24gVHVlLCAyMDE3LTAxLTE3IGF0IDE4OjAxIC0wODAwLCBBbmRpcnkgWHUgd3JvdGU6DQo+IE9u
IFR1ZSwgSmFuIDE3LCAyMDE3IGF0IDQ6MTYgUE0sIEFuZHJlYXMgRGlsZ2VyIDxhZGlsZ2VyQGRp
bGdlci5jYT4NCj4gd3JvdGU6DQo+ID4gT24gSmFuIDE3LCAyMDE3LCBhdCAzOjE1IFBNLCBBbmRp
cnkgWHUgPGFuZGlyeUBnbWFpbC5jb20+IHdyb3RlOg0KPiA+ID4gT24gVHVlLCBKYW4gMTcsIDIw
MTcgYXQgMTozNSBQTSwgVmlzaGFsIFZlcm1hIDx2aXNoYWwubC52ZXJtYUBpbnRlDQo+ID4gPiBs
LmNvbT4gd3JvdGU6DQo+ID4gPiA+IE9uIDAxLzE2LCBEYXJyaWNrIEouIFdvbmcgd3JvdGU6DQo+
ID4gPiA+ID4gT24gRnJpLCBKYW4gMTMsIDIwMTcgYXQgMDU6NDk6MTBQTSAtMDcwMCwgVmlzaGFs
IFZlcm1hIHdyb3RlOg0KPiA+ID4gPiA+ID4gT24gMDEvMTQsIFNsYXZhIER1YmV5a28gd3JvdGU6
DQo+ID4gPiA+ID4gPiA+IA0KPiA+ID4gPiA+ID4gPiAtLS0tIE9yaWdpbmFsIE1lc3NhZ2UgLS0t
LQ0KPiA+ID4gPiA+ID4gPiBTdWJqZWN0OiBbTFNGL01NIFRPUElDXSBCYWRibG9ja3MgY2hlY2tp
bmcvcmVwcmVzZW50YXRpb24NCj4gPiA+ID4gPiA+ID4gaW4gZmlsZXN5c3RlbXMNCj4gPiA+ID4g
PiA+ID4gU2VudDogSmFuIDEzLCAyMDE3IDE6NDAgUE0NCj4gPiA+ID4gPiA+ID4gRnJvbTogIlZl
cm1hLCBWaXNoYWwgTCIgPHZpc2hhbC5sLnZlcm1hQGludGVsLmNvbT4NCj4gPiA+ID4gPiA+ID4g
VG86IGxzZi1wY0BsaXN0cy5saW51eC1mb3VuZGF0aW9uLm9yZw0KPiA+ID4gPiA+ID4gPiBDYzog
bGludXgtbnZkaW1tQGxpc3RzLjAxLm9yZywgbGludXgtYmxvY2tAdmdlci5rZXJuZWwub3JnDQo+
ID4gPiA+ID4gPiA+ICwgbGludXgtZnNkZXZlbEB2Z2VyLmtlcm5lbC5vcmcNCj4gPiA+ID4gPiA+
ID4gDQo+ID4gPiA+ID4gPiA+ID4gVGhlIGN1cnJlbnQgaW1wbGVtZW50YXRpb24gb2YgYmFkYmxv
Y2tzLCB3aGVyZSB3ZQ0KPiA+ID4gPiA+ID4gPiA+IGNvbnN1bHQgdGhlDQo+ID4gPiA+ID4gPiA+
ID4gYmFkYmxvY2tzIGxpc3QgZm9yIGV2ZXJ5IElPIGluIHRoZSBibG9jayBkcml2ZXIgd29ya3Ms
DQo+ID4gPiA+ID4gPiA+ID4gYW5kIGlzIGENCj4gPiA+ID4gPiA+ID4gPiBsYXN0IG9wdGlvbiBm
YWlsc2FmZSwgYnV0IGZyb20gYSB1c2VyIHBlcnNwZWN0aXZlLCBpdA0KPiA+ID4gPiA+ID4gPiA+
IGlzbid0IHRoZQ0KPiA+ID4gPiA+ID4gPiA+IGVhc2llc3QgaW50ZXJmYWNlIHRvIHdvcmsgd2l0
aC4NCj4gPiA+ID4gPiA+ID4gDQo+ID4gPiA+ID4gPiA+IEFzIEkgcmVtZW1iZXIsIEZBVCBhbmQg
SEZTKyBzcGVjaWZpY2F0aW9ucyBjb250YWluDQo+ID4gPiA+ID4gPiA+IGRlc2NyaXB0aW9uIG9m
IGJhZCBibG9ja3MNCj4gPiA+ID4gPiA+ID4gKHBoeXNpY2FsIHNlY3RvcnMpIHRhYmxlLiBJIGJl
bGlldmUgdGhhdCB0aGlzIHRhYmxlIHdhcw0KPiA+ID4gPiA+ID4gPiB1c2VkIGZvciB0aGUgY2Fz
ZSBvZg0KPiA+ID4gPiA+ID4gPiBmbG9wcHkgbWVkaWEuIEJ1dCwgZmluYWxseSwgdGhpcyB0YWJs
ZSBiZWNvbWVzIHRvIGJlIHRoZQ0KPiA+ID4gPiA+ID4gPiBjb21wbGV0ZWx5IG9ic29sZXRlDQo+
ID4gPiA+ID4gPiA+IGFydGVmYWN0IGJlY2F1c2UgbW9zdGx5IHN0b3JhZ2UgZGV2aWNlcyBhcmUg
cmVsaWFibHkNCj4gPiA+ID4gPiA+ID4gZW5vdWdoLiBXaHkgZG8geW91IG5lZWQNCj4gPiA+ID4g
PiANCj4gPiA+ID4gPiBleHQ0IGhhcyBhIGJhZGJsb2NrcyBpbm9kZSB0byBvd24gYWxsIHRoZSBi
YWQgc3BvdHMgb24gZGlzaywNCj4gPiA+ID4gPiBidXQgSVNUUiBpdA0KPiA+ID4gPiA+IGRvZXNu
J3Qgc3VwcG9ydCg/PykgZXh0ZW50cyBvciA2NC1iaXQgZmlsZXN5c3RlbXMsIGFuZCBtaWdodA0K
PiA+ID4gPiA+IGp1c3QgYmUgYQ0KPiA+ID4gPiA+IHZlc3RpZ2lhbCBvcmdhbiBhdCB0aGlzIHBv
aW50LsKgwqBYRlMgZG9lc24ndCBoYXZlIGFueXRoaW5nIHRvDQo+ID4gPiA+ID4gdHJhY2sgYmFk
DQo+ID4gPiA+ID4gYmxvY2tzIGN1cnJlbnRseS4uLi4NCj4gPiA+ID4gPiANCj4gPiA+ID4gPiA+
ID4gaW4gZXhwb3NpbmcgdGhlIGJhZCBibG9ja3Mgb24gdGhlIGZpbGUgc3lzdGVtIGxldmVsP8Kg
wqBEbw0KPiA+ID4gPiA+ID4gPiB5b3UgZXhwZWN0IHRoYXQgbmV4dA0KPiA+ID4gPiA+ID4gPiBn
ZW5lcmF0aW9uIG9mIE5WTSBtZW1vcnkgd2lsbCBiZSBzbyB1bnJlbGlhYmxlIHRoYXQgZmlsZQ0K
PiA+ID4gPiA+ID4gPiBzeXN0ZW0gbmVlZHMgdG8gbWFuYWdlDQo+ID4gPiA+ID4gPiA+IGJhZCBi
bG9ja3M/IFdoYXQncyBhYm91dCBlcmFzdXJlIGNvZGluZyBzY2hlbWVzPyBEbyBmaWxlDQo+ID4g
PiA+ID4gPiA+IHN5c3RlbSByZWFsbHkgbmVlZCB0byBzdWZmZXINCj4gPiA+ID4gPiA+ID4gZnJv
bSB0aGUgYmFkIGJsb2NrIGlzc3VlPw0KPiA+ID4gPiA+ID4gPiANCj4gPiA+ID4gPiA+ID4gVXN1
YWxseSwgd2UgYXJlIHVzaW5nIExCQXMgYW5kIGl0IGlzIHRoZSByZXNwb25zaWJpbGl0eSBvZg0K
PiA+ID4gPiA+ID4gPiBzdG9yYWdlIGRldmljZSB0byBtYXANCj4gPiA+ID4gPiA+ID4gYSBiYWQg
cGh5c2ljYWwgYmxvY2svcGFnZS9zZWN0b3IgaW50byB2YWxpZCBvbmUuIERvIHlvdQ0KPiA+ID4g
PiA+ID4gPiBtZWFuIHRoYXQgd2UgaGF2ZQ0KPiA+ID4gPiA+ID4gPiBhY2Nlc3MgdG8gcGh5c2lj
YWwgTlZNIG1lbW9yeSBhZGRyZXNzIGRpcmVjdGx5PyBCdXQgaXQNCj4gPiA+ID4gPiA+ID4gbG9v
a3MgbGlrZSB0aGF0IHdlIGNhbg0KPiA+ID4gPiA+ID4gPiBoYXZlIGEgImJhZCBibG9jayIgaXNz
dWUgZXZlbiB3ZSB3aWxsIGFjY2VzcyBkYXRhIGludG8NCj4gPiA+ID4gPiA+ID4gcGFnZSBjYWNo
ZSdzIG1lbW9yeQ0KPiA+ID4gPiA+ID4gPiBwYWdlIChpZiB3ZSB3aWxsIHVzZSBOVk0gbWVtb3J5
IGZvciBwYWdlIGNhY2hlLCBvZg0KPiA+ID4gPiA+ID4gPiBjb3Vyc2UpLiBTbywgd2hhdCBkbyB5
b3UNCj4gPiA+ID4gPiA+ID4gaW1wbHkgYnkgImJhZCBibG9jayIgaXNzdWU/DQo+ID4gPiA+ID4g
PiANCj4gPiA+ID4gPiA+IFdlIGRvbid0IGhhdmUgZGlyZWN0IHBoeXNpY2FsIGFjY2VzcyB0byB0
aGUgZGV2aWNlJ3MgYWRkcmVzcw0KPiA+ID4gPiA+ID4gc3BhY2UsIGluDQo+ID4gPiA+ID4gPiB0
aGUgc2Vuc2UgdGhlIGRldmljZSBpcyBzdGlsbCBmcmVlIHRvIHBlcmZvcm0gcmVtYXBwaW5nIG9m
DQo+ID4gPiA+ID4gPiBjaHVua3Mgb2YgTlZNDQo+ID4gPiA+ID4gPiB1bmRlcm5lYXRoIHVzLiBU
aGUgcHJvYmxlbSBpcyB0aGF0IHdoZW4gYSBibG9jayBvciBhZGRyZXNzDQo+ID4gPiA+ID4gPiBy
YW5nZSAoYXMNCj4gPiA+ID4gPiA+IHNtYWxsIGFzIGEgY2FjaGUgbGluZSkgZ29lcyBiYWQsIHRo
ZSBkZXZpY2UgbWFpbnRhaW5zIGENCj4gPiA+ID4gPiA+IHBvaXNvbiBiaXQgZm9yDQo+ID4gPiA+
ID4gPiBldmVyeSBhZmZlY3RlZCBjYWNoZSBsaW5lLiBCZWhpbmQgdGhlIHNjZW5lcywgaXQgbWF5
IGhhdmUNCj4gPiA+ID4gPiA+IGFscmVhZHkNCj4gPiA+ID4gPiA+IHJlbWFwcGVkIHRoZSByYW5n
ZSwgYnV0IHRoZSBjYWNoZSBsaW5lIHBvaXNvbiBoYXMgdG8gYmUga2VwdA0KPiA+ID4gPiA+ID4g
c28gdGhhdA0KPiA+ID4gPiA+ID4gdGhlcmUgaXMgYSBub3RpZmljYXRpb24gdG8gdGhlIHVzZXIv
b3duZXIgb2YgdGhlIGRhdGEgdGhhdA0KPiA+ID4gPiA+ID4gc29tZXRoaW5nIGhhcw0KPiA+ID4g
PiA+ID4gYmVlbiBsb3N0LiBTaW5jZSBOVk0gaXMgYnl0ZSBhZGRyZXNzYWJsZSBtZW1vcnkgc2l0
dGluZyBvbg0KPiA+ID4gPiA+ID4gdGhlIG1lbW9yeQ0KPiA+ID4gPiA+ID4gYnVzLCBzdWNoIGEg
cG9pc29uZWQgY2FjaGUgbGluZSByZXN1bHRzIGluIG1lbW9yeSBlcnJvcnMgYW5kDQo+ID4gPiA+
ID4gPiBTSUdCVVNlcy4NCj4gPiA+ID4gPiA+IENvbXBhcmVkIHRvIHRyYWRhdGlvbmFsIHN0b3Jh
Z2Ugd2hlcmUgYW4gYXBwIHdpbGwgZ2V0IG5pY2UNCj4gPiA+ID4gPiA+IGFuZCBmcmllbmRseQ0K
PiA+ID4gPiA+ID4gKHJlbGF0aXZlbHkgc3BlYWtpbmcuLikgLUVJT3MuIFRoZSB3aG9sZSBiYWRi
bG9ja3MNCj4gPiA+ID4gPiA+IGltcGxlbWVudGF0aW9uIHdhcw0KPiA+ID4gPiA+ID4gZG9uZSBz
byB0aGF0IHRoZSBkcml2ZXIgY2FuIGludGVyY2VwdCBJTyAoaS5lLiByZWFkcykgdG8NCj4gPiA+
ID4gPiA+IF9rbm93bl8gYmFkDQo+ID4gPiA+ID4gPiBsb2NhdGlvbnMsIGFuZCBzaG9ydC1jaXJj
dWl0IHRoZW0gd2l0aCBhbiBFSU8uIElmIHRoZSBkcml2ZXINCj4gPiA+ID4gPiA+IGRvZXNuJ3QN
Cj4gPiA+ID4gPiA+IGNhdGNoIHRoZXNlLCB0aGUgcmVhZHMgd2lsbCB0dXJuIGludG8gYSBtZW1v
cnkgYnVzIGFjY2VzcywNCj4gPiA+ID4gPiA+IGFuZCB0aGUNCj4gPiA+ID4gPiA+IHBvaXNvbiB3
aWxsIGNhdXNlIGEgU0lHQlVTLg0KPiA+ID4gPiA+IA0KPiA+ID4gPiA+ICJkcml2ZXIiIC4uLiB5
b3UgbWVhbiBYRlM/wqDCoE9yIGRvIHlvdSBtZWFuIHRoZSB0aGluZyB0aGF0DQo+ID4gPiA+ID4g
bWFrZXMgcG1lbQ0KPiA+ID4gPiA+IGxvb2sga2luZCBvZiBsaWtlIGEgdHJhZGl0aW9uYWwgYmxv
Y2sgZGV2aWNlPyA6KQ0KPiA+ID4gPiANCj4gPiA+ID4gWWVzLCB0aGUgdGhpbmcgdGhhdCBtYWtl
cyBwbWVtIGxvb2sgbGlrZSBhIGJsb2NrIGRldmljZSA6KSAtLQ0KPiA+ID4gPiBkcml2ZXJzL252
ZGltbS9wbWVtLmMNCj4gPiA+ID4gDQo+ID4gPiA+ID4gDQo+ID4gPiA+ID4gPiBUaGlzIGVmZm9y
dCBpcyB0byB0cnkgYW5kIG1ha2UgdGhpcyBiYWRibG9jayBjaGVja2luZw0KPiA+ID4gPiA+ID4g
c21hcnRlciAtIGFuZCB0cnkNCj4gPiA+ID4gPiA+IGFuZCByZWR1Y2UgdGhlIHBlbmFsdHkgb24g
ZXZlcnkgSU8gdG8gYSBzbWFsbGVyIHJhbmdlLCB3aGljaA0KPiA+ID4gPiA+ID4gb25seSB0aGUN
Cj4gPiA+ID4gPiA+IGZpbGVzeXN0ZW0gY2FuIGRvLg0KPiA+ID4gPiA+IA0KPiA+ID4gPiA+IFRo
b3VnaC4uLiBub3cgdGhhdCBYRlMgbWVyZ2VkIHRoZSByZXZlcnNlIG1hcHBpbmcgc3VwcG9ydCwN
Cj4gPiA+ID4gPiBJJ3ZlIGJlZW4NCj4gPiA+ID4gPiB3b25kZXJpbmcgaWYgdGhlcmUnbGwgYmUg
YSByZXN1Ym1pc3Npb24gb2YgdGhlIGRldmljZSBlcnJvcnMNCj4gPiA+ID4gPiBjYWxsYmFjaz8N
Cj4gPiA+ID4gPiBJdCBzdGlsbCB3b3VsZCBiZSB1c2VmdWwgdG8gYmUgYWJsZSB0byBpbmZvcm0g
dGhlIHVzZXIgdGhhdA0KPiA+ID4gPiA+IHBhcnQgb2YNCj4gPiA+ID4gPiB0aGVpciBmcyBoYXMg
Z29uZSBiYWQsIG9yLCBiZXR0ZXIgeWV0LCBpZiB0aGUgYnVmZmVyIGlzIHN0aWxsDQo+ID4gPiA+
ID4gaW4gbWVtb3J5DQo+ID4gPiA+ID4gc29tZXBsYWNlIGVsc2UsIGp1c3Qgd3JpdGUgaXQgYmFj
ayBvdXQuDQo+ID4gPiA+ID4gDQo+ID4gPiA+ID4gT3IgSSBzdXBwb3NlIGlmIHdlIGhhZCBzb21l
IGtpbmQgb2YgcmFpZDEgc2V0IHVwIGJldHdlZW4NCj4gPiA+ID4gPiBtZW1vcmllcyB3ZQ0KPiA+
ID4gPiA+IGNvdWxkIHJlYWQgb25lIG9mIHRoZSBvdGhlciBjb3BpZXMgYW5kIHJld3JpdGUgaXQg
aW50byB0aGUNCj4gPiA+ID4gPiBmYWlsaW5nDQo+ID4gPiA+ID4gcmVnaW9uIGltbWVkaWF0ZWx5
Lg0KPiA+ID4gPiANCj4gPiA+ID4gWWVzLCB0aGF0IGlzIGtpbmQgb2Ygd2hhdCBJIHdhcyBob3Bp
bmcgdG8gYWNjb21wbGlzaCB2aWEgdGhpcw0KPiA+ID4gPiBkaXNjdXNzaW9uLiBIb3cgbXVjaCB3
b3VsZCBmaWxlc3lzdGVtcyB3YW50IHRvIGJlIGludm9sdmVkIGluDQo+ID4gPiA+IHRoaXMgc29y
dA0KPiA+ID4gPiBvZiBiYWRibG9ja3MgaGFuZGxpbmcsIGlmIGF0IGFsbC4gSSBjYW4gcmVmcmVz
aCBteSBwYXRjaGVzIHRoYXQNCj4gPiA+ID4gcHJvdmlkZQ0KPiA+ID4gPiB0aGUgZnMgbm90aWZp
Y2F0aW9uLCBidXQgdGhhdCdzIHRoZSBlYXN5IGJpdCwgYW5kIGEgc3RhcnRpbmcNCj4gPiA+ID4g
cG9pbnQuDQo+ID4gPiA+IA0KPiA+ID4gDQo+ID4gPiBJIGhhdmUgc29tZSBxdWVzdGlvbnMuIFdo
eSBtb3ZpbmcgYmFkYmxvY2sgaGFuZGxpbmcgdG8gZmlsZSBzeXN0ZW0NCj4gPiA+IGxldmVsIGF2
b2lkIHRoZSBjaGVja2luZyBwaGFzZT8gSW4gZmlsZSBzeXN0ZW0gbGV2ZWwgZm9yIGVhY2ggSS9P
DQo+ID4gPiBJDQo+ID4gPiBzdGlsbCBoYXZlIHRvIGNoZWNrIHRoZSBiYWRibG9jayBsaXN0LCBy
aWdodD8gRG8geW91IG1lYW4gZHVyaW5nDQo+ID4gPiBtb3VudA0KPiA+ID4gaXQgY2FuIGdvIHRo
cm91Z2ggdGhlIHBtZW0gZGV2aWNlIGFuZCBsb2NhdGVzIGFsbCB0aGUgZGF0YQ0KPiA+ID4gc3Ry
dWN0dXJlcw0KPiA+ID4gbWFuZ2xlZCBieSBiYWRibG9ja3MgYW5kIGhhbmRsZSB0aGVtIGFjY29y
ZGluZ2x5LCBzbyB0aGF0IGR1cmluZw0KPiA+ID4gbm9ybWFsIHJ1bm5pbmcgdGhlIGJhZGJsb2Nr
cyB3aWxsIG5ldmVyIGJlIGFjY2Vzc2VkPyBPciwgaWYgdGhlcmUNCj4gPiA+IGlzDQo+ID4gPiBy
ZXBsaWNhdGFpb24vc25hcHNob3Qgc3VwcG9ydCwgdXNlIGEgY29weSB0byByZWNvdmVyIHRoZQ0K
PiA+ID4gYmFkYmxvY2tzPw0KPiA+IA0KPiA+IFdpdGggZXh0NCBiYWRibG9ja3MsIHRoZSBtYWlu
IG91dGNvbWUgaXMgdGhhdCB0aGUgYmFkIGJsb2NrcyB3b3VsZA0KPiA+IGJlDQo+ID4gcGVtYW5l
bnRseSBtYXJrZWQgaW4gdGhlIGFsbG9jYXRpb24gYml0bWFwIGFzIGJlaW5nIHVzZWQsIGFuZCB0
aGV5DQo+ID4gd291bGQNCj4gPiBuZXZlciBiZSBhbGxvY2F0ZWQgdG8gYSBmaWxlLCBzbyB0aGV5
IHNob3VsZCBuZXZlciBiZSBhY2Nlc3NlZA0KPiA+IHVubGVzcw0KPiA+IGRvaW5nIGEgZnVsbCBk
ZXZpY2Ugc2NhbiAod2hpY2ggZXh0NCBhbmQgZTJmc2NrIG5ldmVyIGRvKS7CoMKgVGhhdA0KPiA+
IHdvdWxkDQo+ID4gYXZvaWQgdGhlIG5lZWQgdG8gY2hlY2sgZXZlcnkgSS9PIGFnYWluc3QgdGhl
IGJhZCBibG9ja3MgbGlzdCwgaWYNCj4gPiB0aGUNCj4gPiBkcml2ZXIga25vd3MgdGhhdCB0aGUg
ZmlsZXN5c3RlbSB3aWxsIGhhbmRsZSB0aGlzLg0KPiA+IA0KPiANCj4gVGhhbmsgeW91IGZvciBl
eHBsYW5hdGlvbi4gSG93ZXZlciB0aGlzIG9ubHkgd29ya3MgZm9yIGZyZWUgYmxvY2tzLA0KPiBy
aWdodD8gV2hhdCBhYm91dCBhbGxvY2F0ZWQgYmxvY2tzLCBsaWtlIGZpbGUgZGF0YSBhbmQgbWV0
YWRhdGE/DQo+IA0KTGlrZSBBbmRyZWFzIHNhaWQsIHRoZSBleHQ0IGJhZGJsb2NrcyBmZWF0dXJl
IGhhcyBub3QgYmVlbiBpbiB1c2UsIGFuZA0KdGhlIGN1cnJlbnQgYmxvY2sgbGF5ZXIgYmFkYmxv
Y2tzIGFyZSBkaXN0aW5jdCBhbmQgdW5yZWxhdGVkIHRvIHRoZXNlLg0KQ2FuIHRoZSBleHQ0IGJh
ZGJsb2NrcyBpbmZyYXN0cnVjdHVyZSBiZSByZXZpdmVkIGFuZCBleHRlbmRlZCBpZiB3ZQ0KZGVj
aWRlIHRvIGFkZCBiYWRibG9ja3MgdG8gZmlsZXN5c3RlbXM/IE1heWJlIC0gdGhhdCB3YXMgb25l
IG9mIHRoZQ0KdG9waWNzIEkgd2FzIGhvcGluZyB0byBkaXNjdXNzL2ZpbmQgb3V0IG1vcmUgYWJv
dXQu

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-19 19:03                                     ` Dan Williams
@ 2017-01-20  9:03                                         ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-20  9:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Thu 19-01-17 11:03:12, Dan Williams wrote:
> On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma <vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > On 01/19, Jan Kara wrote:
> >> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> >> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> >> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> >> > > <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> >> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> >> > > > > Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> >> > > > >
> >> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> > > > > > > Your note on the online repair does raise another tangentially
> >> > > > > > > related
> >> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> >> > > > > > > submission
> >> > > > > > > path will clear the error (if the hardware is able to remap
> >> > > > > > > the bad
> >> > > > > > > locations). However, if the filesystem is mounted eith DAX,
> >> > > > > > > even
> >> > > > > > > non-mmap operations - read() and write() will go through the
> >> > > > > > > dax paths
> >> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> >> > > > > > > perform
> >> > > > > > > error-clearing in this case. So currently, if a dax mounted
> >> > > > > > > filesystem
> >> > > > > > > has badblocks, the only way to clear those badblocks is to
> >> > > > > > > mount it
> >> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> >> > > > > > > pretty
> >> > > > > > > terrible user experience, and I'm hoping this can be solved in
> >> > > > > > > a better
> >> > > > > > > way.
> >> > > > > >
> >> > > > > > Please remind me, what is the problem with DAX code doing
> >> > > > > > necessary work to
> >> > > > > > clear the error when it gets EIO from memcpy on write?
> >> > > > >
> >> > > > > You won't get an MCE for a store;  only loads generate them.
> >> > > > >
> >> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> >> > > > > -o dax?
> >> > > >
> >> > > > Not necessarily; XFS usually implements this by punching out the
> >> > > > range
> >> > > > and then reallocating it as unwritten blocks.
> >> > > >
> >> > >
> >> > > That does clear the error because the unwritten blocks are zeroed and
> >> > > errors cleared when they become allocated again.
> >> >
> >> > Yes, the problem was that writes won't clear errors. zeroing through
> >> > either hole-punch, truncate, unlinking the file should all work
> >> > (assuming the hole-punch or truncate ranges wholly contain the
> >> > 'badblock' sector).
> >>
> >> Let me repeat my question: You have mentioned that if we do IO through DAX,
> >> writes won't clear errors and we should fall back to normal block path to
> >> do write to clear the error. What does prevent us from directly clearing
> >> the error from DAX path?
> >>
> > With DAX, all IO goes through DAX paths. There are two cases:
> > 1. mmap and loads/stores: Obviously there is no kernel intervention
> > here, and no badblocks handling is possible.
> > 2. read() or write() IO: In the absence of dax, this would go through
> > the bio submission path, through the pmem driver, and that would handle
> > error clearing. With DAX, this goes through dax_iomap_actor, which also
> > doesn't go through the pmem driver (it does a dax mapping, followed by
> > essentially memcpy), and hence cannot handle badblocks.
> 
> Hmm, that may no longer be true after my changes to push dax flushing
> to the driver. I.e. we could have a copy_from_iter() implementation
> that attempts to clear errors... I'll get that series out and we can
> discuss there.

Yeah, that was precisely my point - doing copy_from_iter() that clears
errors should be possible...

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  9:03                                         ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-20  9:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vishal Verma, Jan Kara, darrick.wong, Vyacheslav.Dubeyko,
	linux-nvdimm, linux-block, slava, linux-fsdevel, lsf-pc

On Thu 19-01-17 11:03:12, Dan Williams wrote:
> On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> > On 01/19, Jan Kara wrote:
> >> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> >> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> >> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> >> > > <darrick.wong@oracle.com> wrote:
> >> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> >> > > > > Jan Kara <jack@suse.cz> writes:
> >> > > > >
> >> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> > > > > > > Your note on the online repair does raise another tangentially
> >> > > > > > > related
> >> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> >> > > > > > > submission
> >> > > > > > > path will clear the error (if the hardware is able to remap
> >> > > > > > > the bad
> >> > > > > > > locations). However, if the filesystem is mounted eith DAX,
> >> > > > > > > even
> >> > > > > > > non-mmap operations - read() and write() will go through the
> >> > > > > > > dax paths
> >> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> >> > > > > > > perform
> >> > > > > > > error-clearing in this case. So currently, if a dax mounted
> >> > > > > > > filesystem
> >> > > > > > > has badblocks, the only way to clear those badblocks is to
> >> > > > > > > mount it
> >> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> >> > > > > > > pretty
> >> > > > > > > terrible user experience, and I'm hoping this can be solved in
> >> > > > > > > a better
> >> > > > > > > way.
> >> > > > > >
> >> > > > > > Please remind me, what is the problem with DAX code doing
> >> > > > > > necessary work to
> >> > > > > > clear the error when it gets EIO from memcpy on write?
> >> > > > >
> >> > > > > You won't get an MCE for a store;  only loads generate them.
> >> > > > >
> >> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> >> > > > > -o dax?
> >> > > >
> >> > > > Not necessarily; XFS usually implements this by punching out the
> >> > > > range
> >> > > > and then reallocating it as unwritten blocks.
> >> > > >
> >> > >
> >> > > That does clear the error because the unwritten blocks are zeroed and
> >> > > errors cleared when they become allocated again.
> >> >
> >> > Yes, the problem was that writes won't clear errors. zeroing through
> >> > either hole-punch, truncate, unlinking the file should all work
> >> > (assuming the hole-punch or truncate ranges wholly contain the
> >> > 'badblock' sector).
> >>
> >> Let me repeat my question: You have mentioned that if we do IO through DAX,
> >> writes won't clear errors and we should fall back to normal block path to
> >> do write to clear the error. What does prevent us from directly clearing
> >> the error from DAX path?
> >>
> > With DAX, all IO goes through DAX paths. There are two cases:
> > 1. mmap and loads/stores: Obviously there is no kernel intervention
> > here, and no badblocks handling is possible.
> > 2. read() or write() IO: In the absence of dax, this would go through
> > the bio submission path, through the pmem driver, and that would handle
> > error clearing. With DAX, this goes through dax_iomap_actor, which also
> > doesn't go through the pmem driver (it does a dax mapping, followed by
> > essentially memcpy), and hence cannot handle badblocks.
> 
> Hmm, that may no longer be true after my changes to push dax flushing
> to the driver. I.e. we could have a copy_from_iter() implementation
> that attempts to clear errors... I'll get that series out and we can
> discuss there.

Yeah, that was precisely my point - doing copy_from_iter() that clears
errors should be possible...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-20  0:46                     ` Vishal Verma
@ 2017-01-20  9:24                       ` Yasunori Goto
  -1 siblings, 0 replies; 89+ messages in thread
From: Yasunori Goto @ 2017-01-20  9:24 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Andiry Xu,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

Hello,
Virshal-san.

First of all, your discussion is quite interesting for me. Thanks.

> > 
> > If DUE does happen and is flagged to the file system via MCE (somehow...),
> > and the fs finds that the error corrupts its allocated data page, or
> > metadata, now if the fs wants to recover its data the intuition is that
> > there needs to be a stronger error correction mechanism to correct the
> > hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> > helpful for the file system to understand how severe are the faults in
> > badblocks, and develop its recovery methods.
> 
> Like mentioned before, this discussion is more about presentation of
> errors in a known consumable format, rather than recovering from errors.
> While recovering from errors is interesting, we already have layers
> like RAID for that, and they are as applicable to NVDIMM backed storage
> as they have been for disk/SSD based storage.

I have one question here.

Certainly, user can use LVM mirroring for storage mode of NVDIMM.
However, NVDIMM has DAX mode. 
Can user use LVM mirroring for NVDIMM DAX mode?
I could not find any information that LVM support DAX....

In addition, current specs of NVDIMM (*) only define interleave feature of NVDIMMs. 
They does not mention about mirroring feature.
So, I don't understand how to use mirroring for DAX.

(*) "NVDIMM Namespace Specification" , "NVDIMM Block Window Driver Writer’s Guide",
       and "ACPI 6.1"

Regards,
---
Yasunori Goto

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  9:24                       ` Yasunori Goto
  0 siblings, 0 replies; 89+ messages in thread
From: Yasunori Goto @ 2017-01-20  9:24 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Lu Zhang, Andreas Dilger, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko, Andiry Xu, lsf-pc

Hello,
Virshal-san.

First of all, your discussion is quite interesting for me. Thanks.

> > 
> > If DUE does happen and is flagged to the file system via MCE (somehow...),
> > and the fs finds that the error corrupts its allocated data page, or
> > metadata, now if the fs wants to recover its data the intuition is that
> > there needs to be a stronger error correction mechanism to correct the
> > hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> > helpful for the file system to understand how severe are the faults in
> > badblocks, and develop its recovery methods.
> 
> Like mentioned before, this discussion is more about presentation of
> errors in a known consumable format, rather than recovering from errors.
> While recovering from errors is interesting, we already have layers
> like RAID for that, and they are as applicable to NVDIMM backed storage
> as they have been for disk/SSD based storage.

I have one question here.

Certainly, user can use LVM mirroring for storage mode of NVDIMM.
However, NVDIMM has DAX mode. 
Can user use LVM mirroring for NVDIMM DAX mode?
I could not find any information that LVM support DAX....

In addition, current specs of NVDIMM (*) only define interleave feature of NVDIMMs. 
They does not mention about mirroring feature.
So, I don't understand how to use mirroring for DAX.

(*) "NVDIMM Namespace Specification" , "NVDIMM Block Window Driver Writer’s Guide",
       and "ACPI 6.1"

Regards,
---
Yasunori Goto


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-19 21:17                   ` Vishal Verma
@ 2017-01-20  9:47                     ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-20  9:47 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Jan Kara, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko, Andiry Xu, lsf-pc

On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> On 01/18, Jan Kara wrote:
> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> > mostly interested in. We could possibly do something more efficient than
> > what NVDIMM driver does however the complexity would be relatively high and
> > frankly I'm far from convinced this is really worth it. If there are so
> > many badblocks this would matter, the HW has IMHO bigger problems than
> > performance.
> 
> Correct, and Dave was of the opinion that once at least XFS has reverse
> mapping support (which it does now), adding badblocks information to
> that should not be a hard lift, and should be a better solution. I
> suppose should try to benchmark how much of a penalty the current badblock
> checking in the NVVDIMM driver imposes. The penalty is not because there
> may be a large number of badblocks, but just due to the fact that we
> have to do this check for every IO, in fact, every 'bvec' in a bio.

Well, letting filesystem know is certainly good from error reporting quality
POV. I guess I'll leave it upto XFS guys to tell whether they can be more
efficient in checking whether current IO overlaps with any of given bad
blocks.
 
> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> > if the platform can recover from MCE, we can just always access persistent
> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> > seems to already happen so we just need to make sure all places handle
> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> > and in that case who cares about performance when the data is gone...
> 
> Even when we have MCE recovery, we cannot do away with the badblocks
> list:
> 1. My understanding is that the hardware's ability to do MCE recovery is
> limited/best-effort, and is not guaranteed. There can be circumstances
> that cause a "Processor Context Corrupt" state, which is unrecoverable.

Well, then they have to work on improving the hardware. Because having HW
that just sometimes gets stuck instead of reporting bad storage is simply
not acceptable. And no matter how hard you try you cannot avoid MCEs from
OS when accessing persistent memory so OS just has no way to avoid that
risk.

> 2. We still need to maintain a badblocks list so that we know what
> blocks need to be cleared (via the ACPI method) on writes.

Well, why cannot we just do the write, see whether we got CMCI and if yes,
clear the error via the ACPI method?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20  9:47                     ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-20  9:47 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Jan Kara, Andiry Xu, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Viacheslav Dubeyko,
	Linux FS Devel, lsf-pc

On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> On 01/18, Jan Kara wrote:
> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> > mostly interested in. We could possibly do something more efficient than
> > what NVDIMM driver does however the complexity would be relatively high and
> > frankly I'm far from convinced this is really worth it. If there are so
> > many badblocks this would matter, the HW has IMHO bigger problems than
> > performance.
> 
> Correct, and Dave was of the opinion that once at least XFS has reverse
> mapping support (which it does now), adding badblocks information to
> that should not be a hard lift, and should be a better solution. I
> suppose should try to benchmark how much of a penalty the current badblock
> checking in the NVVDIMM driver imposes. The penalty is not because there
> may be a large number of badblocks, but just due to the fact that we
> have to do this check for every IO, in fact, every 'bvec' in a bio.

Well, letting filesystem know is certainly good from error reporting quality
POV. I guess I'll leave it upto XFS guys to tell whether they can be more
efficient in checking whether current IO overlaps with any of given bad
blocks.
 
> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> > if the platform can recover from MCE, we can just always access persistent
> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> > seems to already happen so we just need to make sure all places handle
> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> > and in that case who cares about performance when the data is gone...
> 
> Even when we have MCE recovery, we cannot do away with the badblocks
> list:
> 1. My understanding is that the hardware's ability to do MCE recovery is
> limited/best-effort, and is not guaranteed. There can be circumstances
> that cause a "Processor Context Corrupt" state, which is unrecoverable.

Well, then they have to work on improving the hardware. Because having HW
that just sometimes gets stuck instead of reporting bad storage is simply
not acceptable. And no matter how hard you try you cannot avoid MCEs from
OS when accessing persistent memory so OS just has no way to avoid that
risk.

> 2. We still need to maintain a badblocks list so that we know what
> blocks need to be cleared (via the ACPI method) on writes.

Well, why cannot we just do the write, see whether we got CMCI and if yes,
clear the error via the ACPI method?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-20  9:47                     ` Jan Kara
@ 2017-01-20 15:42                       ` Dan Williams
  -1 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2017-01-20 15:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Andiry Xu, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc

On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 19-01-17 14:17:19, Vishal Verma wrote:
>> On 01/18, Jan Kara wrote:
>> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
>> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
>> > mostly interested in. We could possibly do something more efficient than
>> > what NVDIMM driver does however the complexity would be relatively high and
>> > frankly I'm far from convinced this is really worth it. If there are so
>> > many badblocks this would matter, the HW has IMHO bigger problems than
>> > performance.
>>
>> Correct, and Dave was of the opinion that once at least XFS has reverse
>> mapping support (which it does now), adding badblocks information to
>> that should not be a hard lift, and should be a better solution. I
>> suppose should try to benchmark how much of a penalty the current badblock
>> checking in the NVVDIMM driver imposes. The penalty is not because there
>> may be a large number of badblocks, but just due to the fact that we
>> have to do this check for every IO, in fact, every 'bvec' in a bio.
>
> Well, letting filesystem know is certainly good from error reporting quality
> POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> efficient in checking whether current IO overlaps with any of given bad
> blocks.
>
>> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
>> > if the platform can recover from MCE, we can just always access persistent
>> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
>> > seems to already happen so we just need to make sure all places handle
>> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
>> > No need for bad blocks list at all, no slow down unless we hit a bad cell
>> > and in that case who cares about performance when the data is gone...
>>
>> Even when we have MCE recovery, we cannot do away with the badblocks
>> list:
>> 1. My understanding is that the hardware's ability to do MCE recovery is
>> limited/best-effort, and is not guaranteed. There can be circumstances
>> that cause a "Processor Context Corrupt" state, which is unrecoverable.
>
> Well, then they have to work on improving the hardware. Because having HW
> that just sometimes gets stuck instead of reporting bad storage is simply
> not acceptable. And no matter how hard you try you cannot avoid MCEs from
> OS when accessing persistent memory so OS just has no way to avoid that
> risk.
>
>> 2. We still need to maintain a badblocks list so that we know what
>> blocks need to be cleared (via the ACPI method) on writes.
>
> Well, why cannot we just do the write, see whether we got CMCI and if yes,
> clear the error via the ACPI method?

I would need to check if you get the address reported in the CMCI, but
it would only fire if the write triggered a read-modify-write cycle. I
suspect most copies to pmem, through something like
arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
asynchronously, so what data do you write after clearing the error?
There may have been more writes while the CMCI was being delivered.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-20 15:42                       ` Dan Williams
  0 siblings, 0 replies; 89+ messages in thread
From: Dan Williams @ 2017-01-20 15:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vishal Verma, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko, Andiry Xu, lsf-pc

On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 19-01-17 14:17:19, Vishal Verma wrote:
>> On 01/18, Jan Kara wrote:
>> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
>> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
>> > mostly interested in. We could possibly do something more efficient than
>> > what NVDIMM driver does however the complexity would be relatively high and
>> > frankly I'm far from convinced this is really worth it. If there are so
>> > many badblocks this would matter, the HW has IMHO bigger problems than
>> > performance.
>>
>> Correct, and Dave was of the opinion that once at least XFS has reverse
>> mapping support (which it does now), adding badblocks information to
>> that should not be a hard lift, and should be a better solution. I
>> suppose should try to benchmark how much of a penalty the current badblock
>> checking in the NVVDIMM driver imposes. The penalty is not because there
>> may be a large number of badblocks, but just due to the fact that we
>> have to do this check for every IO, in fact, every 'bvec' in a bio.
>
> Well, letting filesystem know is certainly good from error reporting quality
> POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> efficient in checking whether current IO overlaps with any of given bad
> blocks.
>
>> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
>> > if the platform can recover from MCE, we can just always access persistent
>> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
>> > seems to already happen so we just need to make sure all places handle
>> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
>> > No need for bad blocks list at all, no slow down unless we hit a bad cell
>> > and in that case who cares about performance when the data is gone...
>>
>> Even when we have MCE recovery, we cannot do away with the badblocks
>> list:
>> 1. My understanding is that the hardware's ability to do MCE recovery is
>> limited/best-effort, and is not guaranteed. There can be circumstances
>> that cause a "Processor Context Corrupt" state, which is unrecoverable.
>
> Well, then they have to work on improving the hardware. Because having HW
> that just sometimes gets stuck instead of reporting bad storage is simply
> not acceptable. And no matter how hard you try you cannot avoid MCEs from
> OS when accessing persistent memory so OS just has no way to avoid that
> risk.
>
>> 2. We still need to maintain a badblocks list so that we know what
>> blocks need to be cleared (via the ACPI method) on writes.
>
> Well, why cannot we just do the write, see whether we got CMCI and if yes,
> clear the error via the ACPI method?

I would need to check if you get the address reported in the CMCI, but
it would only fire if the write triggered a read-modify-write cycle. I
suspect most copies to pmem, through something like
arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
asynchronously, so what data do you write after clearing the error?
There may have been more writes while the CMCI was being delivered.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-20  9:24                       ` Yasunori Goto
  (?)
@ 2017-01-21  0:23                           ` Kani, Toshimitsu
  -1 siblings, 0 replies; 89+ messages in thread
From: Kani, Toshimitsu @ 2017-01-21  0:23 UTC (permalink / raw)
  To: y-goto-+CUm20s59erQFUHtdCDX3A, vishal.l.verma-ral2JQCrhuEAvxtiuMwx3w
  Cc: adilger-m1MBpc4rdrD3fQ9qLvQP4Q, Vyacheslav.Dubeyko-Sjgp3cTcYWE,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	andiry-Re5JQEeQqe8AvxtiuMwx3w, slava-yeENwD64cLxBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, 2017-01-20 at 18:24 +0900, Yasunori Goto wrote:
 :
> > 
> > Like mentioned before, this discussion is more about presentation
> > of errors in a known consumable format, rather than recovering from
> > errors. While recovering from errors is interesting, we already
> > have layers like RAID for that, and they are as applicable to
> > NVDIMM backed storage as they have been for disk/SSD based storage.
> 
> I have one question here.
> 
> Certainly, user can use LVM mirroring for storage mode of NVDIMM.
> However, NVDIMM has DAX mode. 
> Can user use LVM mirroring for NVDIMM DAX mode?
> I could not find any information that LVM support DAX....

dm-linear and dm-stripe support DAX.  This is done by mapping block
allocations to LVM physical devices.  Once blocks are allocated, all
DAX I/Os are direct and do not go through the device-mapper layer.  We
may be able to change it for read/write paths, but it remains true for
mmap.  So, I do not think DAX can be supported with LVM mirroring. 
This does not preclude hardware mirroring, though.

-Toshi
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-21  0:23                           ` Kani, Toshimitsu
  0 siblings, 0 replies; 89+ messages in thread
From: Kani, Toshimitsu @ 2017-01-21  0:23 UTC (permalink / raw)
  To: y-goto, vishal.l.verma
  Cc: Vyacheslav.Dubeyko, linux-block, lsf-pc, linux-nvdimm, slava,
	adilger, darrick.wong, linux-fsdevel, andiry

T24gRnJpLCAyMDE3LTAxLTIwIGF0IDE4OjI0ICswOTAwLCBZYXN1bm9yaSBHb3RvIHdyb3RlOg0K
IDoNCj4gPiANCj4gPiBMaWtlIG1lbnRpb25lZCBiZWZvcmUsIHRoaXMgZGlzY3Vzc2lvbiBpcyBt
b3JlIGFib3V0IHByZXNlbnRhdGlvbg0KPiA+IG9mIGVycm9ycyBpbiBhIGtub3duIGNvbnN1bWFi
bGUgZm9ybWF0LCByYXRoZXIgdGhhbiByZWNvdmVyaW5nIGZyb20NCj4gPiBlcnJvcnMuIFdoaWxl
IHJlY292ZXJpbmcgZnJvbSBlcnJvcnMgaXMgaW50ZXJlc3RpbmcsIHdlIGFscmVhZHkNCj4gPiBo
YXZlIGxheWVycyBsaWtlIFJBSUQgZm9yIHRoYXQsIGFuZCB0aGV5IGFyZSBhcyBhcHBsaWNhYmxl
IHRvDQo+ID4gTlZESU1NIGJhY2tlZCBzdG9yYWdlIGFzIHRoZXkgaGF2ZSBiZWVuIGZvciBkaXNr
L1NTRCBiYXNlZCBzdG9yYWdlLg0KPiANCj4gSSBoYXZlIG9uZSBxdWVzdGlvbiBoZXJlLg0KPiAN
Cj4gQ2VydGFpbmx5LCB1c2VyIGNhbiB1c2UgTFZNIG1pcnJvcmluZyBmb3Igc3RvcmFnZSBtb2Rl
IG9mIE5WRElNTS4NCj4gSG93ZXZlciwgTlZESU1NIGhhcyBEQVggbW9kZS7CoA0KPiBDYW4gdXNl
ciB1c2UgTFZNIG1pcnJvcmluZyBmb3IgTlZESU1NIERBWCBtb2RlPw0KPiBJIGNvdWxkIG5vdCBm
aW5kIGFueSBpbmZvcm1hdGlvbiB0aGF0IExWTSBzdXBwb3J0IERBWC4uLi4NCg0KZG0tbGluZWFy
IGFuZCBkbS1zdHJpcGUgc3VwcG9ydCBEQVguICBUaGlzIGlzIGRvbmUgYnkgbWFwcGluZyBibG9j
aw0KYWxsb2NhdGlvbnMgdG8gTFZNIHBoeXNpY2FsIGRldmljZXMuICBPbmNlIGJsb2NrcyBhcmUg
YWxsb2NhdGVkLCBhbGwNCkRBWCBJL09zIGFyZSBkaXJlY3QgYW5kIGRvIG5vdCBnbyB0aHJvdWdo
IHRoZSBkZXZpY2UtbWFwcGVyIGxheWVyLiAgV2UNCm1heSBiZSBhYmxlIHRvIGNoYW5nZSBpdCBm
b3IgcmVhZC93cml0ZSBwYXRocywgYnV0IGl0IHJlbWFpbnMgdHJ1ZSBmb3INCm1tYXAuICBTbywg
SSBkbyBub3QgdGhpbmsgREFYIGNhbiBiZSBzdXBwb3J0ZWQgd2l0aCBMVk0gbWlycm9yaW5nLiAN
ClRoaXMgZG9lcyBub3QgcHJlY2x1ZGUgaGFyZHdhcmUgbWlycm9yaW5nLCB0aG91Z2guDQoNCi1U
b3NoaQ0K

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-21  0:23                           ` Kani, Toshimitsu
  0 siblings, 0 replies; 89+ messages in thread
From: Kani, Toshimitsu @ 2017-01-21  0:23 UTC (permalink / raw)
  To: y-goto, vishal.l.verma
  Cc: Vyacheslav.Dubeyko, linux-block, lsf-pc, linux-nvdimm, slava,
	adilger, darrick.wong, linux-fsdevel, andiry

On Fri, 2017-01-20 at 18:24 +0900, Yasunori Goto wrote:
 :
> > 
> > Like mentioned before, this discussion is more about presentation
> > of errors in a known consumable format, rather than recovering from
> > errors. While recovering from errors is interesting, we already
> > have layers like RAID for that, and they are as applicable to
> > NVDIMM backed storage as they have been for disk/SSD based storage.
> 
> I have one question here.
> 
> Certainly, user can use LVM mirroring for storage mode of NVDIMM.
> However, NVDIMM has DAX mode. 
> Can user use LVM mirroring for NVDIMM DAX mode?
> I could not find any information that LVM support DAX....

dm-linear and dm-stripe support DAX.  This is done by mapping block
allocations to LVM physical devices.  Once blocks are allocated, all
DAX I/Os are direct and do not go through the device-mapper layer.  We
may be able to change it for read/write paths, but it remains true for
mmap.  So, I do not think DAX can be supported with LVM mirroring. 
This does not preclude hardware mirroring, though.

-Toshi

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-20 15:42                       ` Dan Williams
@ 2017-01-24  7:46                         ` Jan Kara
  -1 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-24  7:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Andiry Xu,
	Viacheslav Dubeyko, Linux FS Devel, lsf-pc

On Fri 20-01-17 07:42:09, Dan Williams wrote:
> On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> >> On 01/18, Jan Kara wrote:
> >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> >> > mostly interested in. We could possibly do something more efficient than
> >> > what NVDIMM driver does however the complexity would be relatively high and
> >> > frankly I'm far from convinced this is really worth it. If there are so
> >> > many badblocks this would matter, the HW has IMHO bigger problems than
> >> > performance.
> >>
> >> Correct, and Dave was of the opinion that once at least XFS has reverse
> >> mapping support (which it does now), adding badblocks information to
> >> that should not be a hard lift, and should be a better solution. I
> >> suppose should try to benchmark how much of a penalty the current badblock
> >> checking in the NVVDIMM driver imposes. The penalty is not because there
> >> may be a large number of badblocks, but just due to the fact that we
> >> have to do this check for every IO, in fact, every 'bvec' in a bio.
> >
> > Well, letting filesystem know is certainly good from error reporting quality
> > POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> > efficient in checking whether current IO overlaps with any of given bad
> > blocks.
> >
> >> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> >> > if the platform can recover from MCE, we can just always access persistent
> >> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> >> > seems to already happen so we just need to make sure all places handle
> >> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> >> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> >> > and in that case who cares about performance when the data is gone...
> >>
> >> Even when we have MCE recovery, we cannot do away with the badblocks
> >> list:
> >> 1. My understanding is that the hardware's ability to do MCE recovery is
> >> limited/best-effort, and is not guaranteed. There can be circumstances
> >> that cause a "Processor Context Corrupt" state, which is unrecoverable.
> >
> > Well, then they have to work on improving the hardware. Because having HW
> > that just sometimes gets stuck instead of reporting bad storage is simply
> > not acceptable. And no matter how hard you try you cannot avoid MCEs from
> > OS when accessing persistent memory so OS just has no way to avoid that
> > risk.
> >
> >> 2. We still need to maintain a badblocks list so that we know what
> >> blocks need to be cleared (via the ACPI method) on writes.
> >
> > Well, why cannot we just do the write, see whether we got CMCI and if yes,
> > clear the error via the ACPI method?
> 
> I would need to check if you get the address reported in the CMCI, but
> it would only fire if the write triggered a read-modify-write cycle. I
> suspect most copies to pmem, through something like
> arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> asynchronously, so what data do you write after clearing the error?
> There may have been more writes while the CMCI was being delivered.

OK, I see. And if we just write new data but don't clear error on write
through the ACPI method, will we still get MCE on following read of that
data? But regardless whether we get MCE or not, I suppose that the memory
location will be still marked as bad in some ACPI table, won't it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-24  7:46                         ` Jan Kara
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Kara @ 2017-01-24  7:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Vishal Verma, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko, Andiry Xu, lsf-pc

On Fri 20-01-17 07:42:09, Dan Williams wrote:
> On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> >> On 01/18, Jan Kara wrote:
> >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> >> > mostly interested in. We could possibly do something more efficient than
> >> > what NVDIMM driver does however the complexity would be relatively high and
> >> > frankly I'm far from convinced this is really worth it. If there are so
> >> > many badblocks this would matter, the HW has IMHO bigger problems than
> >> > performance.
> >>
> >> Correct, and Dave was of the opinion that once at least XFS has reverse
> >> mapping support (which it does now), adding badblocks information to
> >> that should not be a hard lift, and should be a better solution. I
> >> suppose should try to benchmark how much of a penalty the current badblock
> >> checking in the NVVDIMM driver imposes. The penalty is not because there
> >> may be a large number of badblocks, but just due to the fact that we
> >> have to do this check for every IO, in fact, every 'bvec' in a bio.
> >
> > Well, letting filesystem know is certainly good from error reporting quality
> > POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> > efficient in checking whether current IO overlaps with any of given bad
> > blocks.
> >
> >> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> >> > if the platform can recover from MCE, we can just always access persistent
> >> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> >> > seems to already happen so we just need to make sure all places handle
> >> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> >> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> >> > and in that case who cares about performance when the data is gone...
> >>
> >> Even when we have MCE recovery, we cannot do away with the badblocks
> >> list:
> >> 1. My understanding is that the hardware's ability to do MCE recovery is
> >> limited/best-effort, and is not guaranteed. There can be circumstances
> >> that cause a "Processor Context Corrupt" state, which is unrecoverable.
> >
> > Well, then they have to work on improving the hardware. Because having HW
> > that just sometimes gets stuck instead of reporting bad storage is simply
> > not acceptable. And no matter how hard you try you cannot avoid MCEs from
> > OS when accessing persistent memory so OS just has no way to avoid that
> > risk.
> >
> >> 2. We still need to maintain a badblocks list so that we know what
> >> blocks need to be cleared (via the ACPI method) on writes.
> >
> > Well, why cannot we just do the write, see whether we got CMCI and if yes,
> > clear the error via the ACPI method?
> 
> I would need to check if you get the address reported in the CMCI, but
> it would only fire if the write triggered a read-modify-write cycle. I
> suspect most copies to pmem, through something like
> arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> asynchronously, so what data do you write after clearing the error?
> There may have been more writes while the CMCI was being delivered.

OK, I see. And if we just write new data but don't clear error on write
through the ACPI method, will we still get MCE on following read of that
data? But regardless whether we get MCE or not, I suppose that the memory
location will be still marked as bad in some ACPI table, won't it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
  2017-01-24  7:46                         ` Jan Kara
@ 2017-01-24 19:59                           ` Vishal Verma
  -1 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-24 19:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Slava Dubeyko, Darrick J. Wong, linux-nvdimm@lists.01.org,
	linux-block, Andiry Xu, Viacheslav Dubeyko, Linux FS Devel,
	lsf-pc

On 01/24, Jan Kara wrote:
> On Fri 20-01-17 07:42:09, Dan Williams wrote:
> > On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <jack@suse.cz> wrote:
> > > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> > >> On 01/18, Jan Kara wrote:
> > >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> > >> > mostly interested in. We could possibly do something more efficient than
> > >> > what NVDIMM driver does however the complexity would be relatively high and
> > >> > frankly I'm far from convinced this is really worth it. If there are so
> > >> > many badblocks this would matter, the HW has IMHO bigger problems than
> > >> > performance.
> > >>
> > >> Correct, and Dave was of the opinion that once at least XFS has reverse
> > >> mapping support (which it does now), adding badblocks information to
> > >> that should not be a hard lift, and should be a better solution. I
> > >> suppose should try to benchmark how much of a penalty the current badblock
> > >> checking in the NVVDIMM driver imposes. The penalty is not because there
> > >> may be a large number of badblocks, but just due to the fact that we
> > >> have to do this check for every IO, in fact, every 'bvec' in a bio.
> > >
> > > Well, letting filesystem know is certainly good from error reporting quality
> > > POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> > > efficient in checking whether current IO overlaps with any of given bad
> > > blocks.
> > >
> > >> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> > >> > if the platform can recover from MCE, we can just always access persistent
> > >> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> > >> > seems to already happen so we just need to make sure all places handle
> > >> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> > >> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> > >> > and in that case who cares about performance when the data is gone...
> > >>
> > >> Even when we have MCE recovery, we cannot do away with the badblocks
> > >> list:
> > >> 1. My understanding is that the hardware's ability to do MCE recovery is
> > >> limited/best-effort, and is not guaranteed. There can be circumstances
> > >> that cause a "Processor Context Corrupt" state, which is unrecoverable.
> > >
> > > Well, then they have to work on improving the hardware. Because having HW
> > > that just sometimes gets stuck instead of reporting bad storage is simply
> > > not acceptable. And no matter how hard you try you cannot avoid MCEs from
> > > OS when accessing persistent memory so OS just has no way to avoid that
> > > risk.
> > >
> > >> 2. We still need to maintain a badblocks list so that we know what
> > >> blocks need to be cleared (via the ACPI method) on writes.
> > >
> > > Well, why cannot we just do the write, see whether we got CMCI and if yes,
> > > clear the error via the ACPI method?
> > 
> > I would need to check if you get the address reported in the CMCI, but
> > it would only fire if the write triggered a read-modify-write cycle. I
> > suspect most copies to pmem, through something like
> > arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> > asynchronously, so what data do you write after clearing the error?
> > There may have been more writes while the CMCI was being delivered.
> 
> OK, I see. And if we just write new data but don't clear error on write
> through the ACPI method, will we still get MCE on following read of that
> data? But regardless whether we get MCE or not, I suppose that the memory
> location will be still marked as bad in some ACPI table, won't it?

Correct, the location will continue to result in MCEs on reads if it isn't
marked as clear explicitly. I'm not sure that there is an ACPI table
that keeps a list of bad locations, it is just a poison bit in the cache
line, and presumable DIMMs will have some internal data structures that
also mark bad locations.

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-24 19:59                           ` Vishal Verma
  0 siblings, 0 replies; 89+ messages in thread
From: Vishal Verma @ 2017-01-24 19:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Slava Dubeyko, Darrick J. Wong,
	linux-nvdimm@lists.01.org, linux-block, Linux FS Devel,
	Viacheslav Dubeyko, Andiry Xu, lsf-pc

On 01/24, Jan Kara wrote:
> On Fri 20-01-17 07:42:09, Dan Williams wrote:
> > On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara <jack@suse.cz> wrote:
> > > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> > >> On 01/18, Jan Kara wrote:
> > >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> > >> > mostly interested in. We could possibly do something more efficient than
> > >> > what NVDIMM driver does however the complexity would be relatively high and
> > >> > frankly I'm far from convinced this is really worth it. If there are so
> > >> > many badblocks this would matter, the HW has IMHO bigger problems than
> > >> > performance.
> > >>
> > >> Correct, and Dave was of the opinion that once at least XFS has reverse
> > >> mapping support (which it does now), adding badblocks information to
> > >> that should not be a hard lift, and should be a better solution. I
> > >> suppose should try to benchmark how much of a penalty the current badblock
> > >> checking in the NVVDIMM driver imposes. The penalty is not because there
> > >> may be a large number of badblocks, but just due to the fact that we
> > >> have to do this check for every IO, in fact, every 'bvec' in a bio.
> > >
> > > Well, letting filesystem know is certainly good from error reporting quality
> > > POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> > > efficient in checking whether current IO overlaps with any of given bad
> > > blocks.
> > >
> > >> > Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> > >> > if the platform can recover from MCE, we can just always access persistent
> > >> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> > >> > seems to already happen so we just need to make sure all places handle
> > >> > returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> > >> > No need for bad blocks list at all, no slow down unless we hit a bad cell
> > >> > and in that case who cares about performance when the data is gone...
> > >>
> > >> Even when we have MCE recovery, we cannot do away with the badblocks
> > >> list:
> > >> 1. My understanding is that the hardware's ability to do MCE recovery is
> > >> limited/best-effort, and is not guaranteed. There can be circumstances
> > >> that cause a "Processor Context Corrupt" state, which is unrecoverable.
> > >
> > > Well, then they have to work on improving the hardware. Because having HW
> > > that just sometimes gets stuck instead of reporting bad storage is simply
> > > not acceptable. And no matter how hard you try you cannot avoid MCEs from
> > > OS when accessing persistent memory so OS just has no way to avoid that
> > > risk.
> > >
> > >> 2. We still need to maintain a badblocks list so that we know what
> > >> blocks need to be cleared (via the ACPI method) on writes.
> > >
> > > Well, why cannot we just do the write, see whether we got CMCI and if yes,
> > > clear the error via the ACPI method?
> > 
> > I would need to check if you get the address reported in the CMCI, but
> > it would only fire if the write triggered a read-modify-write cycle. I
> > suspect most copies to pmem, through something like
> > arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> > asynchronously, so what data do you write after clearing the error?
> > There may have been more writes while the CMCI was being delivered.
> 
> OK, I see. And if we just write new data but don't clear error on write
> through the ACPI method, will we still get MCE on following read of that
> data? But regardless whether we get MCE or not, I suppose that the memory
> location will be still marked as bad in some ACPI table, won't it?

Correct, the location will continue to result in MCEs on reads if it isn't
marked as clear explicitly. I'm not sure that there is an ACPI table
that keeps a list of bad locations, it is just a poison bit in the cache
line, and presumable DIMMs will have some internal data structures that
also mark bad locations.

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-13 21:40 ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-13 21:40 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-block, linux-fsdevel, linux-nvdimm

The current implementation of badblocks, where we consult the badblocks
list for every IO in the block driver works, and is a last option
failsafe, but from a user perspective, it isn't the easiest interface to
work with.

A while back, Dave Chinner had suggested a move towards smarter
handling, and I posted initial RFC patches [1], but since then the topic
hasn't really moved forward.

I'd like to propose and have a discussion about the following new
functionality:

1. Filesystems develop a native representation of badblocks. For
example, in xfs, this would (presumably) be linked to the reverse
mapping btree. The filesystem representation has the potential to be 
more efficient than the block driver doing the check, as the fs can
check the IO happening on a file against just that file's range. In
contrast, today, the block driver checks against the whole block device
range for every IO. On encountering badblocks, the filesystem can
generate a better notification/error message that points the user to 
(file, offset) as opposed to the block driver, which can only provide
(block-device, sector).

2. The block layer adds a notifier to badblock addition/removal
operations, which the filesystem subscribes to, and uses to maintain its
badblocks accounting. (This part is implemented as a proof of concept in
the RFC mentioned above [1]).

3. The filesystem has a way of telling the block driver (a flag? a
different/new interface?) that it is responsible for badblock checking
so that the driver doesn't have to do its check. The driver checking
will have to remain in place as a catch-all for filesystems/interfaces
that don't or aren't capable of doing the checks at a higher layer.


Additionally, I saw some discussion about logical depop on the lists
again, and I was involved with discussions last year about expanding the
the badblocks infrastructure for this use. If that is a topic again this
year, I'd like to be involved in it too.

I'm also interested in participating in any other pmem/NVDIMM related
discussions.


Thank you,
	-Vishal


[1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-13 21:40 ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-13 21:40 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-nvdimm, linux-block, linux-fsdevel

VGhlIGN1cnJlbnQgaW1wbGVtZW50YXRpb24gb2YgYmFkYmxvY2tzLCB3aGVyZSB3ZSBjb25zdWx0
IHRoZSBiYWRibG9ja3MNCmxpc3QgZm9yIGV2ZXJ5IElPIGluIHRoZSBibG9jayBkcml2ZXIgd29y
a3MsIGFuZCBpcyBhIGxhc3Qgb3B0aW9uDQpmYWlsc2FmZSwgYnV0IGZyb20gYSB1c2VyIHBlcnNw
ZWN0aXZlLCBpdCBpc24ndCB0aGUgZWFzaWVzdCBpbnRlcmZhY2UgdG8NCndvcmsgd2l0aC4NCg0K
QSB3aGlsZSBiYWNrLCBEYXZlIENoaW5uZXIgaGFkIHN1Z2dlc3RlZCBhIG1vdmUgdG93YXJkcyBz
bWFydGVyDQpoYW5kbGluZywgYW5kIEkgcG9zdGVkIGluaXRpYWwgUkZDIHBhdGNoZXMgWzFdLCBi
dXQgc2luY2UgdGhlbiB0aGUgdG9waWMNCmhhc24ndCByZWFsbHkgbW92ZWQgZm9yd2FyZC4NCg0K
SSdkIGxpa2UgdG8gcHJvcG9zZSBhbmQgaGF2ZSBhIGRpc2N1c3Npb24gYWJvdXQgdGhlIGZvbGxv
d2luZyBuZXcNCmZ1bmN0aW9uYWxpdHk6DQoNCjEuIEZpbGVzeXN0ZW1zIGRldmVsb3AgYSBuYXRp
dmUgcmVwcmVzZW50YXRpb24gb2YgYmFkYmxvY2tzLiBGb3INCmV4YW1wbGUsIGluIHhmcywgdGhp
cyB3b3VsZCAocHJlc3VtYWJseSkgYmUgbGlua2VkIHRvIHRoZSByZXZlcnNlDQptYXBwaW5nIGJ0
cmVlLiBUaGUgZmlsZXN5c3RlbSByZXByZXNlbnRhdGlvbiBoYXMgdGhlIHBvdGVudGlhbCB0byBi
ZcKgDQptb3JlIGVmZmljaWVudCB0aGFuIHRoZSBibG9jayBkcml2ZXIgZG9pbmcgdGhlIGNoZWNr
LCBhcyB0aGUgZnMgY2FuDQpjaGVjayB0aGUgSU8gaGFwcGVuaW5nIG9uIGEgZmlsZSBhZ2FpbnN0
IGp1c3QgdGhhdCBmaWxlJ3MgcmFuZ2UuIEluDQpjb250cmFzdCwgdG9kYXksIHRoZSBibG9jayBk
cml2ZXIgY2hlY2tzIGFnYWluc3QgdGhlIHdob2xlIGJsb2NrIGRldmljZQ0KcmFuZ2UgZm9yIGV2
ZXJ5IElPLiBPbiBlbmNvdW50ZXJpbmcgYmFkYmxvY2tzLCB0aGUgZmlsZXN5c3RlbSBjYW4NCmdl
bmVyYXRlIGEgYmV0dGVyIG5vdGlmaWNhdGlvbi9lcnJvciBtZXNzYWdlIHRoYXQgcG9pbnRzIHRo
ZSB1c2VyIHRvwqANCihmaWxlLCBvZmZzZXQpIGFzIG9wcG9zZWQgdG8gdGhlIGJsb2NrIGRyaXZl
ciwgd2hpY2ggY2FuIG9ubHkgcHJvdmlkZQ0KKGJsb2NrLWRldmljZSwgc2VjdG9yKS4NCg0KMi4g
VGhlIGJsb2NrIGxheWVyIGFkZHMgYSBub3RpZmllciB0byBiYWRibG9jayBhZGRpdGlvbi9yZW1v
dmFsDQpvcGVyYXRpb25zLCB3aGljaCB0aGUgZmlsZXN5c3RlbSBzdWJzY3JpYmVzIHRvLCBhbmQg
dXNlcyB0byBtYWludGFpbiBpdHMNCmJhZGJsb2NrcyBhY2NvdW50aW5nLiAoVGhpcyBwYXJ0IGlz
IGltcGxlbWVudGVkIGFzIGEgcHJvb2Ygb2YgY29uY2VwdCBpbg0KdGhlIFJGQyBtZW50aW9uZWQg
YWJvdmUgWzFdKS4NCg0KMy4gVGhlIGZpbGVzeXN0ZW0gaGFzIGEgd2F5IG9mIHRlbGxpbmcgdGhl
IGJsb2NrIGRyaXZlciAoYSBmbGFnPyBhDQpkaWZmZXJlbnQvbmV3IGludGVyZmFjZT8pIHRoYXQg
aXQgaXMgcmVzcG9uc2libGUgZm9yIGJhZGJsb2NrIGNoZWNraW5nDQpzbyB0aGF0IHRoZSBkcml2
ZXIgZG9lc24ndCBoYXZlIHRvIGRvIGl0cyBjaGVjay4gVGhlIGRyaXZlciBjaGVja2luZw0Kd2ls
bCBoYXZlIHRvIHJlbWFpbiBpbiBwbGFjZSBhcyBhIGNhdGNoLWFsbCBmb3IgZmlsZXN5c3RlbXMv
aW50ZXJmYWNlcw0KdGhhdCBkb24ndCBvciBhcmVuJ3QgY2FwYWJsZSBvZiBkb2luZyB0aGUgY2hl
Y2tzIGF0IGEgaGlnaGVyIGxheWVyLg0KDQoNCkFkZGl0aW9uYWxseSwgSSBzYXcgc29tZSBkaXNj
dXNzaW9uIGFib3V0IGxvZ2ljYWwgZGVwb3Agb24gdGhlIGxpc3RzDQphZ2FpbiwgYW5kIEkgd2Fz
IGludm9sdmVkIHdpdGggZGlzY3Vzc2lvbnMgbGFzdCB5ZWFyIGFib3V0IGV4cGFuZGluZyB0aGUN
CnRoZSBiYWRibG9ja3MgaW5mcmFzdHJ1Y3R1cmUgZm9yIHRoaXMgdXNlLiBJZiB0aGF0IGlzIGEg
dG9waWMgYWdhaW4gdGhpcw0KeWVhciwgSSdkIGxpa2UgdG8gYmUgaW52b2x2ZWQgaW4gaXQgdG9v
Lg0KDQpJJ20gYWxzbyBpbnRlcmVzdGVkIGluIHBhcnRpY2lwYXRpbmcgaW4gYW55IG90aGVyIHBt
ZW0vTlZESU1NIHJlbGF0ZWQNCmRpc2N1c3Npb25zLg0KDQoNClRoYW5rIHlvdSwNCgktVmlzaGFs
DQoNCg0KWzFdOiBodHRwOi8vd3d3LmxpbnV4LnNnaS5jb20vYXJjaGl2ZXMveGZzLzIwMTYtMDYv
bXNnMDAyOTkuaHRtbA==

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [LSF/MM TOPIC] Badblocks checking/representation in filesystems
@ 2017-01-13 21:40 ` Verma, Vishal L
  0 siblings, 0 replies; 89+ messages in thread
From: Verma, Vishal L @ 2017-01-13 21:40 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-nvdimm, linux-block, linux-fsdevel

The current implementation of badblocks, where we consult the badblocks
list for every IO in the block driver works, and is a last option
failsafe, but from a user perspective, it isn't the easiest interface to
work with.

A while back, Dave Chinner had suggested a move towards smarter
handling, and I posted initial RFC patches [1], but since then the topic
hasn't really moved forward.

I'd like to propose and have a discussion about the following new
functionality:

1. Filesystems develop a native representation of badblocks. For
example, in xfs, this would (presumably) be linked to the reverse
mapping btree. The filesystem representation has the potential to be 
more efficient than the block driver doing the check, as the fs can
check the IO happening on a file against just that file's range. In
contrast, today, the block driver checks against the whole block device
range for every IO. On encountering badblocks, the filesystem can
generate a better notification/error message that points the user to 
(file, offset) as opposed to the block driver, which can only provide
(block-device, sector).

2. The block layer adds a notifier to badblock addition/removal
operations, which the filesystem subscribes to, and uses to maintain its
badblocks accounting. (This part is implemented as a proof of concept in
the RFC mentioned above [1]).

3. The filesystem has a way of telling the block driver (a flag? a
different/new interface?) that it is responsible for badblock checking
so that the driver doesn't have to do its check. The driver checking
will have to remain in place as a catch-all for filesystems/interfaces
that don't or aren't capable of doing the checks at a higher layer.


Additionally, I saw some discussion about logical depop on the lists
again, and I was involved with discussions last year about expanding the
the badblocks infrastructure for this use. If that is a topic again this
year, I'd like to be involved in it too.

I'm also interested in participating in any other pmem/NVDIMM related
discussions.


Thank you,
	-Vishal


[1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2017-01-24 20:00 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <at1mp6pou4lenesjdgh22k4p.1484345585589@email.android.com>
     [not found] ` <b9rbflutjt10mb4ofherta8j.1484345610771@email.android.com>
2017-01-14  0:00   ` [LSF/MM TOPIC] Badblocks checking/representation in filesystems Slava Dubeyko
2017-01-14  0:00     ` Slava Dubeyko
2017-01-14  0:00     ` Slava Dubeyko
2017-01-14  0:49     ` Vishal Verma
2017-01-14  0:49       ` Vishal Verma
2017-01-16  2:27       ` Slava Dubeyko
2017-01-16  2:27         ` Slava Dubeyko
2017-01-16  2:27         ` Slava Dubeyko
2017-01-17 14:37         ` [Lsf-pc] " Jan Kara
2017-01-17 14:37           ` Jan Kara
2017-01-17 15:08           ` Christoph Hellwig
2017-01-17 15:08             ` Christoph Hellwig
2017-01-17 22:14           ` Vishal Verma
2017-01-17 22:14             ` Vishal Verma
2017-01-18 10:16             ` Jan Kara
2017-01-18 10:16               ` Jan Kara
2017-01-18 20:39               ` Jeff Moyer
2017-01-18 20:39                 ` Jeff Moyer
2017-01-18 21:02                 ` Darrick J. Wong
2017-01-18 21:02                   ` Darrick J. Wong
2017-01-18 21:32                   ` Dan Williams
2017-01-18 21:32                     ` Dan Williams
     [not found]                     ` <CAPcyv4hd7bpCa7d9msX0Y8gLz7WsqXT3VExQwwLuAcsmMxVTPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-18 21:56                       ` Verma, Vishal L
2017-01-18 21:56                         ` Verma, Vishal L
2017-01-18 21:56                         ` Verma, Vishal L
     [not found]                         ` <1484776549.4358.33.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-01-19  8:10                           ` Jan Kara
2017-01-19  8:10                             ` Jan Kara
     [not found]                             ` <20170119081011.GA2565-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2017-01-19 18:59                               ` Vishal Verma
2017-01-19 18:59                                 ` Vishal Verma
     [not found]                                 ` <20170119185910.GF4880-PxNA6LsHknajYZd8rzuJLNh3ngVCH38I@public.gmane.org>
2017-01-19 19:03                                   ` Dan Williams
2017-01-19 19:03                                     ` Dan Williams
     [not found]                                     ` <CAPcyv4jZz_iqLutd0gPEL3udqbFxvBH8CZY5oDgUjG5dGbC2gg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-20  9:03                                       ` Jan Kara
2017-01-20  9:03                                         ` Jan Kara
2017-01-17 23:15           ` Slava Dubeyko
2017-01-17 23:15             ` Slava Dubeyko
2017-01-17 23:15             ` Slava Dubeyko
2017-01-18 20:47             ` Jeff Moyer
2017-01-18 20:47               ` Jeff Moyer
2017-01-19  2:56               ` Slava Dubeyko
2017-01-19  2:56                 ` Slava Dubeyko
2017-01-19  2:56                 ` Slava Dubeyko
2017-01-19 19:33                 ` Jeff Moyer
2017-01-19 19:33                   ` Jeff Moyer
2017-01-17  6:33       ` Darrick J. Wong
2017-01-17  6:33         ` Darrick J. Wong
2017-01-17 21:35         ` Vishal Verma
2017-01-17 21:35           ` Vishal Verma
2017-01-17 22:15           ` Andiry Xu
2017-01-17 22:15             ` Andiry Xu
2017-01-17 22:37             ` Vishal Verma
2017-01-17 22:37               ` Vishal Verma
2017-01-17 23:20               ` Andiry Xu
2017-01-17 23:20                 ` Andiry Xu
2017-01-17 23:51                 ` Vishal Verma
2017-01-17 23:51                   ` Vishal Verma
2017-01-18  1:58                   ` Andiry Xu
2017-01-18  1:58                     ` Andiry Xu
     [not found]                     ` <CAOvWMLZCt39EDg-1uppVVUeRG40JvOo9sKLY2XMuynZdnc0W9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-20  0:32                       ` Verma, Vishal L
2017-01-20  0:32                         ` Verma, Vishal L
2017-01-20  0:32                         ` Verma, Vishal L
2017-01-18  9:38               ` [Lsf-pc] " Jan Kara
2017-01-18  9:38                 ` Jan Kara
2017-01-19 21:17                 ` Vishal Verma
2017-01-19 21:17                   ` Vishal Verma
2017-01-20  9:47                   ` Jan Kara
2017-01-20  9:47                     ` Jan Kara
2017-01-20 15:42                     ` Dan Williams
2017-01-20 15:42                       ` Dan Williams
2017-01-24  7:46                       ` Jan Kara
2017-01-24  7:46                         ` Jan Kara
2017-01-24 19:59                         ` Vishal Verma
2017-01-24 19:59                           ` Vishal Verma
2017-01-18  0:16             ` Andreas Dilger
2017-01-18  2:01               ` Andiry Xu
2017-01-18  2:01                 ` Andiry Xu
2017-01-18  3:08                 ` Lu Zhang
2017-01-18  3:08                   ` Lu Zhang
2017-01-20  0:46                   ` Vishal Verma
2017-01-20  0:46                     ` Vishal Verma
2017-01-20  9:24                     ` Yasunori Goto
2017-01-20  9:24                       ` Yasunori Goto
     [not found]                       ` <20170120182435.0E12.E1E9C6FF-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2017-01-21  0:23                         ` Kani, Toshimitsu
2017-01-21  0:23                           ` Kani, Toshimitsu
2017-01-21  0:23                           ` Kani, Toshimitsu
2017-01-20  0:55                 ` Verma, Vishal L
2017-01-20  0:55                   ` Verma, Vishal L
2017-01-13 21:40 Verma, Vishal L
2017-01-13 21:40 ` Verma, Vishal L
2017-01-13 21:40 ` Verma, Vishal L

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.