From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-it0-f66.google.com ([209.85.214.66]:35453 "EHLO mail-it0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751555AbdARARn (ORCPT ); Tue, 17 Jan 2017 19:17:43 -0500
Received: by mail-it0-f66.google.com with SMTP id 203so10844ith.2 for ; Tue, 17 Jan 2017 16:17:08 -0800 (PST)
Content-Type: multipart/signed; boundary="Apple-Mail=_29505F4B-67B1-4A11-A0F5-E97EA94963CA"; protocol="application/pgp-signature"; micalg=pgp-sha256
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
From: Andreas Dilger
In-Reply-To:
Date: Tue, 17 Jan 2017 17:16:22 -0700
Cc: Vishal Verma , "Darrick J. Wong" , Slava Dubeyko , "lsf-pc@lists.linux-foundation.org" , "linux-nvdimm@lists.01.org" , "linux-block@vger.kernel.org" , Linux FS Devel , Viacheslav Dubeyko
Message-Id: <1BAF6FD6-1FDB-4F7C-A915-891F46E78B8C@dilger.ca>
References: <20170114004910.GA4880@omniknight.lm.intel.com> <20170117063355.GL14033@birch.djwong.org> <20170117213549.GB4880@omniknight.lm.intel.com>
To: Andiry Xu
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

--Apple-Mail=_29505F4B-67B1-4A11-A0F5-E97EA94963CA
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=us-ascii

On Jan 17, 2017, at 3:15 PM, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma wrote:
>> On 01/16, Darrick J.
Wong wrote:
>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>> On 01/14, Slava Dubeyko wrote:
>>>>>
>>>>> ---- Original Message ----
>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>> From: "Verma, Vishal L"
>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
>>>>>
>>>>>> The current implementation of badblocks, where we consult the
>>>>>> badblocks list for every IO in the block driver works, and is a
>>>>>> last option failsafe, but from a user perspective, it isn't the
>>>>>> easiest interface to work with.
>>>>>
>>>>> As I remember, FAT and HFS+ specifications contain a description of a bad blocks
>>>>> (physical sectors) table. I believe that this table was used for the case of
>>>>> floppy media. But, finally, this table has become a completely obsolete
>>>>> artefact because most storage devices are reliable enough. Why do you need
>>>
>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>> vestigial organ at this point. XFS doesn't have anything to track bad
>>> blocks currently....
>>>
>>>>> in exposing the bad blocks on the file system level? Do you expect that next
>>>>> generation of NVM memory will be so unreliable that file system needs to manage
>>>>> bad blocks? What about erasure coding schemes? Do file systems really need to suffer
>>>>> from the bad block issue?
>>>>>
>>>>> Usually, we are using LBAs and it is the responsibility of the storage device to map
>>>>> a bad physical block/page/sector to a valid one. Do you mean that we have
>>>>> access to physical NVM memory address directly?
But it looks like we can
>>>>> have a "bad block" issue even if we access data in the page cache's memory
>>>>> page (if we use NVM memory for page cache, of course). So, what do you
>>>>> imply by "bad block" issue?
>>>>
>>>> We don't have direct physical access to the device's address space, in
>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>> underneath us. The problem is that when a block or address range (as
>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>> every affected cache line. Behind the scenes, it may have already
>>>> remapped the range, but the cache line poison has to be kept so that
>>>> there is a notification to the user/owner of the data that something has
>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes,
>>>> compared to traditional storage, where an app will get nice and friendly
>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>> catch these, the reads will turn into a memory bus access, and the
>>>> poison will cause a SIGBUS.
>>>
>>> "driver" ... you mean XFS? Or do you mean the thing that makes pmem
>>> look kind of like a traditional block device? :)
>>
>> Yes, the thing that makes pmem look like a block device :) --
>> drivers/nvdimm/pmem.c
>>
>>>
>>>> This effort is to try and make this badblock checking smarter - and try
>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>> filesystem can do.
>>>
>>> Though... now that XFS merged the reverse mapping support, I've been
>>> wondering if there'll be a resubmission of the device errors callback?
>>> It still would be useful to be able to inform the user that part of
>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>> someplace else, just write it back out.
>>>
>>> Or I suppose if we had some kind of raid1 set up between memories we
>>> could read one of the other copies and rewrite it into the failing
>>> region immediately.
>>
>> Yes, that is kind of what I was hoping to accomplish via this
>> discussion. How much would filesystems want to be involved in this sort
>> of badblocks handling, if at all? I can refresh my patches that provide
>> the fs notification, but that's the easy bit, and a starting point.
>>
>
> I have some questions. Why does moving badblock handling to the file system
> level avoid the checking phase? At the file system level, for each I/O I
> still have to check the badblock list, right? Do you mean that during mount
> it can go through the pmem device, locate all the data structures
> mangled by badblocks, and handle them accordingly, so that during
> normal running the badblocks will never be accessed? Or, if there is
> replication/snapshot support, use a copy to recover the badblocks?

With ext4 badblocks, the main outcome is that the bad blocks would be
permanently marked in the allocation bitmap as being used, and they would
never be allocated to a file, so they should never be accessed unless
doing a full device scan (which ext4 and e2fsck never do). That would
avoid the need to check every I/O against the bad blocks list, if the
driver knows that the filesystem will handle this.

The one caveat is that ext4 only allows 32-bit block numbers in the
badblocks list, since this feature hasn't been used in a long time.
This is good for up to 16TB filesystems, but if there was a demand to
use this feature again it would be possible to allow 64-bit block numbers.

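The two points above can be illustrated with a minimal Python sketch. This is
not ext4 code: `BlockBitmap`, `mark_bad` and `allocate` are hypothetical names
chosen for illustration, and the 4 KiB block size is an assumption (it is the
block size at which 32-bit block numbers cover 16TB).

```python
BLOCK_SIZE = 4096            # assumption: 4 KiB filesystem blocks
MAX_BADBLOCK_NR = 2 ** 32    # block numbers in the list are 32 bits wide


class BlockBitmap:
    """Toy allocation bitmap; True means the block is in use."""

    def __init__(self, nblocks, bad_blocks=()):
        self.used = [False] * nblocks
        self.bad = set()
        for blk in bad_blocks:
            self.mark_bad(blk)

    def mark_bad(self, blk):
        """Pin a bad block as permanently in use."""
        if not 0 <= blk < len(self.used):
            raise ValueError("block number outside the filesystem")
        if blk >= MAX_BADBLOCK_NR:
            raise OverflowError("block number does not fit in 32 bits")
        self.used[blk] = True
        self.bad.add(blk)

    def allocate(self):
        """Return the first free block number, or None if none is free.

        There is no bad-block lookup here: bad blocks already look
        "used", so the allocator skips them like any other used block.
        """
        for blk, in_use in enumerate(self.used):
            if not in_use:
                self.used[blk] = True
                return blk
        return None


# Largest range addressable by the 32-bit list under the 4 KiB assumption.
MAX_COVERED_BYTES = MAX_BADBLOCK_NR * BLOCK_SIZE   # 2**44 bytes = 16 TiB
```

Because the bad blocks are excluded at allocation time, no file ever maps to
them, which is why normal I/O would not need a per-request check.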
Cheers, Andreas

--Apple-Mail=_29505F4B-67B1-4A11-A0F5-E97EA94963CA--