From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yw0-f193.google.com ([209.85.161.193]:35207 "EHLO
	mail-yw0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751854AbdARCBQ (ORCPT );
	Tue, 17 Jan 2017 21:01:16 -0500
MIME-Version: 1.0
In-Reply-To: <1BAF6FD6-1FDB-4F7C-A915-891F46E78B8C@dilger.ca>
References: <20170114004910.GA4880@omniknight.lm.intel.com>
	<20170117063355.GL14033@birch.djwong.org>
	<20170117213549.GB4880@omniknight.lm.intel.com>
	<1BAF6FD6-1FDB-4F7C-A915-891F46E78B8C@dilger.ca>
From: Andiry Xu
Date: Tue, 17 Jan 2017 18:01:14 -0800
Message-ID:
Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
To: Andreas Dilger
Cc: Vishal Verma, "Darrick J. Wong", Slava Dubeyko,
	"lsf-pc@lists.linux-foundation.org", "linux-nvdimm@lists.01.org",
	"linux-block@vger.kernel.org", Linux FS Devel, Viacheslav Dubeyko
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L"
>>>>>> To: lsf-pc@lists.linux-foundation.org
>>>>>> Cc: linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
>>>>>>     linux-fsdevel@vger.kernel.org
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver, works and is a
>>>>>>> last-option failsafe, but from a user perspective it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, the FAT and HFS+ specifications contain a description
>>>>>> of a bad blocks (physical sectors) table. I believe this table was
>>>>>> used for the case of floppy media, but it has since become a
>>>>>> completely obsolete artefact because most storage devices are
>>>>>> reliable enough. Why do you need
>>>>
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point. XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> to expose the bad blocks at the file system level? Do you expect the
>>>>>> next generation of NVM memory to be so unreliable that the file
>>>>>> system needs to manage bad blocks? What about erasure coding
>>>>>> schemes? Does the file system really need to suffer from the bad
>>>>>> block issue?
>>>>>>
>>>>>> Usually we use LBAs, and it is the responsibility of the storage
>>>>>> device to map a bad physical block/page/sector to a valid one. Do
>>>>>> you mean that we have direct access to physical NVM memory
>>>>>> addresses? But it looks like we could have a "bad block" issue even
>>>>>> when accessing data in a page cache memory page (if we use NVM
>>>>>> memory for the page cache, of course). So, what do you imply by the
>>>>>> "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space,
>>>>> in the sense that the device is still free to perform remapping of
>>>>> chunks of NVM underneath us. The problem is that when a block or
>>>>> address range (as small as a cache line) goes bad, the device
>>>>> maintains a poison bit for every affected cache line. Behind the
>>>>> scenes, it may have already remapped the range, but the cache line
>>>>> poison has to be kept so that there is a notification to the
>>>>> user/owner of the data that something has been lost. Since NVM is
>>>>> byte-addressable memory sitting on the memory bus, such a poisoned
>>>>> cache line results in memory errors and SIGBUSes, whereas with
>>>>> traditional storage an app will get nice and friendly (relatively
>>>>> speaking..) -EIOs. The whole badblocks implementation was done so
>>>>> that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
>>>>
>>>> "driver" ... you mean XFS? Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
>>>
>>>>> This effort is to try and make this badblock checking smarter - and
>>>>> try and reduce the penalty on every IO to a smaller range, which
>>>>> only the filesystem can do.
>>>>
>>>> Though... now that XFS has merged the reverse mapping support, I've
>>>> been wondering if there'll be a resubmission of the device errors
>>>> callback? It still would be useful to be able to inform the user that
>>>> part of their fs has gone bad, or, better yet, if the buffer is still
>>>> in memory someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion: how much would filesystems want to be involved in this
>>> sort of badblocks handling, if at all? I can refresh my patches that
>>> provide the fs notification, but that's the easy bit, and a starting
>>> point.
>>>
>>
>> I have some questions. Why does moving badblock handling to the file
>> system level avoid the checking phase? At the file system level, for
>> each I/O I still have to check the badblock list, right? Do you mean
>> that during mount it can go through the pmem device, locate all the
>> data structures mangled by badblocks, and handle them accordingly, so
>> that during normal operation the badblocks will never be accessed?
>> Or, if there is replication/snapshot support, use a copy to recover
>> the badblocks?
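(To make my question concrete: as I understand it, the per-IO check
today looks roughly like the sketch below at the driver level. This is
only an illustration against the kernel's badblocks API from
block/badblocks.c, not the actual drivers/nvdimm/pmem.c code;
sketch_do_read() and its arguments are made up, and 512-byte sectors
are assumed.)

    #include <linux/badblocks.h>
    #include <linux/string.h>
    #include <linux/types.h>

    static int sketch_do_read(struct badblocks *bb, void *pmem_addr,
                              void *dst, sector_t sector, unsigned int len)
    {
            sector_t first_bad;
            int num_bad;

            /*
             * If any sector in [sector, sector + len/512) is on the
             * badblocks list, fail the read with -EIO rather than let
             * the memcpy touch poisoned cache lines and take a machine
             * check / SIGBUS.
             */
            if (badblocks_check(bb, sector, len / 512, &first_bad, &num_bad))
                    return -EIO;

            memcpy(dst, pmem_addr, len);
            return 0;
    }

Every read pays this lookup whether or not the device actually has bad
blocks, which is the per-IO penalty being discussed.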
> With ext4 badblocks, the main outcome is that the bad blocks would be
> permanently marked in the allocation bitmap as being used, and they
> would never be allocated to a file, so they should never be accessed
> unless doing a full device scan (which ext4 and e2fsck never do). That
> would avoid the need to check every I/O against the bad blocks list,
> if the driver knows that the filesystem will handle this.
>

Thank you for the explanation. However, this only works for free
blocks, right? What about allocated blocks, like file data and
metadata?

Thanks,
Andiry

> The one caveat is that ext4 only allows 32-bit block numbers in the
> badblocks list, since this feature hasn't been used in a long time.
> This is good for up to 16TB filesystems, but if there was a demand to
> use this feature again it would be possible to allow 64-bit block
> numbers.
>
> Cheers, Andreas
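P.S. A toy sketch of the bitmap scheme Andreas describes, to make the
mechanism concrete (illustrative names only, not actual ext4 code;
assumes 4KB blocks and a one-bit-per-block allocation bitmap):

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Mark each known-bad block "in use" in the allocation bitmap once,
     * at mkfs/fsck time. The normal allocator then skips them forever,
     * so no per-I/O badblocks lookup is needed for newly written data.
     */
    static void reserve_bad_blocks(uint8_t *block_bitmap,
                                   const uint32_t *bad, size_t nr_bad)
    {
            for (size_t i = 0; i < nr_bad; i++)
                    block_bitmap[bad[i] >> 3] |= (uint8_t)(1u << (bad[i] & 7));
    }

    /* The 32-bit limit above: 2^32 blocks * 4KB/block = 16TB. */
    static const uint64_t MAX_BYTES_32BIT = (uint64_t)1 << (32 + 12);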