From: Vishal Verma <vishal.l.verma@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Andiry Xu <andiry@gmail.com>,
	Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Date: Thu, 19 Jan 2017 14:17:19 -0700
Message-ID: <20170119211719.GG4880@omniknight.lm.intel.com>
In-Reply-To: <20170118093801.GA24789@quack2.suse.cz>

On 01/18, Jan Kara wrote:
> On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > I do mean that in the filesystem, for every IO, the badblocks will be
> > checked. Currently, the pmem driver does this, and the hope is that the
> > filesystem can do a better job at it. The driver unconditionally checks
> > every IO for badblocks on the whole device. Depending on how the
> > badblocks are represented in the filesystem, we might be able to quickly
> > tell if a file/range has existing badblocks, and error out the IO
> > accordingly.
> > 
> > At mount, the fs would read the existing badblocks on the block
> > device, and build its own representation of them. Then during normal
> > use, if the underlying badblocks change, the fs would get a notification
> > that would allow it to also update its own representation.
> 
> So I believe we have to distinguish three cases so that we are on the same
> page.
> 
> 1) PMEM is exposed only via a block interface for legacy filesystems to
> use. Here, all the bad blocks handling IMO must happen in the NVDIMM driver.
> Looking from outside, the IO either returns with EIO or succeeds. As a
> result, you cannot ever get rid of bad blocks handling in the NVDIMM driver.

Correct.
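
For reference, in this block-interface case the check lives entirely in
the driver's IO path. A minimal sketch in C, modeled loosely on the
is_bad_pmem() helper in drivers/nvdimm/pmem.c (simplified here, not the
literal code):

        /* Consult the device's badblocks list before touching media. */
        static bool is_bad_pmem(struct badblocks *bb, sector_t sector,
                                unsigned int len)
        {
                if (bb->count) {
                        sector_t first_bad;
                        int num_bad;

                        /* nonzero means the range overlaps a bad extent */
                        return !!badblocks_check(bb, sector, len / 512,
                                                 &first_bad, &num_bad);
                }
                return false;
        }

        /* IO path: fail with -EIO rather than touch poisoned media. */
        if (is_bad_pmem(&pmem->bb, sector, len))
                return -EIO;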

> 
> 2) PMEM is exposed for a DAX-aware filesystem. This seems to be what you are
> mostly interested in. We could possibly do something more efficient than
> what the NVDIMM driver does, however the complexity would be relatively
> high, and frankly I'm far from convinced this is really worth it. If there
> are so many badblocks that this would matter, the HW has IMHO bigger
> problems than performance.

Correct, and Dave was of the opinion that once at least XFS has reverse
mapping support (which it does now), adding badblocks information to
that should not be a heavy lift, and should be a better solution. I
suppose we should try to benchmark how much of a penalty the current
badblocks checking in the NVDIMM driver imposes. The penalty is not due
to there being a large number of badblocks, but simply to the fact that
we have to do this check for every IO, in fact for every 'bvec' in a
bio (sketched below).
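
Concretely, that per-bvec cost looks roughly like the following
condensed sketch of a pmem-style make_request loop (names simplified;
bi_error is the 4.x-era field, and is_bad_pmem() is the helper sketched
above):

        struct bio_vec bvec;
        struct bvec_iter iter;

        /* Every segment of every bio goes through a badblocks lookup,
         * whether or not the device has any badblocks at all. */
        bio_for_each_segment(bvec, bio, iter) {
                if (is_bad_pmem(&pmem->bb, iter.bi_sector, bvec.bv_len)) {
                        bio->bi_error = -EIO;
                        break;
                }
                /* ... do the actual copy to/from pmem for this segment ... */
        }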

> 
> 3) PMEM filesystem - there, things are even more difficult, as was already
> noted elsewhere in the thread. But for now I'd like to leave those aside so
> as not to complicate things too much.

Agreed, that merits consideration and a whole discussion by itself,
based on the points Andiry raised.

> 
> Now my question: Why do we bother with badblocks at all? In cases 1) and 2),
> if the platform can recover from MCE, we can just always access persistent
> memory using memcpy_mcsafe() and, if that fails, return -EIO. Actually that
> seems to already happen, so we just need to make sure all places handle the
> returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> No need for a bad blocks list at all, no slowdown unless we hit a bad cell,
> and in that case who cares about performance when the data is gone...
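
To make that concrete, the path described above would reduce to
something like this (a sketch only; memcpy_mcsafe() returns nonzero
when a machine check was consumed, though the exact return convention
has varied across kernel versions):

        /* Copy from pmem; a consumed MCE surfaces as an error return
         * instead of a crash, which we turn into -EIO for the caller. */
        if (memcpy_mcsafe(buf, pmem_addr, len))
                return -EIO;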

Even when we have MCE recovery, we cannot do away with the badblocks
list:
1. My understanding is that the hardware's ability to do MCE recovery is
limited/best-effort, and is not guaranteed. There can be circumstances
that cause a "Processor Context Corrupt" state, which is unrecoverable.
2. We still need to maintain a badblocks list so that we know which
blocks need to be cleared (via the ACPI method) on writes; a rough
sketch of that path follows below.
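
A rough sketch of that clear-on-write path, loosely modeled on the
pmem driver's poison clearing (nvdimm_clear_poison() wraps the
platform's ACPI clear-error method; to_dev() and to_sector() are
hypothetical helpers here, and exact signatures vary by kernel
version):

        /* On a write to a known-bad range, ask the platform to clear
         * the poison, then drop the cleared range from our list. */
        static void clear_on_write(struct pmem_device *pmem,
                                   phys_addr_t phys, unsigned int len)
        {
                long cleared = nvdimm_clear_poison(to_dev(pmem), phys, len);

                if (cleared > 0)
                        badblocks_clear(&pmem->bb, to_sector(pmem, phys),
                                        cleared / 512);
        }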

> 
> For platforms that cannot recover from MCE - just buy better hardware ;).
> Seriously, I doubt people can seriously use a machine that will
> unavoidably, randomly reboot (as there is always a risk of hitting an error
> that has not been uncovered by the background scrub). But maybe for big
> cloud providers the cost savings offset the inconvenience, I don't know.
> Still, for that case, bad blocks handling in the NVDIMM code as we do it
> now looks good enough?

The current handling is good enough for those systems, yes.

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
