From: Vishal Verma <vishal.l.verma@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Slava Dubeyko <Vyacheslav.Dubeyko@wdc.com>,
    "Darrick J. Wong" <darrick.wong@oracle.com>,
    "linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
    "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
    Linux FS Devel <linux-fsdevel@vger.kernel.org>,
    Viacheslav Dubeyko <slava@dubeyko.com>,
    Andiry Xu <andiry@gmail.com>,
    "lsf-pc@lists.linux-foundation.org" <lsf-pc@lists.linux-foundation.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Date: Thu, 19 Jan 2017 14:17:19 -0700
Message-ID: <20170119211719.GG4880@omniknight.lm.intel.com>
In-Reply-To: <20170118093801.GA24789@quack2.suse.cz>

On 01/18, Jan Kara wrote:
> On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > I do mean that in the filesystem, for every IO, the badblocks will be
> > checked. Currently, the pmem driver does this, and the hope is that the
> > filesystem can do a better job at it. The driver unconditionally checks
> > every IO for badblocks on the whole device. Depending on how the
> > badblocks are represented in the filesystem, we might be able to quickly
> > tell if a file/range has existing badblocks, and error out the IO
> > accordingly.
> >
> > At mount, the fs would read the existing badblocks on the block
> > device, and build its own representation of them. Then during normal
> > use, if the underlying badblocks change, the fs would get a notification
> > that would allow it to also update its own representation.
>
> So I believe we have to distinguish three cases so that we are on the same
> page.
>
> 1) PMEM is exposed only via a block interface for legacy filesystems to
> use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
> Looking from outside, the IO either returns with EIO or succeeds. As a
> result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

Correct.

>
> 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> mostly interested in. We could possibly do something more efficient than
> what NVDIMM driver does however the complexity would be relatively high and
> frankly I'm far from convinced this is really worth it. If there are so
> many badblocks this would matter, the HW has IMHO bigger problems than
> performance.

Correct, and Dave was of the opinion that once at least XFS has reverse
mapping support (which it does now), adding badblocks information to that
should not be a hard lift, and should be a better solution.

I suppose we should try to benchmark how much of a penalty the current
badblock checking in the NVDIMM driver imposes. The penalty is not because
there may be a large number of badblocks, but simply because we have to do
this check for every IO, in fact, for every 'bvec' in a bio.
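For reference, the shape of that per-IO check is roughly the following (a
simplified sketch of what the pmem driver does today; accounting and the
actual data copy are elided):

	static bool is_bad_pmem(struct badblocks *bb, sector_t sector,
				unsigned int len)
	{
		sector_t first_bad;
		int num_bad;

		if (!bb->count)
			return false;	/* fast path: no known badblocks */

		/* range lookup in the device-wide badblocks list */
		return !!badblocks_check(bb, sector, len / 512,
					 &first_bad, &num_bad);
	}

	/* ...called from the bio handler for every segment: */
	bio_for_each_segment(bvec, bio, iter) {
		if (is_bad_pmem(&pmem->bb, iter.bi_sector, bvec.bv_len)) {
			bio->bi_error = -EIO;
			break;
		}
		/* ...memcpy to/from pmem for this bvec... */
	}

Even with an empty list that is a call and a branch per bvec; with a
populated list it is a seqlock-protected range lookup per bvec, whether or
not the device has ever seen an error anywhere near that range.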
>
> 3) PMEM filesystem - there things are even more difficult as was already
> noted elsewhere in the thread. But for now I'd like to leave those aside
> not to complicate things too much.

Agreed, that merits consideration and a whole discussion by itself, based
on the points Andiry raised.

>
> Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> if the platform can recover from MCE, we can just always access persistent
> memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> seems to already happen so we just need to make sure all places handle
> returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> No need for bad blocks list at all, no slow down unless we hit a bad cell
> and in that case who cares about performance when the data is gone...

Even when we have MCE recovery, we cannot do away with the badblocks list:

1. My understanding is that the hardware's ability to do MCE recovery is
   limited/best-effort, and is not guaranteed. There can be circumstances
   that cause a "Processor Context Corrupt" state, which is unrecoverable.

2. We still need to maintain a badblocks list so that we know what blocks
   need to be cleared (via the ACPI method) on writes.
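To make point 2 concrete, the write path has to do roughly the following
(a simplified sketch of the clearing logic in the pmem driver; offset
translation and error handling are elided, and pmem_phys_addr /
pmem_virt_addr below are illustrative placeholders, not the real field
names):

	/*
	 * A write that lands on a known-bad range must first clear the
	 * poison through the ACPI clear-error method, then bring the
	 * in-kernel list back in sync. Without the badblocks list we
	 * would not know when to issue the (slow) clear command.
	 */
	if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
		long cleared = nvdimm_clear_poison(dev, pmem_phys_addr, len);

		if (cleared > 0)
			badblocks_clear(&pmem->bb, sector, cleared / 512);
	}
	memcpy_to_pmem(pmem_virt_addr, mem, len);

So even if every read were done through memcpy_mcsafe(), the write side
still needs the list to know when clearing is required.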
>
> For platforms that cannot recover from MCE - just buy better hardware ;).
> Seriously, I have doubts people can seriously use a machine that will
> unavoidably randomly reboot (as there is always a risk you hit error that
> has not been uncovered by background scrub). But maybe for big cloud providers
> the cost savings may offset the inconvenience, I don't know. But still
> for that case a bad blocks handling in NVDIMM code like we do now looks
> good enough?

The current handling is good enough for those systems, yes.

>								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR