From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yb0-f193.google.com ([209.85.213.193]:33218 "EHLO mail-yb0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751825AbdARB6p (ORCPT ); Tue, 17 Jan 2017 20:58:45 -0500 MIME-Version: 1.0 In-Reply-To: <20170117235150.GE4880@omniknight.lm.intel.com> References: <20170114004910.GA4880@omniknight.lm.intel.com> <20170117063355.GL14033@birch.djwong.org> <20170117213549.GB4880@omniknight.lm.intel.com> <20170117223705.GD4880@omniknight.lm.intel.com> <20170117235150.GE4880@omniknight.lm.intel.com> From: Andiry Xu Date: Tue, 17 Jan 2017 17:58:43 -0800 Message-ID: Subject: Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems To: Vishal Verma Cc: "Darrick J. Wong" , Slava Dubeyko , "lsf-pc@lists.linux-foundation.org" , "linux-nvdimm@lists.01.org" , "linux-block@vger.kernel.org" , Linux FS Devel , Viacheslav Dubeyko Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma wrote: > On 01/17, Andiry Xu wrote: > > > >> >> >> >> The pmem_do_bvec() read logic is like this: >> >> >> >> pmem_do_bvec() >> >> if (is_bad_pmem()) >> >> return -EIO; >> >> else >> >> memcpy_from_pmem(); >> >> >> >> Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this imply >> >> that even if a block is not in the badblock list, it still can be bad >> >> and causes MCE? Does the badblock list get changed during file system >> >> running? If that is the case, should the file system get a >> >> notification when it gets changed? If a block is good when I first >> >> read it, can I still trust it to be good for the second access? >> > >> > Yes, if a block is not in the badblocks list, it can still cause an >> > MCE. This is the latent error case I described above. For a simple read() >> > via the pmem driver, this will get handled by memcpy_mcsafe. For mmap, >> > an MCE is inevitable. >> > >> > Yes the badblocks list may change while a filesystem is running. The RFC >> > patches[1] I linked to add a notification for the filesystem when this >> > happens. >> > >> >> This is really bad and it makes file system implementation much more >> complicated. And badblock notification does not help very much, >> because any block can be bad potentially, no matter it is in badblock >> list or not. And file system has to perform checking for every read, >> using memcpy_mcsafe. This is disaster for file system like NOVA, which >> uses pointer de-reference to access data structures on pmem. Now if I >> want to read a field in an inode on pmem, I have to copy it to DRAM >> first and make sure memcpy_mcsafe() does not report anything wrong. > > You have a good point, and I don't know if I have an answer for this.. > Assuming a system with MCE recovery, maybe NOVA can add a mce handler > similar to nfit_handle_mce(), and handle errors as they happen, but I'm > being very hand-wavey here and don't know how much/how well that might > work.. > >> >> > No, if the media, for some reason, 'dvelops' a bad cell, a second >> > consecutive read does have a chance of being bad. Once a location has >> > been marked as bad, it will stay bad till the ACPI clear error 'DSM' has >> > been called to mark it as clean. >> > >> >> I wonder what happens to write in this case? If a block is bad but not >> reported in badblock list. Now I write to it without reading first. Do >> I clear the poison with the write? Or still require a ACPI DSM? > > With writes, my understanding is there is still a possibility that an > internal read-modify-write can happen, and cause a MCE (this is the same > as writing to a bad DRAM cell, which can also cause an MCE). You can't > really use the ACPI DSM preemptively because you don't know whether the > location was bad. The error flow will be something like write causes the > MCE, a badblock gets added (either through the mce handler or after the > next reboot), and the recovery path is now the same as a regular badblock. > This is different from my understanding. Right now write_pmem() in pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it clears poison and writes to pmem again. Seems to me writing to bad blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores? Thanks, Andiry >> >> > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html >> > >> >> Thank you for the patchset. I will look into it. >>