From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=a5Vl=N2=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS
	autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 040DCC433EF
	for <linux-mm@archiver.kernel.org>; Sat,  4 Sep 2021 00:03:53 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 7BA07610A2
	for <linux-mm@archiver.kernel.org>; Sat,  4 Sep 2021 00:03:52 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 7BA07610A2
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id 93A616B0071; Fri,  3 Sep 2021 20:03:51 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8E9A66B0072; Fri,  3 Sep 2021 20:03:51 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 7B0156B0073; Fri,  3 Sep 2021 20:03:51 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0154.hostedemail.com [216.40.44.154])
	by kanga.kvack.org (Postfix) with ESMTP id 64F546B0071
	for <linux-mm@kvack.org>; Fri,  3 Sep 2021 20:03:51 -0400 (EDT)
Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id F2CBC184548A6
	for <linux-mm@kvack.org>; Sat,  4 Sep 2021 00:03:50 +0000 (UTC)
X-FDA: 78547942620.28.A568F94
Received: from mail-ej1-f42.google.com (mail-ej1-f42.google.com [209.85.218.42])
	by imf25.hostedemail.com (Postfix) with ESMTP id B09D7B000183
	for <linux-mm@kvack.org>; Sat,  4 Sep 2021 00:03:50 +0000 (UTC)
Received: by mail-ej1-f42.google.com with SMTP id lc21so1219166ejc.7
        for <linux-mm@kvack.org>; Fri, 03 Sep 2021 17:03:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc:content-transfer-encoding;
        bh=upNfbIalhw9PMkccWpEkSSJqvChk2fWlPLU+suGygDA=;
        b=ohrddgKlR4aTbsqA6TPutgP7+QqE3uCWs0kgdSOXfPKleEiXlxlhWIrCU6OBbEX+id
         z3cjUlVMHEXEd0gySnEe3HW9AlOcuDpXq1xpC6EqFNq3o6J9ybKevCelQuhZMNkzhLRu
         eL4JFIV1aSEjPMVrLxCgewVHkuEYMknmZBTXEcmuv01YY5nK03szdp2BuJKMEPlKWnvm
         pwzS03Jx45uC69efD3Qnx+1xkgaiJxSecnsISAxXGaaCQcl5zwaQhK7L2BHbsoZLcXoz
         NoGKqHmOsq7o+2YnEn1mQLeVPYz8EKeq3Sbhw0f931av8PFVIpI9K2cKGRCAgV21NBFZ
         7eQQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc:content-transfer-encoding;
        bh=upNfbIalhw9PMkccWpEkSSJqvChk2fWlPLU+suGygDA=;
        b=A33LHsRy5EBH0fORhTA4KhyDIHEQFzQwAq3TQ7Jdbq+oPXtPutdg8BXmpjrfmoM+nR
         IrP7w3OiVbFVWjjYlMpfUY+vve0Qd1guwfvAQUWe+GJvSbGBCr3PDvgXaoGkEYL183ji
         8oimu4WxP3znqH9fy7u8Nn8GgUPb2WMbT9Jp7VZi3zd0HIiCId6WFfsTqdJkKANHhtkL
         3BUjgbwsV5+4TkaSLFdgKOgqXd8AK+vQ152qxiP2KEU9tCsm7pLcffC3lI2OPZts4X5U
         LtTe/OahbtaLYgLzygXrNMiDaexQyt/Q7j7tlPf5bhxM0kz1OOkBB4nCZWkPFP8uPyVC
         ilcA==
X-Gm-Message-State: AOAM532nO+F0EhPFS1QDcfIYc8Tz/S3FcqPaPPN8hkZDYOnKq2s97Ghb
	V+0t1jen7Q00wAoXDXZ7BhzXKPvETtFu8AIItGE=
X-Google-Smtp-Source: ABdhPJwCx+1+JF4vqHBRYuF1KE5IeAzLdUSixwoR61+2Df9cK9XSarQTvf/bek1+MAbzXCRhzTBNKa+2U3yO+YKBUe4=
X-Received: by 2002:a17:906:3497:: with SMTP id g23mr1523819ejb.85.1630713829277;
 Fri, 03 Sep 2021 17:03:49 -0700 (PDT)
MIME-Version: 1.0
References: <20210824221322.7663-1-shy828301@gmail.com> <20210826061724.GA2864786@hori.linux.bs1.fc.nec.co.jp>
 <CAHbLzkpP9Wm-VhXhJb-j9_mRnRskzesRpZiyUDw+xyRrLykp4w@mail.gmail.com>
 <CAHbLzkpvR77xfs+ip1T8G09=ixz4Ko3E-6iKTEZkFCfGTxi6Aw@mail.gmail.com>
 <20210827035739.GA3247360@hori.linux.bs1.fc.nec.co.jp> <CAHbLzkr35uVHCJB_cr_fZdz3_FXXGam7dsrAn15j5BPfmfX-_A@mail.gmail.com>
 <CAHbLzko+XqFLx9=e2=E3rGRsLzcm32dZnpDf20gnUb2dAR0d_Q@mail.gmail.com>
 <20210902030728.GA1860112@hori.linux.bs1.fc.nec.co.jp> <CAHbLzkoBz2DwPXC5Nj+Kd6-pRaHgtZ-MGNPzYW4azN6Xpz3oyw@mail.gmail.com>
 <20210903115311.GA2477773@hori.linux.bs1.fc.nec.co.jp> <CAHbLzkoXaB_Hz_ZUw4BvR0PKV_ZV4F+hHekRS9RgUnxBYkMV_g@mail.gmail.com>
In-Reply-To: <CAHbLzkoXaB_Hz_ZUw4BvR0PKV_ZV4F+hHekRS9RgUnxBYkMV_g@mail.gmail.com>
From: Yang Shi <shy828301@gmail.com>
Date: Fri, 3 Sep 2021 17:03:37 -0700
Message-ID: <CAHbLzkpDK5qb1=VM8qeDdfjdqRyuL+ehwOK=8uZPYuR7J+xMEQ@mail.gmail.com>
Subject: Re: [PATCH] mm: hwpoison: deal with page cache THP
To: =?UTF-8?B?SE9SSUdVQ0hJIE5BT1lBKOWggOWPoyDnm7TkuZ8p?= <naoya.horiguchi@nec.com>
Cc: "osalvador@suse.de" <osalvador@suse.de>, "hughd@google.com" <hughd@google.com>, 
	"kirill.shutemov@linux.intel.com" <kirill.shutemov@linux.intel.com>, 
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>, "linux-mm@kvack.org" <linux-mm@kvack.org>, 
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Authentication-Results: imf25.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20210112 header.b=ohrddgKl;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf25.hostedemail.com: domain of shy828301@gmail.com designates 209.85.218.42 as permitted sender) smtp.mailfrom=shy828301@gmail.com
X-Rspamd-Server: rspam06
X-Rspamd-Queue-Id: B09D7B000183
X-Stat-Signature: 1ax1uxwfe1o4aj4cuosjfsscy634obp7
X-HE-Tag: 1630713830-322937
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Sep 3, 2021 at 4:53 AM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=A3=E3=80=
=80=E7=9B=B4=E4=B9=9F)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 11:32:49AM -0700, Yang Shi wrote:
> > > On Wed, Sep 1, 2021 at 8:07 PM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=A3=E3=
=80=80=E7=9B=B4=E4=B9=9F)
> > > <naoya.horiguchi@nec.com> wrote:
> > > >
> > > > On Mon, Aug 30, 2021 at 04:44:06PM -0700, Yang Shi wrote:
> > > > > On Thu, Aug 26, 2021 at 10:02 PM Yang Shi <shy828301@gmail.com> w=
rote:
> > > > > >
> > > > > > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=
=A3=E3=80=80=E7=9B=B4=E4=B9=9F)
> > > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > > >
> > > > > > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > > > > > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.c=
om> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(=E5=A0=
=80=E5=8F=A3=E3=80=80=E7=9B=B4=E4=B9=9F)
> > > > > > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrot=
e:
> > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > There was a discussion about another approach of keepin=
g error pages in page
> > > > > > > > > > cache for filesystem without backend storage.
> > > > > > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312=
310.7859@eggly.anvils/
> > > > > > > > > > This approach seems to me less complicated, but one con=
cern is that this
> > > > > > > > > > change affects user-visible behavior of memory errors. =
 Keeping error pages
> > > > > > > > > > in page cache means that the errors are persistent unti=
l next system reboot,
> > > > > > > > > > so we might need to define the way to clear the errors =
to continue to use
> > > > > > > > > > the error file.  Current implementation is just to send=
 SIGBUS to the
> > > > > > > > > > mapping processes (at least once), then forget about th=
e error, so there is
> > > > > > > > > > no such issue.
> > > > > > > > > >
> > > > > > > > > > Another thought of possible solution might be to send S=
IGBUS immediately when
> > > > > > > > > > a memory error happens on a shmem thp. We can find all =
the mapping processes
> > > > > > > > > > before splitting shmem thp, so send SIGBUS first, then =
split it and contain
> > > > > > > > > > the error page.  This is not elegant (giving up any opt=
ional actions) but
> > > > > > > > > > > anyway we can avoid the silent data loss.
> > > > > > > > >
> > > > > > > > > Thanks a lot. I apologize I didn't notice you already pos=
ted a similar
> > > > > > > > > patch before.
> > > > > > > > >
> > > > > > > > > Yes, I think I focused on the soft offline part too much =
and missed
> > > > > > > > > the uncorrected error part and I admit I did underestimat=
e the
> > > > > > > > > problem.
> > > > > > > > >
> > > > > > > > > I think Hugh's suggestion makes sense if we treat tmpfs a=
s a regular
> > > > > > > > > filesystem (just memory backed). AFAIK, some filesystem, =
e.g. btrfs,
> > > > > > > > > may do checksum after reading from storage block then ret=
urn an error
> > > > > > > > > if checksum is not right since it may indicate hardware f=
ailure on
> > > > > > > > > disk. Then the syscalls or page fault return error or SIG=
BUS.
> > > > > > > > >
> > > > > > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just r=
eturn error
> > > > > > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. =
It does align
> > > > > > > > > with the behavior of other filesystems. It is definitely =
applications'
> > > > > > > > > responsibility to check the return value of read/write sy=
scalls.
> > > > > > > >
> > > > > > > > BTW, IIUC the dirty regular page cache (storage backed) wou=
ld be left
> > > > > > > > in the page cache too, the clean page cache would be trunca=
ted since
> > > > > > > > they can be just reread from storage, right?
> > > > > > >
> > > > > > > A dirty page cache is also removed on error (me_pagecache_dir=
ty() falls
> > > > > > > through me_pagecache_clean(), then truncate_error_page() is c=
alled).
> > > > > > > The main purpose of this is to separate off the error page fr=
om exising
> > > > > > > data structures to minimize the risk of later accesses (maybe=
 by race or bug).
> > > > > > > But we can change this behavior for specific file systems by =
updating
> > > > > > > error_remove_page() callbacks in address_space_operation.
> > > > > >
> > > > > > Yeah, if fs's error_remove_page() is defined. It seems the file=
systems
> > > > > > which have error_remove_page() defined just use generic_remove_=
page()
> > > > > > except hugetlbfs. And the generic implementation just clears th=
e dirty
> > > > > > flag and removes the page from page cache.
> > > > > >
> > > > > > If error_remove_page() is not defined, the page would stay in p=
age
> > > > > > cache since invalidate_inode_page() can't remove dirty page.
> > > > > >
> > > > > > >
> > > > > > > Honestly, it seems to me that how dirty data is lost does not=
 depend on
> > > > > > > file system, and I'm still not sure that this is really a rig=
ht approach
> > > > > > > for the current issue.
> > > > > >
> > > > > > IMHO the biggest problem is that applications may see
> > > > > > obsolete/inconsistent data silently, right? Actually keeping th=
e
> > > > > > corrupted page in page cache should be able to notify applicati=
ons
> > > > > > that they are accessing inconsistent data.
> > > > >
> > > > > The removal from page cache behavior may be much worse for shmem/=
tmpfs
> > > > > since it actually removes the whole data blocks for the file. The=
 user
> > > > > will get all zero if the corrupted blocks are read without any
> > > > > notification.
> > > > >
> > > > > The more I stared at the code and had tests done, the more I thin=
k we
> > > > > should keep the corrupted page in page cache and notify the users=
.
> > > > >
> > > > > It seems easier for readonly filesystem. Just remove the page fro=
m
> > > > > page cache since it always could read data from disk. This is als=
o the
> > > > > current behavior.
> > > > >
> > > > > For shmem, the page could be kept in page cache with dirty flag s=
et
> > > > > since it won't be written back.
> > > > >
> > > > > For regular filesystems that could do writeback, things are a lit=
tle
> > > > > bit more complicated since we need to prevent from writing back b=
y
> > > > > clearing dirty flag. Other than writeback we also need to disting=
uish
> > > > > cache drop from truncation/hole punch/unlink. We don't want cache=
 drop
> > > > > (e.g. echo 1 > /proc/sys/vm/drop_caches) drop corrupted page. But
> > > > > truncate/hole punch/unlink should be fine to remove the page sinc=
e the
> > > > > underlying data blocks will be gone too.
> > > > >
> > > > > Thanks to the refcount pin done by memory failure, cache drop can=
't
> > > > > drop the page since it checks if the refcount is expected or not.
> > > > > Truncate/hole punch/unlink doesn't check refcount so they could
> > > > > proceed. But inode evict (slab shrinking path) may call truncate,=
 so
> > > > > the corrupted page may still be removed from page cache when the
> > > > > underlying data blocks still exist IIUC. There might be other pat=
hs in
> > > > > filesystems to have page cache truncate but the underlying data b=
locks
> > > > > are still present.
> > > > >
> > > > > The read/write syscalls also need check hwpoisoned flag. I'm not =
sure
> > > > > if I miss other syscalls or not.
> > > >
> > > > Thanks for analyzing, I think that filesystems supporting writeback=
 might
> > > > call filemap_write_and_wait() when evicting inode, which can notify=
 error
> > > > by AS_EIO flag in struct address_space.  But it does nothing for no=
-writeback
> > > > filesystems like shmem.  So we need better error report for them.
> > >
> > > AFAIK the address_space error just works for fsync. Anyway I could be=
 wrong.
> > >
> > > I think clearing the dirty flag might be the easiest way? It seems
> > > unnecessary to notify the users when writing back since the most writ=
e
> > > back happens asynchronously. They should be notified when the page is
> > > accessed, e.g. read/write and page fault.
> > >
> > > I did some further investigation and got a clearer picture for
> > > writeback filesystem:
> > > 1. The page should be not written back: clearing dirty flag could
> > > prevent from writeback
> > > 2. The page should be not dropped (it shows as a clean page): the
> > > refcount pin from hwpoison could prevent from invalidating (called by
> > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > invalidation in DIO path (easy to deal with)
> > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > works as it is
> > > 4. Notify users when the page is accessed, e.g. read/write, page faul=
t
> > > and other paths: this is hard
> > >
> > > The hardest part is #4. Since there are too many paths in filesystems
> > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > top of my head:
> > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > but a lot work
> > > 2. Return NULL for poisoned page from page cache lookup, the most
> > > callsites check if NULL is returned, this should have least work I
> > > think. But the error handling in filesystems just return -ENOMEM, the
> > > error code will incur confusion to the users obviously.
> > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > but this will involve significant amount of code change as well since
> > > all the paths need check if the pointer is ERR or not.
> >
> > I think the approach #3 sounds good for now, it seems to me that these
> > statements are about general ways to handle error pages on all page cac=
he
> > users, so then the amount of code changes is a big problem, but when
> > focusing on shmem/tmpfs, could the amount of changes be more handlable,=
 or
> > still large?
>
> Yeah, I agree #3 makes more sense. Just return an error when finding
> out corrupted page. I think this is the right semantic.
>
> The amount of work for shmem should not be big.
>
> For other writeback filesystems we could make the filesystems check if
> the page is error pointer or not one at a time. Once all filesystems
> have been done, we change hwpoison code to keep corrupted page in page
> cache so that we don't have to do all the things altogether in one
> gigantic patchset. This approach should be more manageable and review
> friendly IMHO.
>
> >
> > > 4. #3 may break truncate, so maybe we need to convert the hwpoison
> > > page to an exceptional entry? I'm not sure if I'm over-engineering.
> >
> > I think of taking refcount of inode associated with error pages to prev=
ent
> > the truncation via inode eviction, which might be worth considering.
>
> The memory_failure already does it by get_hwpoison_page(), right?
>
> >
> > We somehow need keep error information on memory, which should be kept =
from
> > being removed, maybe except system reboot and explicit removal operatio=
n
> > from the file's user.  So it might be helpful if we can deal with
> > inode_evict from slab shrinking and from explicit calls from users.
>
> The refcount pin from get_hwpoison_page() could prevent the page from
> being released by inode eviction from slab shrinking path since
> invalidate_mapping_pages() is used which does need to freeze refcount.
>
> In the beginning I thought the problem is if we return error pointer
> for truncate/invalidate, how can we convert it to regular page pointer
> since truncate still need remove the page from page cache. But the
> further investigation shows they call find_get_entry() which just
> iterates xarray to return page. So it seems we don't need to do
> anything. It doesn't matter to return poisoned page for
> invalidate/truncate paths.
>
> There are some other callers of find_get_entry() may need attention. A
> quick look shows they could backoff sooner if poisoned page is found.
>
> >
> > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > or error pointer. This also should need significant code change since
> > > a lot of callsites need to be contemplated.
> >
> > Could you explain a little more about which callers should use the flag=
?
>
> Just to solve the above invalidate/truncate problem and page fault
> doesn't expect an error pointer. But it seems the above
> invalidate/truncate paths don't matter. Page fault should be the only
> user since page fault may need unlock the page if poisoned page is
> returned.

It seems page fault can just check IS_ERR(page) and then return
VM_FAULT_HWPOISON. But I found a couple of places in shmem that either
want the head page and then handle the subpage themselves, or just want
the page and don't care about its content. They should ignore hwpoison,
so I guess we'd better have an FGP flag for such cases.

>
> >
> > >
> > > Most filesystems use generic page fault handler (filemap_fault), just
> > > very few have their own implementation, e.g. ceph, DAX, etc.
> > >
> > > I don't know which way we should go, should need more advice.
> >
> > I think that DAX doesn't use page cache, so I think that it anyway need=
s
> > special handling as a separate issue.
>
> Yes, DAX doesn't use page cache, but the data blocks of filesystem
> actually are memory. It seems PMEM driver already has some mechanism
> to handle poisoned page. I don't think we need handle it right now.
>
> >
> > Thanks,
> > Naoya Horiguchi
> >
> > >
> > > >
> > > > >
> > > > > I'm not a filesystem expert so I'm not sure if I'm missing someth=
ing
> > > > > > else or not. But I suppose most cases should be covered.
> > > > >
> > > > > I'd like to start with shmem/tmpfs since it is relatively easier =
and
> > > > > this also could unblock shmem THP hwpoison support. Any comment i=
s
> > > > > welcome.
> > > >
> > > > I think that keeping corrupted pages in page cache can be a better =
solution.
> > > > So if you plan to write a patchset, I'm glad to review/test it.  St=
arting with
> > > > shmem/tmpfs sounds nice to me.
> > >
> > > Thank you very much. Yes, I'm working on it. And I already had some
> > > patches work (just check if page is hwpoisoned in every path, #1
> > > approach listed above). I could clean up the patches and send out to
> > > gather some feedback.