linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yang Shi <shy828301@gmail.com>
To: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
Cc: "osalvador@suse.de" <osalvador@suse.de>,
	"hughd@google.com" <hughd@google.com>,
	 "kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	 "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: ##freemail## Re: [PATCH] mm: hwpoison: deal with page cache THP
Date: Tue, 7 Sep 2021 20:14:26 -0700	[thread overview]
Message-ID: <CAHbLzkptCkXdVdG9mr1BU8TEyEpW57go_Pj1uVopkxORqYi8aw@mail.gmail.com> (raw)
In-Reply-To: <20210908025024.GA4117799@hori.linux.bs1.fc.nec.co.jp>

On Tue, Sep 7, 2021 at 7:50 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Sep 07, 2021 at 02:34:24PM -0700, Yang Shi wrote:
> > On Fri, Sep 3, 2021 at 5:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> ...
> > > > > >
> > > > > > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> > > > > >
> > > > > > I think clearing the dirty flag might be the easiest way? It seems
> > > > > > unnecessary to notify the users when writing back since the most write
> > > > > > back happens asynchronously. They should be notified when the page is
> > > > > > accessed, e.g. read/write and page fault.
> > > > > >
> > > > > > I did some further investigation and got a clearer picture for
> > > > > > writeback filesystem:
> > > > > > 1. The page should be not written back: clearing dirty flag could
> > > > > > prevent from writeback
> > > > > > 2. The page should be not dropped (it shows as a clean page): the
> > > > > > refcount pin from hwpoison could prevent from invalidating (called by
> > > > > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > > > > invalidation in DIO path (easy to deal with)
> > > > > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > > > > works as it is
> > > > > > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > > > > > and other paths: this is hard
> > > > > >
> > > > > > The hardest part is #4. Since there are too many paths in filesystems
> > > > > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > > > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > > > > top of my head:
> > > > > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > > > > but a lot work
> > > > > > 2. Return NULL for poisoned page from page cache lookup, the most
> > > > > > callsites check if NULL is returned, this should have least work I
> > > > > > think. But the error handling in filesystems just return -ENOMEM, the
> > > > > > error code will incur confusion to the users obviously.
> > > > > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > > > > but this will involve significant amount of code change as well since
> > > > > > all the paths need check if the pointer is ERR or not.
> > > > >
> > > > > I think the approach #3 sounds good for now, it seems to me that these
> > > > > statements are about general ways to handle error pages on all page cache
> > > > > users, so then the amount of code changes is a big problem, but when
> > > > > focusing on shmem/tmpfs, could the amount of changes be more handlable, or
> > > > > still large?
> > > >
> > > > Yeah, I agree #3 makes more sense. Just return an error when finding
> > > > out corrupted page. I think this is the right semantic.
> >
> > I had a prototype patchset for #3, but now I have to consider stepping
> > back to rethink which way we should go. I actually did a patchset for
> > #1 too. By comparing the two patchsets and some deeper investigation
> > during preparing the two patchsets, I realized #3 may not be the best
> > way.
> >
> > For #3 ERR_PTR will be returned so all the callers need to check the
> > return value otherwise invalid pointer may be dereferenced, but not
> > all callers really care about the content of the page, for example,
> > partial truncate which just set the truncated range in one page to 0.
> > So for such paths it needs additional modification if ERR_PTR is
> > returned. And if the callers have their own way to handle the
> > problematic pages we need to add a new FGP flag to tell FGP functions
> > to return the pointer to the page.
> >
> > But we don't need to do anything for such paths if we go with #1. For
> > most paths (e.g. read/write) both ways need to check the validity of
> > the page by checking ERR PTR or page flag. But a lot of extra code
> > could be saved for the paths which don't care about the actual data
> > for #1.
>
> OK, so it's not clear which is better. Maybe we need discussion over
> patches...

Yeah, the conclusion is #1 would need fewer changes than #3.

>
> ...
>
> > > >
> > > > >
> > > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > > or error pointer. This also should need significant code change since
> > > > > > a lt callsites need to be contemplated.
> > > > >
> > > > > Could you explain a little more about which callers should use the flag?
> > > >
> > > > Just to solve the above invalidate/truncate problem and page fault
> > > > doesn't expect an error pointer. But it seems the above
> > > > invalidate/truncate paths don't matter. Page fault should be the only
> > > > user since page fault may need unlock the page if poisoned page is
> > > > returned.
>
> Orignally PG_hwpoison is detected in page fault handler for file-backed pages
> like below,
>
>   static vm_fault_t __do_fault(struct vm_fault *vmf)
>   ...
>           ret = vma->vm_ops->fault(vmf);
>           if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>                               VM_FAULT_DONE_COW)))
>                   return ret;
>
>           if (unlikely(PageHWPoison(vmf->page))) {
>                   if (ret & VM_FAULT_LOCKED)
>                           unlock_page(vmf->page);
>                   put_page(vmf->page);
>                   vmf->page = NULL;
>                   return VM_FAULT_HWPOISON;
>           }
>
> so it seems to me that if a filesystem switches to the new scheme of keeping
> error pages in page cache, ->fault() callback for the filesystem needs to be
> aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
> detects an error page (maybe by detecting pagecache_get_page() to return
> PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
> with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
> that page fault handler falls back to the generic PageHWPoison check above.

Yes, exactly. If we return ERR_PTR we need modify the above code
accordingly. But returning the page, no change is required.

>
> > >
> > > It seems page fault check IS_ERR(page) then just return
> > > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > > to return head page then handle subpage or just return the page but
> > > don't care the content of the page. They should ignore hwpoison. So I
> > > guess we'd better to have a FGP flag for such cases.
>
> At least currently thp are supposed to be split before error handling.

Not for file THP.

> We could loosen this assumption to support hwpoison on a subpage of thp,
> but I'm not sure whether that has some benefit.

I don't quite get your point. Currently the hwpoison flag is set on
specific subpage instead of head page.

> And also this discussion makes me aware that we need to have a way to
> prevent hwpoisoned pages from being used to thp collapse operation.

Actually not. The refcount from hwpoison could prevent it from
collapsing. But if we return ERR_PTR (#3) we need extra code in
khugepaged to handle it.

So I realized #1 would require fewer changes. And leaving the page
state check to callers seem reasonable since the callers usually check
other states, e.g. uptodate.


>
> Thanks,
> Naoya Horiguchi


  reply	other threads:[~2021-09-08  3:14 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24 22:13 [PATCH] mm: hwpoison: deal with page cache THP Yang Shi
2021-08-26  6:17 ` HORIGUCHI NAOYA(堀口 直也)
2021-08-26 20:03   ` Yang Shi
2021-08-26 22:03     ` Yang Shi
2021-08-27  3:57       ` HORIGUCHI NAOYA(堀口 直也)
2021-08-27  5:02         ` Yang Shi
2021-08-30 23:44           ` Yang Shi
2021-09-02  3:07             ` HORIGUCHI NAOYA(堀口 直也)
2021-09-02 18:32               ` Yang Shi
2021-09-03 11:53                 ` HORIGUCHI NAOYA(堀口 直也)
2021-09-03 18:01                   ` Yang Shi
2021-09-04  0:03                     ` Yang Shi
2021-09-07 21:34                       ` Yang Shi
2021-09-08  2:50                         ` ##freemail## " HORIGUCHI NAOYA(堀口 直也)
2021-09-08  3:14                           ` Yang Shi [this message]
2021-09-08  4:25                             ` HORIGUCHI NAOYA(堀口 直也)
2021-09-09 23:07                               ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHbLzkptCkXdVdG9mr1BU8TEyEpW57go_Pj1uVopkxORqYi8aw@mail.gmail.com \
    --to=shy828301@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=osalvador@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).