Re: ##freemail## Re: [PATCH] mm: hwpoison: deal with page cache THP

From: Yang Shi <shy828301@gmail.com>
To: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
Cc: "osalvador@suse.de" <osalvador@suse.de>,
	"hughd@google.com" <hughd@google.com>,
	 "kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	 "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: ##freemail## Re: [PATCH] mm: hwpoison: deal with page cache THP
Date: Tue, 7 Sep 2021 20:14:26 -0700	[thread overview]
Message-ID: <CAHbLzkptCkXdVdG9mr1BU8TEyEpW57go_Pj1uVopkxORqYi8aw@mail.gmail.com> (raw)
In-Reply-To: <20210908025024.GA4117799@hori.linux.bs1.fc.nec.co.jp>

On Tue, Sep 7, 2021 at 7:50 PM HORIGUCHI NAOYA(堀口　直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Sep 07, 2021 at 02:34:24PM -0700, Yang Shi wrote:
> > On Fri, Sep 3, 2021 at 5:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> ...
> > > > > >
> > > > > > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> > > > > >
> > > > > > I think clearing the dirty flag might be the easiest way? It seems
> > > > > > unnecessary to notify the users when writing back since the most write
> > > > > > back happens asynchronously. They should be notified when the page is
> > > > > > accessed, e.g. read/write and page fault.
> > > > > >
> > > > > > I did some further investigation and got a clearer picture for
> > > > > > writeback filesystem:
> > > > > > 1. The page should be not written back: clearing dirty flag could
> > > > > > prevent from writeback
> > > > > > 2. The page should be not dropped (it shows as a clean page): the
> > > > > > refcount pin from hwpoison could prevent from invalidating (called by
> > > > > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > > > > invalidation in DIO path (easy to deal with)
> > > > > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > > > > works as it is
> > > > > > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > > > > > and other paths: this is hard
> > > > > >
> > > > > > The hardest part is #4. Since there are too many paths in filesystems
> > > > > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > > > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > > > > top of my head:
> > > > > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > > > > but a lot work
> > > > > > 2. Return NULL for poisoned page from page cache lookup, the most
> > > > > > callsites check if NULL is returned, this should have least work I
> > > > > > think. But the error handling in filesystems just return -ENOMEM, the
> > > > > > error code will incur confusion to the users obviously.
> > > > > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > > > > but this will involve significant amount of code change as well since
> > > > > > all the paths need check if the pointer is ERR or not.
> > > > >
> > > > > I think the approach #3 sounds good for now, it seems to me that these
> > > > > statements are about general ways to handle error pages on all page cache
> > > > > users, so then the amount of code changes is a big problem, but when
> > > > > focusing on shmem/tmpfs, could the amount of changes be more handlable, or
> > > > > still large?
> > > >
> > > > Yeah, I agree #3 makes more sense. Just return an error when finding
> > > > out corrupted page. I think this is the right semantic.
> >
> > I had a prototype patchset for #3, but now I have to consider stepping
> > back to rethink which way we should go. I actually did a patchset for
> > #1 too. By comparing the two patchsets and some deeper investigation
> > during preparing the two patchsets, I realized #3 may not be the best
> > way.
> >
> > For #3 ERR_PTR will be returned so all the callers need to check the
> > return value otherwise invalid pointer may be dereferenced, but not
> > all callers really care about the content of the page, for example,
> > partial truncate which just set the truncated range in one page to 0.
> > So for such paths it needs additional modification if ERR_PTR is
> > returned. And if the callers have their own way to handle the
> > problematic pages we need to add a new FGP flag to tell FGP functions
> > to return the pointer to the page.
> >
> > But we don't need to do anything for such paths if we go with #1. For
> > most paths (e.g. read/write) both ways need to check the validity of
> > the page by checking ERR PTR or page flag. But a lot of extra code
> > could be saved for the paths which don't care about the actual data
> > for #1.
>
> OK, so it's not clear which is better. Maybe we need discussion over
> patches...

Yeah, the conclusion is #1 would need fewer changes than #3.

>
> ...
>
> > > >
> > > > >
> > > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > > or error pointer. This also should need significant code change since
> > > > > > a lt callsites need to be contemplated.
> > > > >
> > > > > Could you explain a little more about which callers should use the flag?
> > > >
> > > > Just to solve the above invalidate/truncate problem and page fault
> > > > doesn't expect an error pointer. But it seems the above
> > > > invalidate/truncate paths don't matter. Page fault should be the only
> > > > user since page fault may need unlock the page if poisoned page is
> > > > returned.
>
> Orignally PG_hwpoison is detected in page fault handler for file-backed pages
> like below,
>
>   static vm_fault_t __do_fault(struct vm_fault *vmf)
>   ...
>           ret = vma->vm_ops->fault(vmf);
>           if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>                               VM_FAULT_DONE_COW)))
>                   return ret;
>
>           if (unlikely(PageHWPoison(vmf->page))) {
>                   if (ret & VM_FAULT_LOCKED)
>                           unlock_page(vmf->page);
>                   put_page(vmf->page);
>                   vmf->page = NULL;
>                   return VM_FAULT_HWPOISON;
>           }
>
> so it seems to me that if a filesystem switches to the new scheme of keeping
> error pages in page cache, ->fault() callback for the filesystem needs to be
> aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
> detects an error page (maybe by detecting pagecache_get_page() to return
> PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
> with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
> that page fault handler falls back to the generic PageHWPoison check above.

Yes, exactly. If we return ERR_PTR we need modify the above code
accordingly. But returning the page, no change is required.

>
> > >
> > > It seems page fault check IS_ERR(page) then just return
> > > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > > to return head page then handle subpage or just return the page but
> > > don't care the content of the page. They should ignore hwpoison. So I
> > > guess we'd better to have a FGP flag for such cases.
>
> At least currently thp are supposed to be split before error handling.

Not for file THP.

> We could loosen this assumption to support hwpoison on a subpage of thp,
> but I'm not sure whether that has some benefit.

I don't quite get your point. Currently the hwpoison flag is set on
specific subpage instead of head page.

> And also this discussion makes me aware that we need to have a way to
> prevent hwpoisoned pages from being used to thp collapse operation.

Actually not. The refcount from hwpoison could prevent it from
collapsing. But if we return ERR_PTR (#3) we need extra code in
khugepaged to handle it.

So I realized #1 would require fewer changes. And leaving the page
state check to callers seem reasonable since the callers usually check
other states, e.g. uptodate.

>
> Thanks,
> Naoya Horiguchi