Re: ##freemail## Re: [PATCH] mm: hwpoison: deal with page cache THP

From: "HORIGUCHI NAOYA(堀口　直也)" <naoya.horiguchi@nec.com>
To: Yang Shi <shy828301@gmail.com>
Cc: "osalvador@suse.de" <osalvador@suse.de>,
	"hughd@google.com" <hughd@google.com>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: ##freemail## Re: [PATCH] mm: hwpoison: deal with page cache THP
Date: Wed, 8 Sep 2021 02:50:25 +0000	[thread overview]
Message-ID: <20210908025024.GA4117799@hori.linux.bs1.fc.nec.co.jp> (raw)
In-Reply-To: <CAHbLzko6yTTnBopzASaYeuD5bBsc_kdhQxQoVd-wqEhkzX1agg@mail.gmail.com>

On Tue, Sep 07, 2021 at 02:34:24PM -0700, Yang Shi wrote:
> On Fri, Sep 3, 2021 at 5:03 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
...
> > > > >
> > > > > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> > > > >
> > > > > I think clearing the dirty flag might be the easiest way? It seems
> > > > > unnecessary to notify the users when writing back since the most write
> > > > > back happens asynchronously. They should be notified when the page is
> > > > > accessed, e.g. read/write and page fault.
> > > > >
> > > > > I did some further investigation and got a clearer picture for
> > > > > writeback filesystem:
> > > > > 1. The page should be not written back: clearing dirty flag could
> > > > > prevent from writeback
> > > > > 2. The page should be not dropped (it shows as a clean page): the
> > > > > refcount pin from hwpoison could prevent from invalidating (called by
> > > > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > > > invalidation in DIO path (easy to deal with)
> > > > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > > > works as it is
> > > > > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > > > > and other paths: this is hard
> > > > >
> > > > > The hardest part is #4. Since there are too many paths in filesystems
> > > > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > > > top of my head:
> > > > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > > > but a lot work
> > > > > 2. Return NULL for poisoned page from page cache lookup, the most
> > > > > callsites check if NULL is returned, this should have least work I
> > > > > think. But the error handling in filesystems just return -ENOMEM, the
> > > > > error code will incur confusion to the users obviously.
> > > > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > > > but this will involve significant amount of code change as well since
> > > > > all the paths need check if the pointer is ERR or not.
> > > >
> > > > I think the approach #3 sounds good for now, it seems to me that these
> > > > statements are about general ways to handle error pages on all page cache
> > > > users, so then the amount of code changes is a big problem, but when
> > > > focusing on shmem/tmpfs, could the amount of changes be more handlable, or
> > > > still large?
> > >
> > > Yeah, I agree #3 makes more sense. Just return an error when finding
> > > out corrupted page. I think this is the right semantic.
> 
> I had a prototype patchset for #3, but now I have to consider stepping
> back to rethink which way we should go. I actually did a patchset for
> #1 too. By comparing the two patchsets and some deeper investigation
> during preparing the two patchsets, I realized #3 may not be the best
> way.
> 
> For #3 ERR_PTR will be returned so all the callers need to check the
> return value otherwise invalid pointer may be dereferenced, but not
> all callers really care about the content of the page, for example,
> partial truncate which just set the truncated range in one page to 0.
> So for such paths it needs additional modification if ERR_PTR is
> returned. And if the callers have their own way to handle the
> problematic pages we need to add a new FGP flag to tell FGP functions
> to return the pointer to the page.
> 
> But we don't need to do anything for such paths if we go with #1. For
> most paths (e.g. read/write) both ways need to check the validity of
> the page by checking ERR PTR or page flag. But a lot of extra code
> could be saved for the paths which don't care about the actual data
> for #1.

OK, so it's not clear which is better. Maybe we need discussion over
patches...

...

> > >
> > > >
> > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > or error pointer. This also should need significant code change since
> > > > > a lt callsites need to be contemplated.
> > > >
> > > > Could you explain a little more about which callers should use the flag?
> > >
> > > Just to solve the above invalidate/truncate problem and page fault
> > > doesn't expect an error pointer. But it seems the above
> > > invalidate/truncate paths don't matter. Page fault should be the only
> > > user since page fault may need unlock the page if poisoned page is
> > > returned.

Orignally PG_hwpoison is detected in page fault handler for file-backed pages
like below,

  static vm_fault_t __do_fault(struct vm_fault *vmf)
  ...
          ret = vma->vm_ops->fault(vmf);
          if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
                              VM_FAULT_DONE_COW)))
                  return ret;

          if (unlikely(PageHWPoison(vmf->page))) {
                  if (ret & VM_FAULT_LOCKED)
                          unlock_page(vmf->page);
                  put_page(vmf->page);
                  vmf->page = NULL;
                  return VM_FAULT_HWPOISON;
          }

so it seems to me that if a filesystem switches to the new scheme of keeping
error pages in page cache, ->fault() callback for the filesystem needs to be
aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
detects an error page (maybe by detecting pagecache_get_page() to return
PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
that page fault handler falls back to the generic PageHWPoison check above.

> >
> > It seems page fault check IS_ERR(page) then just return
> > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > to return head page then handle subpage or just return the page but
> > don't care the content of the page. They should ignore hwpoison. So I
> > guess we'd better to have a FGP flag for such cases.

At least currently thp are supposed to be split before error handling.
We could loosen this assumption to support hwpoison on a subpage of thp,
but I'm not sure whether that has some benefit.
And also this discussion makes me aware that we need to have a way to
prevent hwpoisoned pages from being used to thp collapse operation.

Thanks,
Naoya Horiguchi