From: Peter Xu <peterx@redhat.com>
To: Jue Wang <juew@google.com>
Cc: James Houghton <jthoughton@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>, Linux MM <linux-mm@kvack.org>,
	Linuxkselftest <linux-kselftest@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/3] userfaultfd/selftests: fix feature support detection
Date: Fri, 24 Sep 2021 16:09:37 -0400
Message-ID: <YU4wgSmStmkxxSt5@t490s>
In-Reply-To: <CAPcxDJ6E3c2gcnJ8pDeQidf-yHDP7S=Knah_b3hy+FL1kOObqA@mail.gmail.com>

On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:

[...]

> > > Could I know what's the workaround?  Normally if the workaround works solidly,
> > > then there's less need to introduce a kernel interface for that.  Otherwise I'm
> > > glad to look into such a formal proposal.
> >
> > The workaround is, for the region that you want to zap, run through
> > this sequence of syscalls: munmap, mmap, and re-register with
> > userfaultfd if it was registered before. If we're using tmpfs, we can
> > use madvise(DONTNEED) instead, but this is kind of an abuse of the
> > API. I don't think there's a guarantee that the PTEs will get zapped,
> > but currently they will always get zapped if we're using tmpfs. I
> > really like the idea of adding a new madvise() mode that is guaranteed
> > to zap the PTEs.

I see.
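To make sure I read the two workarounds correctly, here is a minimal sketch of them. It assumes a shared, tmpfs-backed mapping created with memfd_create(). The helper names (map_shmem, zap_dontneed, zap_remap) are hypothetical and are not from the selftests or from anyone's VMM; the registration mode is left to the caller because the thread does not say which one is in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: create a shared, tmpfs-backed mapping of len bytes. */
static void *map_shmem(size_t len, int *fd_out)
{
	int fd = memfd_create("zap-demo", 0);
	void *p;

	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return MAP_FAILED;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return MAP_FAILED;
	}
	*fd_out = fd;
	return p;
}

/*
 * tmpfs case: ask the kernel to drop the PTEs in place.  The page cache
 * still holds the data.  As said above, the zap is current behaviour,
 * not a documented guarantee.
 */
static int zap_dontneed(void *addr, size_t len)
{
	return madvise(addr, len, MADV_DONTNEED);
}

/*
 * Generic case: munmap, mmap the same file range at the same address,
 * then re-register with userfaultfd if uffd >= 0.  Another thread could
 * map something into the range between the two calls; a real caller
 * would have to rule that out.
 */
static int zap_remap(void *addr, size_t len, int fd, off_t off,
		     int uffd, unsigned long long mode)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = mode,
	};

	assert(addr != NULL && len > 0);
	if (munmap(addr, len) != 0)
		return -1;
	if (mmap(addr, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
		 fd, off) == MAP_FAILED)
		return -1;
	if (uffd < 0)
		return 0;	/* range was not registered before */
	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}
```

In both cases the file contents survive, so only the page table entries for the range are gone and the next access has to fault again.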

> >
> > >
> > > > It's also useful for memory poisoning, I think, if the host
> > > > decides some page(s) are "bad" and wants to intercept any future guest
> > > > accesses to those page(s).
> > >
> > > Curious: doesn't hwpoison information come from MCEs, or say, the host kernel
> > > side?  Then I thought the host kernel would have full control of it already.
> > >
> > > Or is there some other way for the host to detect that some pages are going
> > > bad, so that userspace can do something before the kernel handles those
> > > exceptions?
> >
> > Here's a general idea of how we would like to use userfaultfd to support MPR:
> >
> > If a guest accesses a poisoned page for the first time, we will get an
> > MCE through the host kernel and send an MCE to the guest. The guest
> > will now no longer be able to access this page, and we have to enforce
> > this. After a live migration, the pages that were poisoned before
> > probably won't still be poisoned (from the host's perspective), so we
> > can't rely on the host kernel's MCE handling path. This is where
> > userfaultfd and this new madvise mode come in: we can just
> > madvise(MADV_ZAP) the poisoned page(s) on the target during a
> > migration. Now all accesses will be routed to the VMM and we can
> > inject an MCE. We don't *need* the new madvise mode, as we can also
> > use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
> > would be more convenient if we didn't have to use fallocate.
> >
> > Jue Wang can provide more context here, so I've cc'd him. There may be
> > some things I'm wrong about, so Jue feel free to correct me.
> >
> James is right.
> 
> The page is marked PG_hwpoison in the source VM host's kernel. The need
> to intercept guest accesses to it exists on the target VM host, where
> the same physical page is no longer poisoned.
> 
> On the target host, the hypervisor needs to intercept all guest accesses
> to pages poisoned from the source VM host.
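For reference, the fallocate(PUNCH_HOLE) alternative mentioned above could look roughly like the sketch below. It assumes a memfd (tmpfs) backing file; zap_punch_hole and read_after_punch are hypothetical names used only for illustration. Unlike madvise(DONTNEED) on a shared tmpfs mapping, punching a hole discards the page contents as well as the mappings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Drop the backing pages for [off, off + len) of a tmpfs or hugetlbfs
 * file without changing the file size.  Mappings of that range are
 * zapped as well.
 */
static int zap_punch_hole(int fd, off_t off, off_t len)
{
	assert(len > 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}

/*
 * Demo: write one byte through a shared memfd mapping, punch the page
 * out, and return what a later read sees (-1 on setup failure).
 */
static int read_after_punch(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int fd = memfd_create("punch-demo", 0);
	int seen = -1;
	char *p;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0)
		goto out_close;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out_close;

	p[0] = 'x';
	if (zap_punch_hole(fd, 0, (off_t)len) == 0)
		seen = p[0];	/* faults again; the hole reads back as zero */

	munmap(p, len);
out_close:
	close(fd);
	return seen;
}
```

Without userfaultfd the re-fault simply sees a zero page. With the range registered for missing faults, my understanding is that the same access would be reported to the VMM instead, which is where it could inject the MCE.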

Thanks for this information, James, Jue, Axel.  I'm not familiar with memory
failures yet, so please bear with me on a few naive questions.

So now I understand that hw-poisoned pages on the src host will not necessarily
be hw-poisoned on the dest host too, but I may have missed the reason why the
dest host needs to trap accesses to them by removing the pgtable entries.

AFAIU after pages get hw-poisoned on src, and after the vmm injects MCEs into the
guest, the guest shouldn't be accessing these pages any more, am I right?  Then
after migration completes, IIUC the guest shouldn't be accessing these pages
either.  My current understanding is, instead of trapping these pages on dest, we
should just (somehow, which I have no real idea...) un-hw-poison these pages
after migration, because these pages are very likely normal pages there.  When
real hw-poisoned pages are reported on the dst host, we should inject new MCEs
into the guest for that other set of pages.

Could you tell me what I missed?

Thanks,

-- 
Peter Xu

