From: Sean Christopherson <seanjc@google.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>,
	kvm@vger.kernel.org, David Airlie <airlied@linux.ie>,
	Antonio Caggiano <antonio.caggiano@collabora.com>,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	Gert Wollny <gert.wollny@collabora.com>,
	Huang Rui <ray.huang@amd.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Dmitry Osipenko <digetx@gmail.com>,
	kernel@collabora.com, virtualization@lists.linux-foundation.org,
	Trigger Huang <Trigger.Huang@gmail.com>
Subject: Re: [PATCH v1] drm/ttm: Refcount allocated tail pages
Date: Wed, 11 Jan 2023 17:05:05 +0000
Message-ID: <Y77sQZI0IfFVx7Jo@google.com>
In-Reply-To: <b1963713-4df6-956f-c16f-81a0cf1a978b@amd.com>

On Thu, Aug 18, 2022, Christian König wrote:
> Am 18.08.22 um 01:13 schrieb Dmitry Osipenko:
> > On 8/18/22 01:57, Dmitry Osipenko wrote:
> > > On 8/15/22 18:54, Dmitry Osipenko wrote:
> > > > On 8/15/22 17:57, Dmitry Osipenko wrote:
> > > > > On 8/15/22 16:53, Christian König wrote:
> > > > > > Am 15.08.22 um 15:45 schrieb Dmitry Osipenko:
> > > > > > > [SNIP]
> > > > > > > > Well that comment sounds like KVM is doing the right thing, so I'm
> > > > > > > > wondering what exactly is going on here.
> > > > > > > KVM actually doesn't hold the page reference, it takes the temporal
> > > > > > > reference during page fault and then drops the reference once page is
> > > > > > > mapped, IIUC. Is it still illegal for TTM? Or there is a possibility for
> > > > > > > a race condition here?
> > > > > > > 
> > > > > > Well the question is why does KVM grab the page reference in the first
> > > > > > place?
> > > > > > 
> > > > > > If that is to prevent the mapping from changing then yes that's illegal
> > > > > > and won't work. It can always happen that you grab the address, solve
> > > > > > the fault and then immediately fault again because the address you just
> > > > > > grabbed is invalidated.
> > > > > > 
> > > > > > If it's for some other reason than we should probably investigate if we
> > > > > > shouldn't stop doing this.

...

> > > > If we need to bump the refcount only for VM_MIXEDMAP and not for
> > > > VM_PFNMAP, then perhaps we could add a flag for that to the kvm_main
> > > > code that will denote to kvm_release_page_clean whether it needs to put
> > > > the page?
> > > The other variant that kind of works is to mark TTM pages reserved using
> > > SetPageReserved/ClearPageReserved, telling KVM not to mess with the page
> > > struct. But the potential consequences of doing this are unclear to me.
> > > 
> > > Christian, do you think we can do it?
> > Although, no. It also doesn't work with KVM without additional changes
> > to KVM.
> 
> Well my fundamental problem is that I can't fit together why KVM is grabing
> a page reference in the first place.

It's to work around a deficiency in KVM.

> See the idea of the page reference is that you have one reference is that
> you count the reference so that the memory is not reused while you access
> it, e.g. for I/O or mapping it into different address spaces etc...
> 
> But none of those use cases seem to apply to KVM. If I'm not totally
> mistaken in KVM you want to make sure that the address space mapping, e.g.
> the translation between virtual and physical address, don't change while you
> handle it, but grabbing a page reference is the completely wrong approach
> for that.

TL;DR: 100% agree, and we're working on fixing this in KVM, but we're still months
away from a full solution.

Yep.  KVM uses mmu_notifiers to react to mapping changes, with a few caveats that
we are (slowly) fixing, though those caveats are only tangentially related.

The deficiency in KVM is that KVM's internal APIs to translate a virtual address
to a physical address spit out only the resulting host PFN.  The details of _how_
that PFN was acquired are not captured.  Specifically, KVM loses track of whether
or not a PFN was acquired via gup() or follow_pte() (KVM is very permissive when
it comes to backing guest memory).

Because gup() gifts the caller a reference, that means KVM also loses track of
whether or not KVM holds a page refcount.  To avoid pinning guest memory, KVM does
quickly put the reference gifted by gup(), but because KVM doesn't _know_ if it
holds a reference, KVM uses a heuristic, which is essentially "is the PFN associated
with a 'normal' struct page?".

   /*
    * Returns a 'struct page' if the pfn is "valid" and backed by a refcounted
    * page, NULL otherwise.  Note, the list of refcounted PG_reserved page types
    * is likely incomplete, it has been compiled purely through people wanting to
    * back guest with a certain type of memory and encountering issues.
    */
   struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)

That heuristic also triggers if follow_pte() resolves to a PFN that is associated
with a "struct page", and so to avoid putting a reference it doesn't own, KVM does
the silly thing of manually getting a reference immediately after follow_pte().

And that in turn gets tripped up by non-refcounted tail pages because KVM sees a
normal, valid "struct page" and assumes it's refcounted.  To fudge around that
issue, KVM requires "struct page" memory to be refcounted.
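
To make the chain of events above concrete, here is a toy userspace model of the
heuristic.  It is a sketch only, not KVM code: "struct toy_page", the toy_*()
helpers and the "refcount == 0 means non-refcounted tail page" convention are
all assumptions made for illustration, and the real checks in KVM are more
involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page; NOT the real layout. */
struct toy_page {
	int refcount;	/* 0 models a non-refcounted tail page */
	bool reserved;	/* models a PG_reserved page KVM leaves alone */
};

/* How the (hypothetical) lookup resolved the PFN. */
enum toy_src { TOY_SRC_GUP, TOY_SRC_FOLLOW_PTE };

/* The heuristic: does the PFN have a "normal" struct page? */
static struct toy_page *toy_pfn_to_refcounted_page(struct toy_page *page)
{
	if (!page || page->reserved)
		return NULL;
	return page;
}

/*
 * Resolve a PFN.  Only the page comes back; how it was acquired is lost,
 * which is the deficiency described above.  Returns false when the page
 * cannot be used (the "struct page memory must be refcounted" rule).
 */
static bool toy_lookup(struct toy_page *page, enum toy_src src)
{
	struct toy_page *p;

	if (src == TOY_SRC_GUP) {
		page->refcount++;	/* gup() gifts the caller a reference */
		return true;
	}

	/* follow_pte() path: no reference is gifted. */
	p = toy_pfn_to_refcounted_page(page);
	if (!p)
		return true;	/* release will skip it too, nothing to balance */
	if (p->refcount == 0)
		return false;	/* looks normal, but is not refcounted: reject */
	p->refcount++;		/* manual get, so the later put is balanced */
	return true;
}

/* Release cannot know the source, so it relies on the same heuristic. */
static void toy_release(struct toy_page *page)
{
	struct toy_page *p = toy_pfn_to_refcounted_page(page);

	if (p)
		p->refcount--;
}
```

In this model a normal page ends up with a balanced get/put on both paths, a
reserved page is never touched, and a page with refcount 0 reached via the
follow_pte() path is refused outright, which mirrors the situation TTM's tail
pages run into.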

The long-term solution is to refactor KVM to precisely track whether or not KVM
holds a reference.  Patches have been proposed to do exactly that[1], but they
were put on hold due to the aforementioned caveats with mmu_notifiers.  The
caveats are that most flows where KVM plumbs a physical address into hardware
structures aren't wired up to KVM's mmu_notifier.
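
As a rough illustration of what "precisely track" means, here is a toy sketch
in which the lookup result records whether a reference is held, so the release
path no longer has to guess.  This is not the API from the series in [1]; the
sketch_* names and the result struct are hypothetical and exist only to show
the idea.

```c
#include <stdbool.h>

/* Toy stand-in for struct page; NOT the real layout. */
struct sketch_page {
	int refcount;	/* may be 0 for a non-refcounted tail page */
};

/* Lookup result that remembers whether the caller owns a reference. */
struct sketch_result {
	struct sketch_page *page;
	bool holds_ref;
};

/* gup()-like path: a reference is gifted and recorded. */
static struct sketch_result sketch_lookup_gup(struct sketch_page *page)
{
	struct sketch_result res = { .page = page, .holds_ref = true };

	page->refcount++;
	return res;
}

/* follow_pte()-like path: no reference is taken, and that is recorded. */
static struct sketch_result sketch_lookup_follow_pte(struct sketch_page *page)
{
	struct sketch_result res = { .page = page, .holds_ref = false };

	return res;
}

/* Release puts the reference only if one was actually acquired. */
static void sketch_release(struct sketch_result *res)
{
	if (res->holds_ref) {
		res->page->refcount--;
		res->holds_ref = false;
	}
}
```

With the bookkeeping carried in the result, a non-refcounted tail page reached
via the follow_pte()-like path is never touched, so there is no need for either
the heuristic or the manual get.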

KVM could support non-refcounted struct page memory without first fixing the
mmu_notifier issues, but I was (and still am) concerned that that would create an
even larger hole in KVM until the mmu_notifier issues are sorted out[2].
 
[1] https://lore.kernel.org/all/20211129034317.2964790-1-stevensd@google.com
[2] https://lore.kernel.org/all/Ydhq5aHW+JFo15UF@google.com
